
Error Handling in Agent Systems: Exception Hierarchies, Partial Results, and Exit Reasons

Agent systems fail in ways that traditional software does not. An LLM might return an unparseable response. A tool call might time out. An agent might enter an infinite ReAct loop. A human reviewer might walk away from an approval gate. A task might succeed but produce output that a downstream task cannot use.

The interesting problem is not preventing these failures — some are inherent to non-deterministic systems. The interesting problem is giving operators enough information to handle them gracefully: what failed, what succeeded before the failure, and what the system’s terminal state actually is.

AgentEnsemble uses a hierarchy of unchecked exceptions rooted at AgentEnsembleException. Every exception the framework throws extends this base, so you can catch everything with a single catch block or handle specific cases individually.

AgentEnsembleException (base)
  ValidationException            -- invalid configuration at build/run time
  TaskExecutionException         -- a task failed during execution
  AgentExecutionException        -- an LLM call failed
  MaxIterationsExceededException -- agent exceeded its tool-call limit
  PromptTemplateException        -- unresolved template variables
  ToolExecutionException         -- a tool call failed
  ConstraintViolationException   -- required workers were not called
  GuardrailViolationException    -- a guardrail blocked execution

The hierarchy matters because different failure types require different responses. A ValidationException means your configuration is wrong — no LLM was ever called, and the fix is in the code. A TaskExecutionException means the pipeline started but a task failed — partial results may be available. A MaxIterationsExceededException means an agent got stuck in a tool-calling loop — the fix might be fewer tools or a higher iteration limit.
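That routing logic can be sketched as a single dispatch function. The exception classes below are local stand-ins mirroring the hierarchy so the sketch compiles on its own; in real code you would catch the framework's types directly, and the action names are illustrative:

```java
// Local stand-ins for the framework's hierarchy, so this sketch is self-contained.
class AgentEnsembleException extends RuntimeException {
    AgentEnsembleException(String msg) { super(msg); }
}
class ValidationException extends AgentEnsembleException {
    ValidationException(String msg) { super(msg); }
}
class MaxIterationsExceededException extends AgentEnsembleException {
    MaxIterationsExceededException(String msg) { super(msg); }
}

public class FailureRouter {
    // Map a failure type to the response it calls for.
    static String route(AgentEnsembleException e) {
        if (e instanceof ValidationException) {
            return "fix-configuration";   // no LLM was called; the bug is in code
        } else if (e instanceof MaxIterationsExceededException) {
            return "tune-workflow";       // agent looped; adjust tools or the limit
        } else {
            return "escalate";            // anything else goes to an operator
        }
    }

    public static void main(String[] args) {
        System.out.println(route(new ValidationException("task references unregistered agent")));
        System.out.println(route(new MaxIterationsExceededException("limit 10 exceeded")));
    }
}
```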

When a multi-task pipeline fails partway through, the work completed before the failure is not discarded. TaskExecutionException carries a list of TaskOutput objects for tasks that completed before the failure:

try {
    EnsembleOutput output = ensemble.run(inputs);
    saveResults(output);
} catch (TaskExecutionException e) {
    // Save whatever was completed before the failure
    for (TaskOutput partial : e.getCompletedTaskOutputs()) {
        savePartialResult(partial);
    }
    alertOnFailure(e.getTaskDescription(), e.getAgentRole());
}

This is operationally significant. In a five-task pipeline where task four fails, you still have the outputs of tasks one through three. You can save them, display them to a user, or use them to resume the pipeline from where it left off.
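A resume step only needs to know which tasks already finished. A minimal sketch using nothing beyond the standard library (the task names and the `remainingTasks` helper are illustrative, not framework API):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PipelineResume {
    // Given the full task list and the tasks that completed before the failure,
    // return the suffix of the pipeline that still needs to run.
    static List<String> remainingTasks(List<String> allTasks, Set<String> completed) {
        return allTasks.stream()
                .filter(t -> !completed.contains(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("research", "outline", "draft", "review", "publish");
        // Suppose task four ("review") failed: tasks one through three completed.
        Set<String> done = Set.of("research", "outline", "draft");
        System.out.println(remainingTasks(pipeline, done)); // [review, publish]
    }
}
```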

Not every non-completion is an error. EnsembleOutput.getExitReason() distinguishes between four terminal states:

Exit Reason       Meaning
COMPLETED         All tasks ran to completion normally
USER_EXIT_EARLY   A human reviewer chose to stop the pipeline
TIMEOUT           A review gate timeout expired
ERROR             An unrecoverable exception terminated the pipeline

EnsembleOutput output = ensemble.run();
switch (output.getExitReason()) {
    case COMPLETED:
        System.out.println("All done: " + output.getRaw());
        break;
    case USER_EXIT_EARLY:
        System.out.println("User stopped after "
                + output.completedTasks().size() + " task(s)");
        break;
    case TIMEOUT:
        System.out.println("Review gate timed out");
        break;
    case ERROR:
        // Typically handled via exception
        break;
}

The distinction between USER_EXIT_EARLY and TIMEOUT matters for operational dashboards. A user exit is intentional — the pipeline did its job and the human made a decision. A timeout might indicate a process problem (reviewer was not available) and may need escalation.
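For a dashboard, this reduces to mapping each terminal state to a severity. A self-contained sketch (the enum is a local mirror of the framework's exit reasons, and the severity labels are illustrative):

```java
public class ExitReasonDashboard {
    // Local mirror of the framework's exit reasons, so the sketch compiles on its own.
    enum ExitReason { COMPLETED, USER_EXIT_EARLY, TIMEOUT, ERROR }

    // Map each terminal state to a dashboard severity.
    static String severity(ExitReason reason) {
        switch (reason) {
            case COMPLETED:       return "ok";
            case USER_EXIT_EARLY: return "info";      // intentional human decision
            case TIMEOUT:         return "warning";   // reviewer unavailable? may need escalation
            case ERROR:           return "critical";
            default:              return "unknown";
        }
    }

    public static void main(String[] args) {
        for (ExitReason r : ExitReason.values()) {
            System.out.println(r + " -> " + severity(r));
        }
    }
}
```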

ValidationException

Thrown before any LLM calls when the ensemble or its components are configured incorrectly. Common causes include missing required fields, tasks referencing unregistered agents, circular context dependencies, or invalid iteration limits.

This exception is your build-time safety net. If you see it, the fix is always in the configuration code.

AgentExecutionException

Thrown when the LLM call itself fails — network errors, API errors, rate limiting, timeouts. Contains the agent role and task description so you can route the failure to the right team.

MaxIterationsExceededException

Thrown when an agent exceeds its maxIterations limit during the ReAct loop. Contains both the configured limit and the actual iteration count.

This is often a sign that the agent has too many tools and is cycling between them without making progress. The fix is usually to reduce the tool set, make tool descriptions more specific, or increase the iteration limit if the task genuinely requires many tool calls.
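One pragmatic response is to retry once with a raised limit, but only up to a hard cap — beyond that, the limit is not the problem and the tool set needs attention. A sketch (the cap and doubling policy are arbitrary choices, not framework defaults):

```java
public class IterationTuning {
    static final int HARD_CAP = 50;

    // Decide the next iteration limit after a MaxIterationsExceededException,
    // or -1 if retrying with a bigger limit is pointless and the workflow
    // itself should be tuned instead.
    static int nextLimit(int configuredLimit) {
        int doubled = configuredLimit * 2;
        return doubled <= HARD_CAP ? doubled : -1;
    }

    public static void main(String[] args) {
        System.out.println(nextLimit(10));  // 20: retry once with a doubled limit
        System.out.println(nextLimit(40));  // -1: reduce the tool set instead
    }
}
```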

PromptTemplateException

Thrown when a task description contains {variable} placeholders that were not resolved. The exception lists the missing variable names, making it straightforward to fix.
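The detection itself is simple. A standalone sketch of finding unresolved {variable} placeholders with a regex — illustrative only, since the framework's own checker is not shown in this post:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateCheck {
    // Matches {variable}-style placeholders and captures the variable name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Return the names of all placeholders left in a rendered template.
    static List<String> unresolvedVariables(String rendered) {
        List<String> missing = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(rendered);
        while (m.find()) {
            missing.add(m.group(1));
        }
        return missing;
    }

    public static void main(String[] args) {
        String task = "Summarize {topic} for the {audience} team";
        System.out.println(unresolvedVariables(task)); // [topic, audience]
    }
}
```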

GuardrailViolationException

Thrown when an input or output guardrail blocks execution. Contains the guardrail type (INPUT or OUTPUT), the violation message, the task description, and the agent role. This integrates with the guardrail system covered in the previous post.

AgentEnsemble does not include built-in retry logic. This is a deliberate design choice.

The reasoning is that retry policies are highly context-dependent. A rate-limited API call might benefit from exponential backoff. A malformed LLM response might benefit from a retry with the same prompt. A task that failed because the model cannot perform the requested work should not be retried at all.

For transient failures, implement retry at the call site:

int attempts = 0;
EnsembleOutput output = null;
while (attempts < 3) {
    try {
        output = ensemble.run(inputs);
        break;
    } catch (AgentExecutionException e) {
        attempts++;
        if (attempts == 3) throw e;
        try {
            Thread.sleep(1000L * attempts); // linear backoff: 1s, then 2s
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw e;
        }
    }
}

For production use, consider integrating a resilience library such as Resilience4j, which provides circuit breakers, rate limiters, and retry policies that compose well with the exception hierarchy.
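With Resilience4j, the hand-rolled loop above becomes a declarative policy. A sketch assuming `ensemble` and `inputs` are in scope and resilience4j-retry is on the classpath; the policy name and attempt/wait values are illustrative:

```java
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;
import java.util.function.Supplier;

// Retry only transient LLM-call failures, up to 3 attempts, 1s apart.
RetryConfig config = RetryConfig.custom()
        .maxAttempts(3)
        .waitDuration(Duration.ofSeconds(1))
        .retryExceptions(AgentExecutionException.class)
        .build();
Retry retry = Retry.of("ensemble-run", config);

// Decorate the call; the last failure propagates if all attempts are exhausted.
Supplier<EnsembleOutput> decorated =
        Retry.decorateSupplier(retry, () -> ensemble.run(inputs));
EnsembleOutput output = decorated.get();
```

Because the policy targets AgentExecutionException specifically, a ValidationException or GuardrailViolationException still fails fast, which matches the context-dependent retry reasoning above.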

The error handling design reflects a particular view of how agent systems should be operated: failures are expected, partial results are valuable, and the framework should give you structured information rather than opaque error strings.

The exception hierarchy makes it possible to build monitoring and alerting that distinguishes between configuration errors (fix the code), transient failures (retry or escalate), agent loops (tune the workflow), and intentional stops (human decision). The partial result preservation makes it possible to build resumable pipelines. The exit reasons make it possible to build dashboards that accurately represent pipeline outcomes.

None of this prevents failures. It gives you the handles to respond to them systematically.


The full error handling guide is in the documentation.

I’d be interested in whether you have found the exception hierarchy granularity to be sufficient, or whether there are failure modes in your agent systems that do not map cleanly to these categories.