Production-Minded Multi-Agent Orchestration in Java
The demo works. Your two-agent research-writer pipeline produces a decent article. Your hierarchical team generates a plausible report. Someone on the team says “let’s ship it.”
And then reality sets in.
How many tokens did that run consume? What happens when the LLM returns garbage JSON? Can we rate-limit calls so we don’t blow our API budget on day one? What if a task takes 90 seconds and produces hallucinated nonsense — does a human get to review it before it hits the customer?
These aren’t edge cases. They’re the difference between a demo and a deployment. This post covers what production multi-agent orchestration actually requires, and how AgentEnsemble handles each concern.
Observability: Know What Your Agents Are Doing
Every significant event in an ensemble run fires a callback. You register listeners on the builder:
```java
Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e -> logger.info("Starting: {}", e.taskDescription());
            case TaskCompleteEvent e -> logger.info("Completed: {} ({}ms, {} tokens)",
                e.taskDescription(), e.durationMs(), e.tokenCount());
            case TaskFailedEvent e -> logger.error("Failed: {} - {}",
                e.taskDescription(), e.errorMessage());
            case ToolCallEvent e -> logger.info("Tool call: {} -> {}",
                e.toolName(), e.result());
            default -> {} // other events
        }
    })
    .build()
    .run();
```

Events are typed. You pattern-match on them. No string parsing, no event name constants.
You can register multiple listeners, and they’re exception-safe — if one listener throws, the others still fire and the ensemble keeps running.
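That exception-safety guarantee is simple to picture. Here is a minimal, self-contained sketch of the pattern (not AgentEnsemble's actual source, and `ListenerDispatcher` is a name I made up): each listener runs inside its own try/catch, so one throwing listener cannot suppress the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Exception-safe fan-out: each listener is invoked in its own try/catch,
// so a misbehaving listener cannot stop the others or kill the run.
final class ListenerDispatcher<E> {
    private final List<Consumer<E>> listeners = new ArrayList<>();

    void register(Consumer<E> listener) {
        listeners.add(listener);
    }

    void fire(E event) {
        for (Consumer<E> listener : listeners) {
            try {
                listener.accept(event);
            } catch (RuntimeException ex) {
                // Log and continue; observability must never break execution.
                System.err.println("listener failed: " + ex.getMessage());
            }
        }
    }
}
```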
For more structured collection, implement the EnsembleListener interface:
```java
public class MetricsCollector implements EnsembleListener {
    private final MeterRegistry meterRegistry;
    private final Map<String, Long> taskDurations = new ConcurrentHashMap<>();

    public MetricsCollector(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void onTaskComplete(TaskCompleteEvent event) {
        taskDurations.put(event.taskDescription(), event.durationMs());
    }

    @Override
    public void onToolCall(ToolCallEvent event) {
        meterRegistry.counter("agent.tool.calls", "tool", event.toolName()).increment();
    }
}
```

AgentEnsemble integrates with Micrometer out of the box. Add the metrics module:
```kotlin
implementation("net.agentensemble:agentensemble-metrics-micrometer:2.3.0")
```

Then wire it up:
```java
MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .meterRegistry(registry)
    .build()
    .run();
```

You get counters for token usage, task completions, task failures, and tool calls. Gauges for active tasks. Timers for task and ensemble duration. All tagged with agent role and task description for Grafana/Prometheus filtering.
For post-mortem analysis, export full execution traces as JSON:
```java
EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();

// Or access the trace programmatically
ExecutionTrace trace = output.getTrace();
trace.getSpans().forEach(span ->
    System.out.printf("%s: %dms, %d tokens%n",
        span.getName(), span.getDurationMs(), span.getTokenCount()));
```

Each trace includes spans for every task, tool call, and LLM interaction. Duration, token counts, input/output payloads — all structured, all queryable.
Cost Tracking
Know what you’re spending:
```java
EnsembleOutput output = Ensemble.builder()
    // ...
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

EnsembleMetrics metrics = output.getMetrics();
System.out.printf("Total cost: $%.4f%n", metrics.getTotalCost());
System.out.printf("Input tokens: %d, Output tokens: %d%n",
    metrics.getInputTokens(), metrics.getOutputTokens());
```

You configure per-model pricing, and the framework tracks token consumption and computes costs across all tasks in the run.
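The arithmetic behind per-1k pricing is worth seeing once. A self-contained sketch (the `CostCalc` class is illustrative, not part of the API):

```java
// Per-1k-token pricing: divide each token count by 1000, multiply by its rate.
final class CostCalc {
    static double cost(long inputTokens, long outputTokens,
                       double inputCostPer1k, double outputCostPer1k) {
        return (inputTokens / 1000.0) * inputCostPer1k
             + (outputTokens / 1000.0) * outputCostPer1k;
    }
}
```

At $0.01/$0.03 per 1k, a run with 1,200 input and 800 output tokens costs 1.2 × 0.01 + 0.8 × 0.03 = $0.036.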
Error Handling: Fail Gracefully
Max Iterations
Agents can get stuck in loops — calling the same tool repeatedly, or generating output that fails validation. Cap it:
```java
Agent researcher = Agent.builder()
    .role("Researcher")
    .goal("Find information about {{topic}}")
    .maxIterations(10) // default is 25
    .build();
```

When the limit is hit, the agent returns its best output so far rather than running forever.
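The cap-and-return behavior trades completeness for bounded latency. A minimal sketch of the pattern (names are mine, not the framework's): iterate until the output validates or the budget runs out, then hand back the last attempt instead of throwing.

```java
import java.util.function.Predicate;
import java.util.function.Supplier;

// Bounded-iteration loop: stop when the output passes validation,
// or when maxIterations is exhausted; either way, return something.
final class IterationCap {
    static String runCapped(int maxIterations,
                            Supplier<String> step,
                            Predicate<String> isDone) {
        String last = "";
        for (int i = 0; i < maxIterations; i++) {
            last = step.get();
            if (isDone.test(last)) {
                return last;
            }
        }
        return last; // best output so far, not an exception
    }
}
```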
Output Retries
When a task has outputType() set and the LLM returns invalid JSON, the framework retries automatically:
```java
Task profileTask = Task.builder()
    .description("Create a competitor profile")
    .expectedOutput("Structured JSON profile")
    .agent(analyst)
    .outputType(CompetitorProfile.class)
    .maxOutputRetries(3) // retry up to 3 times on parse failure
    .build();
```

Parallel Error Strategy
In parallel workflows, one failing task shouldn’t necessarily kill the entire ensemble:
```java
Ensemble.builder()
    .agents(marketAnalyst, financialAnalyst, strategist)
    .tasks(marketTask, financialTask, summaryTask)
    .chatLanguageModel(model)
    .workflow(Workflow.parallel()
        .errorStrategy(ParallelErrorStrategy.CONTINUE_ON_ERROR)
        .build())
    .build()
    .run();
```

With CONTINUE_ON_ERROR, successful tasks still complete, and downstream tasks receive whatever results are available. The ensemble output tells you which tasks succeeded and which failed.
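The semantics are easy to pin down in isolation. Here is a self-contained sketch of continue-on-error collection (sequential for clarity; the real workflow runs tasks concurrently, and all names here are mine):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// CONTINUE_ON_ERROR semantics: run every task, keep the results that
// succeeded, and record failures instead of aborting the whole batch.
final class ContinueOnError {
    record Outcome(List<String> successes, List<String> failures) {}

    static Outcome runAll(List<Callable<String>> tasks) {
        List<String> successes = new ArrayList<>();
        List<String> failures = new ArrayList<>();
        for (Callable<String> task : tasks) {
            try {
                successes.add(task.call());
            } catch (Exception ex) {
                failures.add(ex.getMessage());
            }
        }
        return new Outcome(successes, failures);
    }
}
```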
Guardrails
Validate inputs and outputs at the framework level:
```java
Agent agent = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing content")
    .inputGuardrail(input -> {
        if (input.contains("competitor")) {
            return GuardrailResult.reject("Input must not reference competitors");
        }
        return GuardrailResult.accept();
    })
    .outputGuardrail(output -> {
        if (output.length() > 5000) {
            return GuardrailResult.reject("Output exceeds length limit");
        }
        return GuardrailResult.accept();
    })
    .build();
```

Guardrails run before/after each agent iteration. Rejection stops the agent and surfaces a clear error, not a silent failure.
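Conceptually, a guardrail is just a function from text to an accept/reject decision. A self-contained sketch, where `GuardrailResult` is a stand-in record rather than AgentEnsemble's real type:

```java
// A guardrail reduced to its essence: inspect text, return accept or
// reject-with-reason. The record here is illustrative only.
final class Guardrails {
    record GuardrailResult(boolean accepted, String reason) {
        static GuardrailResult accept() { return new GuardrailResult(true, null); }
        static GuardrailResult reject(String reason) { return new GuardrailResult(false, reason); }
    }

    static GuardrailResult checkInput(String input) {
        if (input.toLowerCase().contains("competitor")) {
            return GuardrailResult.reject("Input must not reference competitors");
        }
        return GuardrailResult.accept();
    }
}
```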
Cost Control: Don’t Blow Your Budget
Protect against runaway API costs:
```java
Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .rateLimit(RateLimit.builder()
        .maxRequestsPerMinute(60)
        .build())
    .build()
    .run();
```

The rate limiter applies across all agents in the ensemble. Requests that exceed the limit are queued, not dropped.
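The post doesn't show RateLimit's internals, but queue-not-drop rate limiting is typically a token bucket. A minimal sketch with an injectable clock for testability (all names are mine): callers over the limit get a wait time back instead of an error.

```java
// Token-bucket sketch: permits refill continuously; an exhausted bucket
// hands back a wait time, so excess requests queue rather than fail.
final class TokenBucket {
    private final double capacity;
    private final double refillPerNano; // permits per nanosecond
    private double tokens;
    private long lastNanos;

    TokenBucket(int maxRequestsPerMinute, long nowNanos) {
        this.capacity = maxRequestsPerMinute;
        this.refillPerNano = maxRequestsPerMinute / 60e9;
        this.tokens = capacity;
        this.lastNanos = nowNanos;
    }

    /** Nanoseconds the caller must wait before sending; 0 means go now. */
    synchronized long reserveNanos(long nowNanos) {
        tokens = Math.min(capacity, tokens + (nowNanos - lastNanos) * refillPerNano);
        lastNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return 0L;
        }
        double deficit = 1.0 - tokens;
        tokens -= 1.0; // go negative: the request is queued, not rejected
        return (long) Math.ceil(deficit / refillPerNano);
    }
}
```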
Delegation Depth
In hierarchical workflows, prevent infinite delegation chains:
```java
Ensemble.builder()
    // ...
    .workflow(Workflow.HIERARCHICAL)
    .maxDelegationDepth(3)
    .build()
    .run();
```

Human-in-the-Loop: Review Before You Ship
The most production-critical feature is often the simplest: letting a human review agent output before it’s treated as final.
```java
Scanner scanner = new Scanner(System.in);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("Agent produced:\n" + taskOutput.getRaw());
        System.out.print("Approve? (y/n/edit): ");

        String response = scanner.nextLine();
        if (response.equals("y")) {
            return Review.approve();
        } else if (response.equals("n")) {
            return Review.reject("Quality below threshold");
        } else {
            System.out.print("Enter corrected output: ");
            return Review.edit(scanner.nextLine());
        }
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();
```

Review policies let you control which tasks trigger review:
- REVIEW_ALL — every task output goes through review.
- REVIEW_FAILED — only tasks that hit errors or retries.
- FIRST_TASK_ONLY — review the first output to calibrate, then let the rest run.
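The three policies above reduce to a small predicate over task position and error state. A self-contained sketch (the enum and method names are mine, chosen to mirror the docs):

```java
// The three review policies as a single decision function.
enum ReviewPolicy { REVIEW_ALL, REVIEW_FAILED, FIRST_TASK_ONLY }

final class ReviewGate {
    static boolean needsReview(ReviewPolicy policy, int taskIndex, boolean hadErrors) {
        return switch (policy) {
            case REVIEW_ALL -> true;
            case REVIEW_FAILED -> hadErrors;
            case FIRST_TASK_ONLY -> taskIndex == 0;
        };
    }
}
```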
You can also add pre-flight validation with beforeReview():
```java
Ensemble.builder()
    // ...
    .beforeReview(taskOutput -> {
        // Automated quality check before human review
        if (taskOutput.getRaw().length() < 100) {
            return Review.reject("Output too short, re-running");
        }
        return Review.skip(); // passes automated check, proceed to human
    })
    .build()
    .run();
```

Testing: Capture and Replay
Production confidence starts with testability. AgentEnsemble’s capture mode records full execution data:
```java
EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // record everything
    .traceExporter(TraceExporter.json(Path.of("test-traces/")))
    .build()
    .run();
```

CaptureMode.FULL records:
- Full LLM message history per iteration (system prompt, user messages, assistant responses)
- Tool call inputs and outputs
- Token counts per interaction
- Memory operations
- Timing data
This gives you deterministic test fixtures. Capture a run once, then write assertions against the trace without making live LLM calls.
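A fixture-backed test then looks something like this sketch. The spans are inlined here for brevity (in practice you'd parse the exported JSON), and the `Span` record is a stand-in for the trace's real span type:

```java
import java.util.List;

// Assert budgets against a captured trace: no live LLM call needed.
final class TraceAssertions {
    record Span(String name, long durationMs, long tokenCount) {}

    static long totalTokens(List<Span> spans) {
        return spans.stream().mapToLong(Span::tokenCount).sum();
    }
}
```

Once the trace is on disk, this style of assertion runs in milliseconds in CI and catches token-budget or latency regressions before they reach production.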
The Production Checklist
Here’s a quick reference for what to enable before shipping:
| Concern | Configuration |
|---|---|
| Logging | SLF4J is used throughout; configure your logging framework |
| Event callbacks | .listener() for real-time monitoring |
| Metrics | .meterRegistry() with Micrometer |
| Traces | .traceExporter() for structured JSON |
| Cost tracking | .costConfiguration() with your model’s pricing |
| Rate limiting | .rateLimit() to protect API budgets |
| Error strategy | ParallelErrorStrategy.CONTINUE_ON_ERROR for resilience |
| Review gates | .reviewHandler() + .reviewPolicy() for human oversight |
| Guardrails | .inputGuardrail() / .outputGuardrail() on agents |
| Testing | .captureMode(CaptureMode.FULL) for trace capture |
None of these require custom infrastructure. They’re all builder methods on the same API you already use to define agents and tasks.
Production-grade agent orchestration isn’t about adding a monitoring sidecar after the fact. It’s about choosing a framework that was built for production from the start.
Get started:
- Documentation — guides, examples, and API reference
- Getting Started — up and running in 5 minutes
- Examples — runnable code for every pattern
- GitHub — source, issues, and contributions
AgentEnsemble is MIT-licensed and available on GitHub.