Production-Minded Multi-Agent Orchestration in Java

The demo works. Your two-agent research-writer pipeline produces a decent article. Your hierarchical team generates a plausible report. Someone on the team says “let’s ship it.”

And then reality sets in.

How many tokens did that run consume? What happens when the LLM returns garbage JSON? Can we rate-limit calls so we don’t blow our API budget on day one? What if a task takes 90 seconds and produces hallucinated nonsense — does a human get to review it before it hits the customer?

These aren’t edge cases. They’re the difference between a demo and a deployment. This post covers what production multi-agent orchestration actually requires, and how AgentEnsemble handles each concern.

Observability: Know What Your Agents Are Doing

Every significant event in an ensemble run fires a callback. You register listeners on the builder:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                logger.info("Starting: {}", e.taskDescription());
            case TaskCompleteEvent e ->
                logger.info("Completed: {} ({}ms, {} tokens)",
                    e.taskDescription(), e.durationMs(), e.tokenCount());
            case TaskFailedEvent e ->
                logger.error("Failed: {} - {}", e.taskDescription(),
                    e.errorMessage());
            case ToolCallEvent e ->
                logger.info("Tool call: {} -> {}", e.toolName(),
                    e.result());
            default -> {} // other events
        }
    })
    .build()
    .run();

Events are typed. You pattern-match on them. No string parsing, no event name constants.

You can register multiple listeners, and they’re exception-safe — if one listener throws, the others still fire and the ensemble keeps running.
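
That fan-out is easy to picture in plain Java. The sketch below shows the principle, not AgentEnsemble's actual dispatcher: listeners are modeled as plain Consumers, and a throwing listener is caught so the rest still run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class SafeDispatch {

    // Fire an event to every listener; a throwing listener is logged and
    // skipped so the remaining listeners (and the run itself) continue.
    static void fire(List<Consumer<String>> listeners, String event) {
        for (Consumer<String> listener : listeners) {
            try {
                listener.accept(event);
            } catch (RuntimeException ex) {
                System.err.println("listener threw: " + ex.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        fire(List.of(
                e -> { throw new RuntimeException("boom"); }, // misbehaving listener
                seen::add),                                   // healthy listener
            "TaskStartEvent");
        System.out.println(seen); // the healthy listener still received the event
    }
}
```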

For more structured collection, implement the EnsembleListener interface:

public class MetricsCollector implements EnsembleListener {

    private final MeterRegistry meterRegistry;
    private final Map<String, Long> taskDurations = new ConcurrentHashMap<>();

    public MetricsCollector(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void onTaskComplete(TaskCompleteEvent event) {
        taskDurations.put(event.taskDescription(), event.durationMs());
    }

    @Override
    public void onToolCall(ToolCallEvent event) {
        meterRegistry.counter("agent.tool.calls",
            "tool", event.toolName()).increment();
    }
}

AgentEnsemble integrates with Micrometer out of the box. Add the metrics module:

implementation("net.agentensemble:agentensemble-metrics-micrometer:2.3.0")

Then wire it up:

MeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .meterRegistry(registry)
    .build()
    .run();

You get counters for token usage, task completions, task failures, and tool calls. Gauges for active tasks. Timers for task and ensemble duration. All tagged with agent role and task description for Grafana/Prometheus filtering.

For post-mortem analysis, export full execution traces as JSON:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();

// Or access the trace programmatically
ExecutionTrace trace = output.getTrace();
trace.getSpans().forEach(span ->
    System.out.printf("%s: %dms, %d tokens%n",
        span.getName(), span.getDurationMs(), span.getTokenCount()));

Each trace includes spans for every task, tool call, and LLM interaction. Duration, token counts, input/output payloads — all structured, all queryable.

Know what you’re spending:

EnsembleOutput output = Ensemble.builder()
    // ...
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();

EnsembleMetrics metrics = output.getMetrics();
System.out.printf("Total cost: $%.4f%n", metrics.getTotalCost());
System.out.printf("Input tokens: %d, Output tokens: %d%n",
    metrics.getInputTokens(), metrics.getOutputTokens());

You configure per-model pricing, and the framework tracks token consumption and computes costs across all tasks in the run.
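
The arithmetic itself is simple. As a standalone illustration of the formula at the example rates above ($0.01 per 1k input tokens, $0.03 per 1k output tokens) — the helper class here is hypothetical, not part of the framework:

```java
public class CostMath {

    // Per-1k-token pricing, mirroring the CostConfiguration example above.
    static double cost(long inputTokens, long outputTokens,
                       double inputPer1k, double outputPer1k) {
        return inputTokens / 1000.0 * inputPer1k
             + outputTokens / 1000.0 * outputPer1k;
    }

    public static void main(String[] args) {
        // 12,000 input and 4,000 output tokens at the example rates:
        // 12 * $0.01 + 4 * $0.03 = $0.24
        System.out.printf("Total cost: $%.4f%n", cost(12_000, 4_000, 0.01, 0.03));
    }
}
```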

Agents can get stuck in loops — calling the same tool repeatedly, or generating output that fails validation. Cap it:

Agent researcher = Agent.builder()
    .role("Researcher")
    .goal("Find information about {{topic}}")
    .maxIterations(10) // default is 25
    .build();

When the limit is hit, the agent returns its best output so far rather than running forever.
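
The cap-and-fallback behavior can be sketched in plain Java. Everything below is a hypothetical stand-in (attempt supplier, scoring function), not AgentEnsemble internals:

```java
import java.util.function.Supplier;
import java.util.function.ToIntFunction;

public class IterationCap {

    // Run up to maxIterations attempts, remember the best-scoring output,
    // and return it even if no attempt was ever ideal: the loop is bounded,
    // so the agent cannot spin forever.
    static String runCapped(Supplier<String> attempt,
                            ToIntFunction<String> score,
                            int maxIterations) {
        String best = "";
        int bestScore = Integer.MIN_VALUE;
        for (int i = 0; i < maxIterations; i++) {
            String out = attempt.get();
            int s = score.applyAsInt(out);
            if (s > bestScore) {
                best = out;
                bestScore = s;
            }
        }
        return best; // cap hit: best effort so far
    }
}
```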

When a task has outputType() set and the LLM returns invalid JSON, the framework retries automatically:

Task profileTask = Task.builder()
    .description("Create a competitor profile")
    .expectedOutput("Structured JSON profile")
    .agent(analyst)
    .outputType(CompetitorProfile.class)
    .maxOutputRetries(3) // retry up to 3 times on parse failure
    .build();
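
A plausible shape for that retry loop, sketched with a stand-in generate function in place of a real LLM call (a hypothetical helper, not the framework's code):

```java
import java.util.function.Function;
import java.util.function.Supplier;

public class OutputRetry {

    // Re-generate and re-parse until the output parses, up to maxRetries
    // extra attempts; if every attempt fails, surface the last parse error.
    static <T> T parseWithRetry(Supplier<String> generate,
                                Function<String, T> parse,
                                int maxRetries) {
        RuntimeException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return parse.apply(generate.get());
            } catch (RuntimeException e) {
                last = e; // parse failed; regenerate and try again
            }
        }
        throw last; // retries exhausted
    }
}
```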

In parallel workflows, one failing task shouldn’t necessarily kill the entire ensemble:

Ensemble.builder()
    .agents(marketAnalyst, financialAnalyst, strategist)
    .tasks(marketTask, financialTask, summaryTask)
    .chatLanguageModel(model)
    .workflow(Workflow.parallel()
        .errorStrategy(ParallelErrorStrategy.CONTINUE_ON_ERROR)
        .build())
    .build()
    .run();

With CONTINUE_ON_ERROR, successful tasks still complete, and downstream tasks receive whatever results are available. The ensemble output tells you which tasks succeeded and which failed.

Validate inputs and outputs at the framework level:

Agent agent = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing content")
    .inputGuardrail(input -> {
        if (input.contains("competitor")) {
            return GuardrailResult.reject(
                "Input must not reference competitors");
        }
        return GuardrailResult.accept();
    })
    .outputGuardrail(output -> {
        if (output.length() > 5000) {
            return GuardrailResult.reject("Output exceeds length limit");
        }
        return GuardrailResult.accept();
    })
    .build();

Guardrails run before/after each agent iteration. Rejection stops the agent and surfaces a clear error, not a silent failure.

Protect against runaway API costs:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .rateLimit(RateLimit.builder()
        .maxRequestsPerMinute(60)
        .build())
    .build()
    .run();

The rate limiter applies across all agents in the ensemble. Requests that exceed the limit are queued, not dropped.
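
Queue-not-drop semantics can be illustrated with a minimal fixed-window limiter. This is a sketch of the idea only; it says nothing about how AgentEnsemble implements its limiter internally:

```java
import java.util.concurrent.TimeUnit;

public class FixedWindowLimiter {

    private final int maxPerWindow;
    private final long windowNanos;
    private long windowStart = System.nanoTime();
    private int used = 0;

    FixedWindowLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowNanos = TimeUnit.MILLISECONDS.toNanos(windowMillis);
    }

    // Blocks until a slot opens in the current window: requests over the
    // limit are queued behind this call, never dropped.
    synchronized void acquire() throws InterruptedException {
        while (true) {
            long now = System.nanoTime();
            if (now - windowStart >= windowNanos) {
                windowStart = now; // new window, reset the budget
                used = 0;
            }
            if (used < maxPerWindow) {
                used++;
                return;
            }
            // Window is full: sleep out the remainder, then re-check.
            long sleepMs = TimeUnit.NANOSECONDS.toMillis(
                windowNanos - (now - windowStart)) + 1;
            wait(sleepMs);
        }
    }
}
```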

In hierarchical workflows, prevent infinite delegation chains:

Ensemble.builder()
    // ...
    .workflow(Workflow.HIERARCHICAL)
    .maxDelegationDepth(3)
    .build()
    .run();

The most production-critical feature is often the simplest: letting a human review agent output before it’s treated as final.

Scanner scanner = new Scanner(System.in);

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("Agent produced:\n" + taskOutput.getRaw());
        System.out.print("Approve? (y/n/edit): ");
        String response = scanner.nextLine();
        if (response.equals("y")) {
            return Review.approve();
        } else if (response.equals("n")) {
            return Review.reject("Quality below threshold");
        } else {
            System.out.print("Enter corrected output: ");
            return Review.edit(scanner.nextLine());
        }
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Review policies let you control which tasks trigger review:

  • REVIEW_ALL — every task output goes through review.
  • REVIEW_FAILED — only tasks that hit errors or retries.
  • FIRST_TASK_ONLY — review the first output to calibrate, then let the rest run.

You can also add pre-flight validation with beforeReview():

Ensemble.builder()
    // ...
    .beforeReview(taskOutput -> {
        // Automated quality check before human review
        if (taskOutput.getRaw().length() < 100) {
            return Review.reject("Output too short, re-running");
        }
        return Review.skip(); // passes automated check, proceed to human
    })
    .build()
    .run();

Production confidence starts with testability. AgentEnsemble’s capture mode records full execution data:

EnsembleOutput output = Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .captureMode(CaptureMode.FULL) // record everything
    .traceExporter(TraceExporter.json(Path.of("test-traces/")))
    .build()
    .run();

CaptureMode.FULL records:

  • Full LLM message history per iteration (system prompt, user messages, assistant responses)
  • Tool call inputs and outputs
  • Token counts per interaction
  • Memory operations
  • Timing data

This gives you deterministic test fixtures. Capture a run once, then write assertions against the trace without making live LLM calls.

Here’s a quick reference for what to enable before shipping:

  • Logging — SLF4J is used throughout; configure your logging framework
  • Event callbacks — .listener() for real-time monitoring
  • Metrics — .meterRegistry() with Micrometer
  • Traces — .traceExporter() for structured JSON
  • Cost tracking — .costConfiguration() with your model’s pricing
  • Rate limiting — .rateLimit() to protect API budgets
  • Error strategy — ParallelErrorStrategy.CONTINUE_ON_ERROR for resilience
  • Review gates — .reviewHandler() + .reviewPolicy() for human oversight
  • Guardrails — .inputGuardrail() / .outputGuardrail() on agents
  • Testing — .captureMode(CaptureMode.FULL) for trace capture

None of these require custom infrastructure. They’re all builder methods on the same API you already use to define agents and tasks.

Production-grade agent orchestration isn’t about adding a monitoring sidecar after the fact. It’s about choosing a framework that was built for production from the start.

Get started: AgentEnsemble is MIT-licensed and available on GitHub.