Debugging Multi-Agent Systems: Traces, Capture Mode, and Live Dashboards
Multi-agent systems are hard to debug.
It’s not the same as debugging a web request or a database query. You can’t set a breakpoint in the middle of an LLM call. You can’t predict what the model will say. When an agent produces bad output, you need to understand the full chain of events: what prompt was sent, what the model returned, which tools were called, what context from previous tasks was injected, and whether the output parsing succeeded.
Traditional debuggers don’t help here. You need purpose-built observability.
This post covers the debugging and observability stack in AgentEnsemble: structured traces for post-mortem analysis, capture mode for recording full execution state, and the live dashboard for real-time visibility during development.
The Debugging Challenge
Consider a three-agent pipeline: Researcher, Analyst, Writer. The Writer produces a report that’s factually wrong. Where did things go wrong?
- Did the Researcher find bad information?
- Did the Analyst misinterpret the research?
- Did the Writer ignore the analysis and hallucinate?
- Did a tool call return unexpected results?
- Was the wrong context passed between tasks?
Without observability, you’re guessing. With it, you’re reading a log.
Layer 1: Structured Traces
The most broadly useful debugging tool is the structured trace. It records every significant event in an ensemble run as a tree of spans:
```java
EnsembleOutput output = Ensemble.builder()
        .agents(researcher, analyst, writer)
        .tasks(researchTask, analysisTask, writeTask)
        .chatLanguageModel(model)
        .traceExporter(TraceExporter.json(Path.of("traces/")))
        .build()
        .run();
```

This produces a JSON file in the traces/ directory with a structure like:
```
Ensemble Run (total: 8,420ms, 5,230 tokens)
|
+-- Task: Research emerging trends (3,240ms, 1,847 tokens)
|   +-- LLM Call #1 (1,900ms, 1,200 tokens)
|   +-- Tool: WebSearch "emerging tech trends 2024" (890ms)
|   +-- LLM Call #2 (450ms, 647 tokens)
|
+-- Task: Analyze research findings (2,180ms, 1,583 tokens)
|   +-- LLM Call #1 (2,180ms, 1,583 tokens)
|
+-- Task: Write final report (3,000ms, 1,800 tokens)
    +-- LLM Call #1 (2,400ms, 1,400 tokens)
    +-- LLM Call #2 (600ms, 400 tokens)  // output retry
```

Each span records:
| Field | Description |
|---|---|
| Name | Task description or tool call name |
| Duration | Wall-clock time in milliseconds |
| Token count | Input + output tokens for LLM calls |
| Status | Success, failure, or retry |
| Input/Output | What went in, what came out |
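As a rough mental model, a span is a small record whose totals roll up recursively through the tree. The sketch below uses hypothetical field names that mirror the table above, not the actual AgentEnsemble types:

```java
import java.util.List;

public class SpanModel {
    // Hypothetical span shape; field names follow the table above,
    // not the real framework API.
    record Span(String name, long durationMs, int tokenCount,
                String status, List<Span> children) {}

    // Tokens roll up: a span's total is its own count plus all descendants'.
    static int totalTokens(Span span) {
        int sum = span.tokenCount();
        for (Span child : span.children()) {
            sum += totalTokens(child);
        }
        return sum;
    }

    public static void main(String[] args) {
        Span llm1 = new Span("LLM Call #1", 1900, 1200, "SUCCESS", List.of());
        Span llm2 = new Span("LLM Call #2", 450, 647, "SUCCESS", List.of());
        Span task = new Span("Research emerging trends", 3240, 0, "SUCCESS",
                List.of(llm1, llm2));
        System.out.println(totalTokens(task)); // 1847
    }
}
```

This matches the trace tree above: the research task's 1,847 tokens are the sum of its two LLM calls.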
Accessing Traces Programmatically
You don’t have to read the JSON file. The trace is available on the EnsembleOutput:
```java
ExecutionTrace trace = output.getTrace();

// Walk the span tree
for (TraceSpan span : trace.getSpans()) {
    System.out.printf("[%s] %s -- %dms, %d tokens%n",
            span.getStatus(), span.getName(),
            span.getDurationMs(), span.getTokenCount());

    for (TraceSpan child : span.getChildren()) {
        System.out.printf("  [%s] %s -- %dms%n",
                child.getStatus(), child.getName(), child.getDurationMs());
    }
}
```

This is useful for writing assertions in tests:
```java
@Test
void ensembleShouldCompleteAllTasks() {
    EnsembleOutput output = ensemble.run();

    ExecutionTrace trace = output.getTrace();
    assertThat(trace.getSpans()).hasSize(3);
    assertThat(trace.getSpans())
            .allMatch(span -> span.getStatus() == TraceStatus.SUCCESS);
    assertThat(output.getMetrics().getTotalTokens()).isLessThan(10_000);
}
```

Trace Export for Analysis Pipelines
The JSON trace format is designed for programmatic consumption. Feed it into your log aggregation system, build custom analysis scripts, or import it into a notebook:
```java
// Export to a specific directory with timestamped filenames
.traceExporter(TraceExporter.json(Path.of("traces/")))

// Or get the raw JSON string
String traceJson = output.getTrace().toJson();
logAggregator.ingest("agent-trace", traceJson);
```

Layer 2: Capture Mode
Traces tell you what happened. Capture mode tells you exactly what happened — including the full prompts, raw LLM responses, and tool call payloads.
Three Levels
```java
Ensemble.builder()
        .agents(researcher, writer)
        .tasks(researchTask, writeTask)
        .chatLanguageModel(model)
        .captureMode(CaptureMode.FULL)  // OFF, STANDARD, or FULL
        .build()
        .run();
```

| Level | What’s Captured | Use Case |
|---|---|---|
| OFF | Standard metrics only | Production |
| STANDARD | + Full LLM message history per iteration, memory operations | Staging, initial deployment |
| FULL | + Tool call I/O payloads, raw LLM responses, detailed timing | Development, debugging |
What STANDARD Adds
With CaptureMode.STANDARD, each task’s execution record includes the full conversation between the framework and the LLM:
```
Task: Research emerging trends
  Iteration 1:
    System prompt: "You are Senior Research Analyst. Your goal is..."
    User message: "Research emerging trends in AI thoroughly..."
    Assistant response: "I'll search for the latest information..."
    Tool call: WebSearch("emerging AI trends 2024")
  Iteration 2:
    System prompt: [same]
    User message: [previous context + tool result]
    Assistant response: "Based on my research, here are the key..."
```

This is invaluable for understanding why an agent behaved a certain way. You can see exactly what prompt it received, what context was injected, and how it reasoned through the task.
What FULL Adds
CaptureMode.FULL adds the raw payloads for every interaction:
- Tool call inputs: The exact arguments passed to each tool.
- Tool call outputs: The exact response from each tool.
- Raw LLM responses: The complete response body, including any JSON that was parsed.
- Timing breakdowns: Per-iteration timing, not just per-task.
This is the level you use when something is wrong and you can’t figure out why from the trace alone. It’s verbose — expect significantly more data — but it gives you full replay capability.
Using Capture Data in Tests
Capture mode is a testing power tool. Record a full execution, then write assertions against the captured data:
```java
@Test
void researcherShouldUseWebSearch() {
    EnsembleOutput output = Ensemble.builder()
            .agents(researcher, writer)
            .tasks(researchTask, writeTask)
            .chatLanguageModel(model)
            .captureMode(CaptureMode.FULL)
            .build()
            .run();

    // Verify the researcher used the web search tool
    ExecutionTrace trace = output.getTrace();
    TraceSpan researchSpan = trace.getSpans().get(0);

    boolean usedWebSearch = researchSpan.getChildren().stream()
            .anyMatch(child -> child.getName().contains("WebSearch"));
    assertThat(usedWebSearch).isTrue();
}
```

You can also capture a “golden run” and use it as a reference for regression testing — comparing future runs against the expected execution pattern.
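One lightweight way to implement the golden-run idea is to reduce each run to its ordered list of span names and report where a new run first diverges from the recorded reference. This comparison helper is an illustration, not a framework feature:

```java
import java.util.List;

public class GoldenRun {
    // Returns the index where `actual` first diverges from the recorded
    // `golden` sequence, or -1 if the runs match exactly. A shorter or
    // longer run diverges at the first missing/extra position.
    static int firstDivergence(List<String> golden, List<String> actual) {
        int n = Math.min(golden.size(), actual.size());
        for (int i = 0; i < n; i++) {
            if (!golden.get(i).equals(actual.get(i))) return i;
        }
        return golden.size() == actual.size() ? -1 : n;
    }

    public static void main(String[] args) {
        List<String> golden = List.of(
                "Research emerging trends", "WebSearch", "Write final report");
        List<String> actual = List.of(
                "Research emerging trends", "WebSearch", "Write final report");
        System.out.println(firstDivergence(golden, actual)); // -1 (no divergence)
    }
}
```

In a regression test, a non-negative result pinpoints the first task or tool call that deviated from the golden pattern.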
Layer 3: Event Callbacks
For real-time debugging during development, callbacks give you a live stream of execution events:
```java
Ensemble.builder()
        .agents(researcher, analyst, writer)
        .tasks(researchTask, analysisTask, writeTask)
        .chatLanguageModel(model)
        .listener(event -> {
            switch (event) {
                case TaskStartEvent e ->
                    System.out.printf("%n>>> Starting: %s (agent: %s)%n",
                            e.taskDescription(), e.agentRole());

                case TaskCompleteEvent e ->
                    System.out.printf("<<< Completed: %s (%dms, %d tokens)%n",
                            e.taskDescription(), e.durationMs(), e.tokenCount());

                case TaskFailedEvent e ->
                    System.err.printf("!!! Failed: %s -- %s%n",
                            e.taskDescription(), e.errorMessage());

                case ToolCallEvent e ->
                    System.out.printf("  [tool] %s(%s) -> %s%n",
                            e.toolName(), truncate(e.input(), 50),
                            truncate(e.result(), 100));

                case DelegationStartedEvent e ->
                    System.out.printf("  [delegate] %s -> %s%n",
                            e.fromAgent(), e.toAgent());

                case TokenEvent e ->
                    // Streaming: print tokens as they arrive
                    System.out.print(e.token());

                default -> {}
            }
        })
        .build()
        .run();
```
default -> {} } }) .build() .run();This gives you a live play-by-play of the ensemble execution in your terminal. You see each task start and complete, each tool call and its result, and each delegation in hierarchical workflows.
Combining Callbacks with Logging
For persistent debugging output, route events to your logging framework:
```java
.listener(event -> {
    if (event instanceof TaskCompleteEvent e) {
        log.info("Task completed: task={}, agent={}, duration={}ms, tokens={}",
                e.taskDescription(), e.agentRole(), e.durationMs(), e.tokenCount());
    }
    if (event instanceof TaskFailedEvent e) {
        log.error("Task failed: task={}, error={}",
                e.taskDescription(), e.errorMessage());
    }
})
```

These flow into your existing log aggregation pipeline (ELK, Splunk, CloudWatch Logs) alongside your application’s other logs.
Layer 4: The Live Dashboard
For the most visual debugging experience, AgentEnsemble includes a live browser dashboard:
```java
Ensemble.builder()
        .agents(researcher, analyst, writer)
        .tasks(researchTask, analysisTask, writeTask)
        .chatLanguageModel(model)
        .devtools(Devtools.enabled())
        .build()
        .run();
```

When the ensemble starts, a browser window opens (or a URL is printed to the console) showing a real-time visualization of the execution.
What the Dashboard Shows
- DAG Visualization: A graph of all tasks and their dependencies. Nodes change color as tasks progress from pending to running to completed.
- Agent Activity: Which agent is currently active, what it’s doing, and how many iterations it’s taken.
- Token Consumption: Real-time token counters per task and for the entire ensemble.
- Task Output Preview: Click on a completed task to see its output.
- Timeline: A Gantt-chart-style view of task execution, showing parallelism and bottlenecks.
When to Use It
The live dashboard is a development tool, not a production monitoring dashboard. Use it when:
- Building a new agent workflow and you want to see the execution flow.
- Debugging why a specific task takes too long or produces unexpected output.
- Demonstrating an agent system to stakeholders.
- Understanding the parallelism in a DAG or MapReduce workflow.
For production monitoring, use the Micrometer metrics integration and your existing Grafana/Prometheus stack.
Debugging Recipes
Here are specific debugging scenarios and how to approach them with the tools above.
“The output is wrong, but I don’t know which agent failed”
Use traces. Look at each task’s output in the trace tree. Find the first task whose output is incorrect — that’s where things diverged.
```java
.traceExporter(TraceExporter.json(Path.of("debug/")))
```

Then read the trace JSON, find the task with bad output, and check its input context to see what it received from upstream tasks.
“The agent keeps calling the same tool in a loop”
Use capture mode + callbacks. Enable CaptureMode.FULL and add a callback that logs tool calls:
```java
.captureMode(CaptureMode.FULL)
.listener(event -> {
    if (event instanceof ToolCallEvent e) {
        log.warn("Tool call: {} with input: {}", e.toolName(), e.input());
    }
})
```

Then check the captured LLM conversation to see why the agent keeps making the same call. Usually it’s a prompt issue — the agent doesn’t recognize the tool result as sufficient.
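If you want the callback itself to flag a loop as it happens, a small repeat counter does the job. Only the counting logic is shown here; wiring it into a tool-call listener follows the same pattern as the logging snippet above:

```java
import java.util.HashMap;
import java.util.Map;

public class LoopDetector {
    private final Map<String, Integer> counts = new HashMap<>();
    private final int threshold;

    LoopDetector(int threshold) {
        this.threshold = threshold;
    }

    // Returns true once the same tool has been called with the same input
    // `threshold` times -- a strong sign the agent is stuck in a loop.
    boolean record(String toolName, String input) {
        String key = toolName + "|" + input;
        return counts.merge(key, 1, Integer::sum) >= threshold;
    }

    public static void main(String[] args) {
        LoopDetector detector = new LoopDetector(3);
        System.out.println(detector.record("WebSearch", "AI trends")); // false
        System.out.println(detector.record("WebSearch", "AI trends")); // false
        System.out.println(detector.record("WebSearch", "AI trends")); // true
    }
}
```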
“The structured output parsing keeps failing”
Use capture mode. Enable CaptureMode.FULL and check the raw LLM response:
```java
.captureMode(CaptureMode.FULL)
```

The captured data includes the raw response before parsing. Compare it to your record schema. Common issues:
- The LLM wraps JSON in markdown code blocks.
- Field names don’t match (the LLM uses camelCase, the record uses snake_case).
- The LLM adds extra fields or comments.
The framework handles most of these, but FULL capture mode shows you exactly what’s happening.
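The first issue, markdown-wrapped JSON, is also easy to normalize yourself if you ever pre-process raw captured responses. This helper is illustrative, not part of the framework:

```java
public class JsonUnwrap {
    // Strips a surrounding ```json ... ``` (or plain ```) fence that LLMs
    // often add around structured output, leaving the raw JSON.
    static String stripMarkdownFence(String response) {
        String s = response.strip();
        if (s.startsWith("```")) {
            int firstNewline = s.indexOf('\n');
            int lastFence = s.lastIndexOf("```");
            if (firstNewline >= 0 && lastFence > firstNewline) {
                s = s.substring(firstNewline + 1, lastFence).strip();
            }
        }
        return s;
    }

    public static void main(String[] args) {
        String raw = "```json\n{\"title\": \"Report\"}\n```";
        System.out.println(stripMarkdownFence(raw)); // {"title": "Report"}
    }
}
```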
“A parallel workflow is slower than expected”
Use the live dashboard. Enable devtools and look at the timeline view:
```java
.devtools(Devtools.enabled())
```

You’ll see whether tasks are actually running in parallel or if there’s an unexpected dependency bottleneck. Common issues:
- A task accidentally depends on another task via context() when it shouldn’t.
- One task takes much longer than the others, creating a bottleneck for downstream tasks.
- Rate limiting is causing parallel tasks to serialize.
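You can also check for serialization directly from exported trace data: if sibling task spans never overlap in time, they ran sequentially. A sketch, assuming you have already extracted per-task start and end timestamps from the trace:

```java
import java.util.List;

public class OverlapCheck {
    record Interval(String name, long startMs, long endMs) {}

    // Returns true if any two task intervals overlap, i.e. at least some
    // work actually ran in parallel. All-false on "parallel" tasks means
    // they serialized (e.g. due to rate limiting).
    static boolean anyOverlap(List<Interval> tasks) {
        for (int i = 0; i < tasks.size(); i++) {
            for (int j = i + 1; j < tasks.size(); j++) {
                Interval a = tasks.get(i);
                Interval b = tasks.get(j);
                if (a.startMs() < b.endMs() && b.startMs() < a.endMs()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Two "parallel" tasks that actually ran back to back:
        List<Interval> serialized = List.of(
                new Interval("fetchA", 0, 900),
                new Interval("fetchB", 900, 1800));
        System.out.println(anyOverlap(serialized)); // false
    }
}
```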
“I need to understand the full prompt the agent received”
Use CaptureMode.STANDARD or CaptureMode.FULL. The captured data includes the complete system prompt, user message, and any injected context for each LLM call.
This is the only way to see the actual prompt — the framework constructs it dynamically from the agent’s role/goal/background, the task description, context from previous tasks, and tool results.
Putting It All Together
A typical debugging setup during development:
```java
EnsembleOutput output = Ensemble.builder()
        .agents(researcher, analyst, writer)
        .tasks(researchTask, analysisTask, writeTask)
        .chatLanguageModel(model)
        // Full observability stack
        .captureMode(CaptureMode.FULL)
        .traceExporter(TraceExporter.json(Path.of("traces/")))
        .devtools(Devtools.enabled())
        .listener(event -> {
            if (event instanceof TaskCompleteEvent e) {
                log.info("[DONE] {} -- {}ms", e.taskDescription(), e.durationMs());
            }
            if (event instanceof ToolCallEvent e) {
                log.info("[TOOL] {} -> {}", e.toolName(), e.result());
            }
        })
        .costConfiguration(CostConfiguration.builder()
                .inputTokenCostPer1k(0.01)
                .outputTokenCostPer1k(0.03)
                .build())
        .build()
        .run();
```
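With the per-1k rates configured above, the cost arithmetic is easy to verify by hand. This is the standard per-1k-token formula, shown for illustration rather than lifted from the framework source:

```java
public class CostMath {
    // Cost = tokens / 1000 * rate, summed over input and output sides.
    static double cost(int inputTokens, int outputTokens,
                       double inputCostPer1k, double outputCostPer1k) {
        return inputTokens / 1000.0 * inputCostPer1k
                + outputTokens / 1000.0 * outputCostPer1k;
    }

    public static void main(String[] args) {
        // 3,430 input + 1,800 output tokens at $0.01 / $0.03 per 1k
        System.out.printf("$%.4f%n", cost(3430, 1800, 0.01, 0.03));
    }
}
```

Useful for sanity-checking the totals reported in the post-run metrics.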
```java
// Post-run analysis
EnsembleMetrics metrics = output.getMetrics();
log.info("Total cost: ${}, tokens: {}, duration: {}ms",
        metrics.getTotalCost(), metrics.getTotalTokens(), output.getTotalDuration());
```

For production, dial it back:
```java
EnsembleOutput output = Ensemble.builder()
        .agents(researcher, analyst, writer)
        .tasks(researchTask, analysisTask, writeTask)
        .chatLanguageModel(model)
        // Production observability
        .captureMode(CaptureMode.OFF)
        .traceExporter(TraceExporter.json(Path.of("/var/log/agent-traces/")))
        .meterRegistry(prometheusMeterRegistry)
        .listener(productionEventHandler)
        .costConfiguration(costConfig)
        .build()
        .run();
```

The observability stack scales from “show me everything” during development to “show me what matters” in production. Same API, different configuration.
The Core Idea
Multi-agent systems are opaque by nature. An LLM call is a black box — you send a prompt, you get a response, and the reasoning happens inside the model. The only way to make agent systems debuggable is to capture and structure everything around those black-box calls: what went in, what came out, how long it took, and how it fits into the broader execution flow.
That’s what traces, capture mode, callbacks, and the live dashboard provide. Not transparency into the model, but transparency around it. And in practice, that’s enough to debug anything.
Get started:
- Documentation — guides, examples, and API reference
- Capture Mode Guide — deep execution recording
- Metrics Guide — Micrometer integration
- Live Dashboard Guide — real-time execution visualization
- Getting Started — up and running in 5 minutes
- GitHub — source, issues, and contributions
AgentEnsemble is MIT-licensed and available on GitHub.