# Agent Workflows on the JVM: Typed, Observable, and Composable
Three properties separate a toy agent framework from one you can actually ship:
- Typed — the compiler helps you, not just the runtime.
- Observable — you can see what happened, why, and how much it cost.
- Composable — the same building blocks produce fundamentally different architectures.
Most agent frameworks in the Python ecosystem nail one of these. Maybe two if you squint. Getting all three usually means stitching together multiple libraries with glue code.
On the JVM, we can do better. This post is a technical deep-dive into how AgentEnsemble achieves all three, and what that means for the architecture of your agent systems.
## Typed: The Compiler as a Collaborator

### Structured Output with Java Records
The most immediate benefit of type safety in agent systems is structured output. Instead of parsing strings or hoping the LLM returns valid JSON, you declare a Java record and let the framework handle serialization:
```java
record MarketAnalysis(
    String company,
    String sector,
    List<String> competitors,
    double marketCapBillions,
    String outlook
) {}

Task analysisTask = Task.builder()
    .description("Analyze the market position of {{company}}")
    .expectedOutput("A structured market analysis")
    .agent(analyst)
    .outputType(MarketAnalysis.class)
    .build();
```

When this task runs, the framework:
- Derives a JSON schema from the record’s structure.
- Instructs the LLM to return output conforming to that schema.
- Deserializes the response into a `MarketAnalysis` instance.
- Retries automatically if deserialization fails (configurable via `maxOutputRetries()`).
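The schema-derivation step can be pictured with plain reflection. The sketch below is not the framework's implementation, just an illustration of how record components can map to JSON schema types; the class name `SchemaSketch` and its helpers are invented for the example:

```java
import java.lang.reflect.RecordComponent;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: derive a flat property-name -> JSON-type map from a
// record via reflection. The real framework's derivation is richer (nested
// records, constraints, retries); this shows only the core idea.
public class SchemaSketch {
    record MarketAnalysis(String company, String sector,
                          List<String> competitors,
                          double marketCapBillions, String outlook) {}

    static Map<String, String> schemaOf(Class<?> recordType) {
        Map<String, String> props = new LinkedHashMap<>();
        // Record components come back in declaration order.
        for (RecordComponent c : recordType.getRecordComponents()) {
            props.put(c.getName(), jsonType(c.getType()));
        }
        return props;
    }

    static String jsonType(Class<?> t) {
        if (t == String.class) return "string";
        if (t == double.class || t == int.class || t == long.class) return "number";
        if (t == boolean.class) return "boolean";
        if (List.class.isAssignableFrom(t)) return "array";
        return "object";
    }

    public static void main(String[] args) {
        System.out.println(schemaOf(MarketAnalysis.class));
    }
}
```

A prompt builder can then serialize this map into the instruction sent to the LLM, and the same record class drives deserialization on the way back.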
The result is compile-time type checking on agent output:
```java
MarketAnalysis analysis = output.getTaskOutputs().get(0)
    .getStructuredOutput(MarketAnalysis.class);

// These are real typed fields, not dictionary lookups
String sector = analysis.sector();
List<String> competitors = analysis.competitors();
double marketCap = analysis.marketCapBillions();
```

No `(String) map.get("sector")`. No `JSONObject.getString()`. No runtime `ClassCastException` at 2 AM.
### Nested Types

Records can be nested, and the framework handles it:
```java
record CompetitorDetail(String name, double marketShare, String strategy) {}

record IndustryReport(
    String industry,
    List<CompetitorDetail> competitors,
    String trendSummary
) {}

Task reportTask = Task.builder()
    .description("Create an industry report for {{industry}}")
    .expectedOutput("Structured industry report")
    .agent(analyst)
    .outputType(IndustryReport.class)
    .build();
```

The LLM receives a schema that includes the nested `CompetitorDetail` structure, and the entire object graph is deserialized in one pass.
### Immutable Domain Objects

Beyond output typing, the framework’s own domain model is fully immutable. `Agent`, `Task`, `Ensemble`, `TaskOutput` — all built with the builder pattern, all immutable after construction:
```java
Agent agent = Agent.builder()
    .role("Analyst")
    .goal("Analyze data")
    .maxIterations(15)
    .build();

// Need a variant? Use toBuilder()
Agent verboseAgent = agent.toBuilder()
    .verbose(true)
    .build();
```

No mutable state to worry about across threads. No defensive copies. The builder gives you construction-time flexibility; immutability gives you runtime safety.
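For readers who want to see the shape of this pattern outside the framework, here is a minimal hand-rolled sketch of an immutable class with `builder()` and `toBuilder()`. The class and fields are invented for illustration; the framework's own builders carry many more fields:

```java
// Hypothetical sketch of the immutable builder-plus-toBuilder pattern:
// every field is final, and toBuilder() seeds a fresh builder with the
// current values so variants never mutate the original.
public final class AgentSketch {
    private final String role;
    private final boolean verbose;

    private AgentSketch(Builder b) {
        this.role = b.role;
        this.verbose = b.verbose;
    }

    public String role() { return role; }
    public boolean verbose() { return verbose; }

    public static Builder builder() { return new Builder(); }

    // Copy current state into a new builder for derived variants.
    public Builder toBuilder() {
        Builder b = new Builder();
        b.role = role;
        b.verbose = verbose;
        return b;
    }

    public static final class Builder {
        private String role;
        private boolean verbose;

        public Builder role(String role) { this.role = role; return this; }
        public Builder verbose(boolean verbose) { this.verbose = verbose; return this; }
        public AgentSketch build() { return new AgentSketch(this); }
    }
}
```

Because instances are immutable, they can be shared freely across threads and reused as templates; `toBuilder()` is the only sanctioned way to produce a modified copy.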
## Observable: See Everything

Observability in agent systems isn’t a single feature — it’s a stack. AgentEnsemble provides five layers, each useful at a different stage of development and operation.
### Layer 1: Event Callbacks (Development and Real-Time)

The callback system fires typed events at every significant point in execution:
```java
Ensemble.builder()
    // ...
    .listener(event -> {
        switch (event) {
            case TaskStartEvent e ->
                log.info("[START] {}", e.taskDescription());
            case TaskCompleteEvent e ->
                log.info("[DONE] {} -- {}ms, {} tokens",
                    e.taskDescription(), e.durationMs(), e.tokenCount());
            case ToolCallEvent e ->
                log.info("[TOOL] {} called by {}", e.toolName(), e.agentRole());
            case DelegationStartedEvent e ->
                log.info("[DELEGATE] {} -> {}", e.fromAgent(), e.toAgent());
            case TokenEvent e ->
                // streaming token-by-token output
                System.out.print(e.token());
            default -> {}
        }
    })
    .build()
    .run();
```

Events are sealed types. The compiler tells you what’s available. Multiple listeners are supported. Listener exceptions are caught and logged — they never break the ensemble.
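A stripped-down sketch shows why sealed event hierarchies pay off: when a switch covers every permitted subtype, it needs no `default` branch, so adding a new event type to the hierarchy becomes a compile error in every handler that does not yet cover it. The event records below are simplified, hypothetical stand-ins for the framework's real ones:

```java
// Hypothetical sketch of a sealed event hierarchy with compiler-checked
// exhaustive handling (requires Java 21+ pattern matching for switch).
public class EventSketch {
    sealed interface EnsembleEvent
            permits TaskStartEvent, TaskCompleteEvent, ToolCallEvent {}

    record TaskStartEvent(String taskDescription) implements EnsembleEvent {}
    record TaskCompleteEvent(String taskDescription, long durationMs) implements EnsembleEvent {}
    record ToolCallEvent(String toolName, String agentRole) implements EnsembleEvent {}

    static String describe(EnsembleEvent event) {
        // No default branch: the compiler verifies all permitted types are covered.
        return switch (event) {
            case TaskStartEvent e    -> "[START] " + e.taskDescription();
            case TaskCompleteEvent e -> "[DONE] " + e.taskDescription()
                                        + " in " + e.durationMs() + "ms";
            case ToolCallEvent e     -> "[TOOL] " + e.toolName() + " by " + e.agentRole();
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(new ToolCallEvent("WebSearch", "Analyst")));
    }
}
```

This is the property the `default -> {}` arm in the listener example deliberately opts out of: a listener that only cares about a few events keeps a default, while one that must handle everything can drop it and let the compiler enforce coverage.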
### Layer 2: Metrics (Production Monitoring)

The Micrometer integration publishes standard metrics to whatever backend you already use — Prometheus, Datadog, CloudWatch:
```java
Ensemble.builder()
    // ...
    .meterRegistry(registry)
    .costConfiguration(CostConfiguration.builder()
        .inputTokenCostPer1k(0.01)
        .outputTokenCostPer1k(0.03)
        .build())
    .build()
    .run();
```

Published metrics include:
| Metric | Type | Tags |
|---|---|---|
| `ensemble.task.duration` | Timer | agent, task |
| `ensemble.task.tokens` | Counter | agent, task, direction (input/output) |
| `ensemble.task.completions` | Counter | agent, task, status |
| `ensemble.tool.calls` | Counter | tool, agent |
| `ensemble.run.cost` | Gauge | — |
| `ensemble.run.duration` | Timer | workflow |
These flow straight into your existing Grafana dashboards. No new monitoring infrastructure required.
### Layer 3: Structured Traces (Post-Mortem Analysis)

For detailed post-mortem analysis, export full execution traces:
```java
Ensemble.builder()
    // ...
    .traceExporter(TraceExporter.json(Path.of("traces/")))
    .build()
    .run();
```

Each trace is a tree of spans:
```
Ensemble Run
+-- Task: Market Research (3,240ms, 1,847 tokens)
|   +-- LLM Call #1 (2,100ms)
|   +-- Tool: WebSearch (890ms)
|   +-- LLM Call #2 (250ms)
+-- Task: Write Report (2,100ms, 1,203 tokens)
    +-- LLM Call #1 (2,100ms)
```

Spans include duration, token counts, input/output payloads, and tool call details. The JSON format is structured for programmatic analysis — feed it into your log aggregation pipeline or write custom analysis scripts.
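As a rough illustration of consuming exported traces programmatically, here is a sketch that renders a span tree in the style shown above. `Span` is a hypothetical stand-in for whatever your JSON deserialization produces; the real trace format carries more fields (tokens, payloads, tool details):

```java
import java.util.List;

// Hypothetical sketch: a recursive span record plus an indenting renderer
// that prints a trace tree similar to the one in the post.
public class SpanSketch {
    record Span(String name, long durationMs, List<Span> children) {}

    static String render(Span span, String indent) {
        StringBuilder sb = new StringBuilder();
        sb.append(indent).append("+-- ").append(span.name())
          .append(" (").append(span.durationMs()).append("ms)\n");
        for (Span child : span.children()) {
            sb.append(render(child, indent + "    "));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Span task = new Span("Task: Market Research", 3240, List.of(
            new Span("LLM Call #1", 2100, List.of()),
            new Span("Tool: WebSearch", 890, List.of()),
            new Span("LLM Call #2", 250, List.of())));
        System.out.print(render(task, ""));
    }
}
```

The same traversal works for aggregation (summing durations per span name, flagging slow tool calls) once the JSON is loaded into a structure like this.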
### Layer 4: Capture Mode (Deep Debugging)

When you need to see everything — the full prompt sent to the LLM, the raw response, every tool call input and output:
```java
Ensemble.builder()
    // ...
    .captureMode(CaptureMode.FULL)
    .build()
    .run();
```

Three levels:
- `OFF` — standard execution, no extra data collection.
- `STANDARD` — adds full LLM message history per iteration and memory operations.
- `FULL` — adds tool call input/output payloads, raw LLM responses, and detailed timing.
`FULL` mode is for development and debugging. `STANDARD` is for staging. `OFF` is for production (unless you’re investigating an issue).
### Layer 5: Live Dashboard (Development)

During development, launch a live browser dashboard that shows execution progress in real time:
```java
Ensemble.builder()
    // ...
    .devtools(Devtools.enabled())
    .build()
    .run();
```

The dashboard shows a DAG visualization of tasks, real-time progress indicators, agent activity, and token consumption. It’s a development tool, not a production dashboard — but it makes building and debugging agent workflows dramatically faster.
## Composable: Same Blocks, Different Buildings

The real test of a framework’s composability is whether fundamentally different architectures emerge from the same primitives.
### The Primitives

AgentEnsemble has three core primitives:
- Agent — an AI entity with a role, goal, and optional tools.
- Task — a unit of work with a description, expected output, and optional dependencies.
- Ensemble — a group of agents executing a set of tasks.
That’s it. Everything else is a composition of these three.
### Sequential Pipelines

Tasks with linear dependencies:
```java
Task research = Task.builder()/* ... */.build();
Task write = Task.builder()/* ... */.context(List.of(research)).build();
Task edit = Task.builder()/* ... */.context(List.of(write)).build();
```

The framework sees a chain and executes sequentially.
### Parallel DAGs

Tasks with branching dependencies:
```java
Task taskA = Task.builder()/* ... */.build();
Task taskB = Task.builder()/* ... */.build();
Task taskC = Task.builder()/* ... */
    .context(List.of(taskA, taskB)).build();
```

The framework sees that A and B are independent, runs them concurrently, then runs C when both complete.
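The scheduling behavior can be pictured with plain `java.util.concurrent`. This is an analogy, not the framework's actual scheduler: A and B run as independent futures, and C's computation starts only once both have completed:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative fan-out/fan-in sketch: two independent computations run
// concurrently, and a third combines their results when both finish.
public class DagSketch {

    static String run() {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletableFuture<String> a =
                CompletableFuture.supplyAsync(() -> "A-result", pool);
            CompletableFuture<String> b =
                CompletableFuture.supplyAsync(() -> "B-result", pool);
            // C consumes both upstream outputs, mirroring context(List.of(taskA, taskB)).
            return a.thenCombine(b, (ra, rb) -> "C(" + ra + ", " + rb + ")").join();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The `thenCombine` call is the fan-in point: it encodes exactly the dependency edge that `.context(List.of(taskA, taskB))` declares at the framework level.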
### Hierarchical Delegation

Tasks without assigned agents, plus a manager model:
```java
Ensemble.builder()
    .agents(worker1, worker2, worker3)
    .tasks(unassignedTask)
    .workflow(Workflow.HIERARCHICAL)
    .chatLanguageModel(managerModel)
    .build()
    .run();
```

The manager agent delegates to workers. The same agent and task primitives, different execution semantics.
### MapReduce

For data-parallel workloads, the MapReduceEnsemble composes the same primitives into a fan-out/fan-in pattern:
```java
MapReduceEnsemble.<String, String>builder()
    .items(List.of("Q1 Report", "Q2 Report", "Q3 Report", "Q4 Report"))
    .mapAgentFactory(item -> Agent.builder()
        .role("Analyst for " + item)
        .goal("Analyze " + item)
        .build())
    .mapTaskFactory((item, agent) -> Task.builder()
        .description("Analyze " + item)
        .expectedOutput("Analysis of " + item)
        .agent(agent)
        .build())
    .reduceAgent(Agent.builder()
        .role("Senior Analyst")
        .goal("Synthesize all quarterly analyses")
        .build())
    .reduceTaskFactory((results, agent) -> Task.builder()
        .description("Combine all quarterly analyses into an annual summary")
        .expectedOutput("Annual summary report")
        .agent(agent)
        .build())
    .chatLanguageModel(model)
    .build()
    .run();
```

Each item gets its own agent and task. Map tasks run in parallel. The reduce task collects all results. The same `Agent.builder()` and `Task.builder()` APIs, composed into a completely different execution pattern.
### Dynamic Ensembles

Agents and tasks can be created at runtime based on input data:
```java
List<String> topics = List.of("AI", "Quantum Computing", "Biotech");

List<Agent> agents = new ArrayList<>();
List<Task> tasks = new ArrayList<>();

for (String topic : topics) {
    Agent specialist = Agent.builder()
        .role(topic + " Specialist")
        .goal("Provide expert analysis of " + topic)
        .build();
    agents.add(specialist);

    tasks.add(Task.builder()
        .description("Analyze recent developments in " + topic)
        .expectedOutput("Expert analysis")
        .agent(specialist)
        .build());
}

// Add a synthesis task that depends on all specialist tasks
Agent synthesizer = Agent.builder()
    .role("Chief Analyst")
    .goal("Synthesize all specialist analyses")
    .build();
agents.add(synthesizer);

tasks.add(Task.builder()
    .description("Create a unified report from all analyses")
    .expectedOutput("Comprehensive cross-domain report")
    .agent(synthesizer)
    .context(List.copyOf(tasks)) // snapshot: depends on all previous tasks, not itself
    .build());

Ensemble.builder()
    .agents(agents)
    .tasks(tasks)
    .chatLanguageModel(model)
    .build()
    .run();
```

The number of agents and tasks is determined by the input data. The dependency graph is constructed dynamically. The framework handles the rest. Note the `List.copyOf(tasks)` in the synthesis task: passing the live `tasks` list would hand the builder a collection that later gains the synthesis task itself, so we snapshot the dependencies first.
### The Architecture Insight

What makes all of this work is a simple design decision: `context()` declarations are the source of truth for execution order.
When you call `.context(List.of(taskA, taskB))` on a task, you’re declaring a dependency edge in a directed acyclic graph. The framework builds this graph at ensemble construction time, performs a topological sort, and schedules execution accordingly.
This means:
- Workflow inference works because the graph already encodes the execution strategy.
- Parallel scheduling works because independent subgraphs can be identified and executed concurrently.
- Task output passing works because the framework knows which outputs to inject as context for downstream tasks.
- Error propagation works because the graph defines the blast radius of a failure.
The graph is the architecture. Everything else follows from it.
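The whole insight fits in a few lines of plain Java. The sketch below uses simplified names, not the framework's internals: it builds indegrees from each task's `context()` list and runs Kahn's topological sort to produce an execution order (and detect cycles):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: context() declarations as DAG edges, scheduled via
// Kahn's algorithm. Tasks whose indegree reaches zero become ready to run.
public class TopoSketch {
    record Task(String name, List<Task> context) {
        Task(String name) { this(name, List.of()); }
    }

    static List<String> executionOrder(List<Task> tasks) {
        Map<Task, Integer> indegree = new LinkedHashMap<>();
        Map<Task, List<Task>> dependents = new LinkedHashMap<>();
        for (Task t : tasks) {
            indegree.put(t, t.context().size());
        }
        for (Task t : tasks) {
            for (Task dep : t.context()) {
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(t);
            }
        }

        Deque<Task> ready = new ArrayDeque<>();
        indegree.forEach((t, d) -> { if (d == 0) ready.add(t); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            Task t = ready.poll();
            order.add(t.name());
            for (Task next : dependents.getOrDefault(t, List.of())) {
                if (indegree.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != tasks.size()) {
            throw new IllegalStateException("cycle in task graph");
        }
        return order;
    }

    public static void main(String[] args) {
        Task research = new Task("research");
        Task write = new Task("write", List.of(research));
        Task edit = new Task("edit", List.of(write));
        System.out.println(executionOrder(List.of(edit, write, research)));
    }
}
```

Everything the post lists above falls out of this structure: tasks with indegree zero at the same moment are the ones a scheduler may run in parallel, and the `dependents` map is exactly where output injection and failure propagation hook in.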
## Bringing It Together

Agent orchestration on the JVM doesn’t have to mean trading away the properties that make Java effective for production systems. Type safety, observability, and composability aren’t luxuries — they’re the foundations that let you ship with confidence.
AgentEnsemble gives you all three from the same set of builders, the same primitives, the same dependency model. Whether you’re building a two-agent pipeline or a dynamic MapReduce ensemble, the API is consistent and the properties hold.
Get started:
- Documentation — guides, examples, and API reference
- Getting Started — up and running in 5 minutes
- Examples — runnable code for every pattern
- GitHub — source, issues, and contributions
AgentEnsemble is MIT-licensed and available on GitHub.