
24 - Ensemble Network: Distributed Multi-Ensemble Orchestration

This document specifies the design for the Ensemble Network: a distributed architecture where autonomous, long-running ensembles communicate peer-to-peer, share tasks and tools across service boundaries, and allow humans to participate as optional observers and decision-makers.

This is the v3.0.0 architecture. It builds on the v2.1.0 agentensemble-web module (WebSocket server, wire protocol, live dashboard) and extends it into a fully distributed multi-ensemble system.


The limitation of single-ensemble execution


AgentEnsemble v2.x treats each ensemble as a self-contained unit: define tasks, run them, get output. This works well for discrete, bounded problems — “research this topic and write a report.” But real-world AI systems are not discrete. They are:

  • Always-on: running continuously, handling work as it arrives
  • Multi-domain: different capabilities owned by different teams/services
  • Decentralized: departments that communicate laterally, not through a central controller
  • Human-augmented: people who observe, direct, and make critical decisions — but who also go home at night while the system keeps running

Consider a hotel. It is composed of departments: front desk, housekeeping, kitchen, room service, maintenance, procurement, accounting. Each department is autonomous — it has its own staff, its own processes, its own expertise. The departments communicate with each other directly: room service calls the kitchen to prepare a meal, maintenance calls procurement to order spare parts.

The hotel runs 24/7/365. Humans — the manager, the receptionist, the bell staff — come and go. When the manager leaves for the night, the hotel does not stop. It keeps running. When the manager arrives in the morning, they observe the current state, give direction where needed, and handle decisions that require their authority (like opening the safe).

This is the model AgentEnsemble v3.0.0 implements:

| Hotel | AgentEnsemble |
| --- | --- |
| A department (kitchen, maintenance) | An Ensemble — long-running, autonomous |
| Staff within a department | Agents and Tasks within the ensemble |
| The intercom / phone system | WebSocket mesh — the message transport |
| A guest request or work order | A WorkRequest — the standard message envelope |
| The hotel directory | Service registry — ensembles discover each other |
| A duty manager | A human who connects via the dashboard to observe and intervene |
| The shared guest ledger | Shared memory — cross-ensemble state |
| The hotel chain | A federation — multiple realms sharing capacity |

Existing multi-agent frameworks (CrewAI, AutoGen, LangGraph) and protocols (MCP) operate within a single process boundary. MCP provides tool-level interoperability (call a function, get a result). AgentEnsemble Network provides ensemble-level interoperability: one ensemble delegates a complex, multi-step task to another ensemble, which runs it with its own agents, tools, memory, and review gates. The delegating ensemble is the beneficiary of the output — it does not need to know or care about the internal process.

This is the difference between “borrow a tool” and “hire a department.”


Browser (agentensemble-viz /network route)
        |
        | WebSocket (human portal)
        v
+--------------+   +--------------+   +--------------+   +--------------+
|  Ensemble A  |<->|  Ensemble B  |<->|  Ensemble C  |<->|  Ensemble D  |
|  (kitchen)   |   |  (room-svc)  |   | (maintenance)|   | (procurement)|
+--------------+   +--------------+   +--------------+   +--------------+
       |                  |                  |                  |
       +------ Durable Queue / Topic (Kafka, Redis Streams) ----+
       |                  |                  |                  |
       +---------- Shared Result Store (Redis) -----------------+
       |                  |                  |                  |
       +---------- Shared Memory Scopes (MemoryStore SPI) ------+

Each ensemble is deployed as a Kubernetes Service (one or more pods). They discover each other via K8s DNS. Communication flows over WebSocket for real-time events and over durable queues for reliable work delivery. Shared state lives in external stores.

  1. Ensembles — autonomous, always running. Handle their domain. Communicate with peers.
  2. Humans — come and go. Observe, direct, query, approve gated decisions.
  3. External systems — submit work via HTTP API, queue, or webhook. Consume results.

All three interact through the same WorkRequest envelope and wire protocol.


EnsembleOutput output = Ensemble.run(model,
    Task.of("Research AI trends"),
    Task.of("Write a report"));

Tasks execute, output is returned, the ensemble is done. Unchanged from v2.x.

Long-running (new v3.0 — a “residency”)

Ensemble kitchen = Ensemble.builder()
    .name("kitchen")
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    // Share capabilities to the network
    .shareTask("prepare-meal", Task.builder()
        .description("Prepare a meal as specified")
        .expectedOutput("Confirmation with preparation details and timing")
        .build())
    .shareTool("check-inventory", inventoryTool)
    .shareTool("dietary-check", allergyCheckTool)
    // Scheduled proactive task
    .scheduledTask(ScheduledTask.builder()
        .name("inventory-report")
        .task(Task.of("Check current inventory levels and report shortages"))
        .schedule(Schedule.every(Duration.ofHours(1)))
        .broadcastTo("hotel.inventory")
        .build())
    .build();

kitchen.start(7329); // WebSocket server, K8s Service fronts this

In long-running mode, the ensemble:

  • Registers its shared tasks and tools on the network
  • Accepts incoming WorkRequests (via WebSocket, queue, HTTP, or topic subscription)
  • Processes work through its priority queue
  • Delivers results via the caller-specified delivery method
  • Runs scheduled proactive tasks on their configured intervals
  • Continues until explicitly stopped or drained

Share a Task: “Kitchen, make a club sandwich”


An ensemble exposes a named task that other ensembles can trigger. The target ensemble runs the full task with its own agents, tools, and context. The caller hands off the work and gets back a result.

// Kitchen shares the "prepare-meal" task
Ensemble kitchen = Ensemble.builder()
    .name("kitchen")
    .shareTask("prepare-meal", Task.builder()
        .description("Prepare a meal as specified")
        .expectedOutput("Confirmation with prep time and details")
        .build())
    .build();

Share a Tool: “Can I borrow your meat thermometer?”


An ensemble exposes a specific tool that other ensembles’ agents can call directly in their ReAct loop. The tool executes in the owning ensemble’s context but returns results to the calling agent.

// Kitchen shares the "check-inventory" tool
Ensemble kitchen = Ensemble.builder()
    .name("kitchen")
    .shareTool("check-inventory", inventoryTool)
    .build();

// Room service uses kitchen's shared task and tool
Ensemble roomService = Ensemble.builder()
    .name("room-service")
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle guest room service request")
        .tools(
            // Delegates the full "prepare-meal" task to kitchen
            NetworkTask.from("kitchen", "prepare-meal"),
            // Uses kitchen's inventory tool directly in the ReAct loop
            NetworkTool.from("kitchen", "check-inventory"),
            NetworkTool.from("kitchen", "dietary-check"),
            // Delegates to maintenance
            NetworkTask.from("maintenance", "repair-request"))
        .build())
    .build();

Both NetworkTask and NetworkTool implement the existing AgentTool interface. An agent does not know whether a tool is local or remote. The existing ReAct loop, tool executor, metrics, and tracing all work unchanged.

NetworkTool (synchronous tool call):

  1. Agent calls check-inventory("wagyu beef")
  2. NetworkTool serializes the call into a WorkRequest
  3. Request is sent to kitchen (WebSocket or queue)
  4. Kitchen executes inventoryTool.execute("wagyu beef") locally
  5. Result flows back: "Yes, 3 portions available"
  6. Agent continues its ReAct loop

NetworkTask (cross-ensemble delegation):

  1. Agent calls prepare-meal("Wagyu steak, medium-rare, room 403")
  2. NetworkTask serializes a WorkRequest with the full task context
  3. Request is sent to kitchen
  4. Kitchen runs its complete task pipeline (agent synthesis, execution, review gates)
  5. Result flows back: "Preparing now, estimated 25 minutes, ticket #4071"
  6. Agent continues
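The flow above hinges on one idea: a remote capability is wrapped behind the same tool interface the agent already uses. The following is a minimal sketch of that wrapping — the two-method `AgentTool` shape and the `UnaryOperator<String>` transport are stand-ins invented for illustration, not the framework's actual API.

```java
import java.util.function.UnaryOperator;

// Sketch only: the real AgentTool interface and wire transport live in
// agentensemble-core / agentensemble-network; here a UnaryOperator<String>
// stands in for the network round-trip (WebSocket or queue).
interface AgentTool {
    String name();
    String execute(String input);
}

final class NetworkToolSketch implements AgentTool {
    private final String ensemble;   // owning ensemble, e.g. "kitchen"
    private final String toolName;   // shared tool name, e.g. "check-inventory"
    private final UnaryOperator<String> transport; // serializes + sends a WorkRequest

    NetworkToolSketch(String ensemble, String toolName, UnaryOperator<String> transport) {
        this.ensemble = ensemble;
        this.toolName = toolName;
        this.transport = transport;
    }

    @Override public String name() { return toolName; }

    // The calling agent sees an ordinary tool call; the round-trip is hidden here.
    @Override public String execute(String input) {
        String workRequest = "{\"task\":\"" + toolName + "\",\"to\":\"" + ensemble
                + "\",\"context\":\"" + input + "\"}";
        return transport.apply(workRequest); // blocks until the result flows back
    }
}
```

Because the agent only ever sees `AgentTool`, swapping a local tool for a networked one requires no change to the ReAct loop.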

Every cross-ensemble message uses a standardized envelope:

public record WorkRequest(
    String requestId,        // Correlation + idempotency key
    String from,             // Requesting ensemble name
    String task,             // Shared task or tool name to execute
    String context,          // Natural language input/context
    Priority priority,       // CRITICAL / HIGH / NORMAL / LOW
    Duration deadline,       // Caller's SLA ("I need this within...")
    DeliverySpec delivery,   // How and where to return the result
    String traceContext,     // W3C traceparent for distributed tracing
    CachePolicy cachePolicy, // USE_CACHED / FORCE_FRESH
    String cacheKey          // Optional, for result caching
) {}

public record DeliverySpec(
    DeliveryMethod method,   // WEBSOCKET / QUEUE / TOPIC / WEBHOOK / STORE / BROADCAST_CLAIM / NONE
    String address           // Method-specific address
) {}
| Method | Address | Behavior |
| --- | --- | --- |
| WEBSOCKET | ws://maintenance:7329/ws | Direct, real-time |
| QUEUE | maintenance.results | Durable point-to-point (Redis Streams, SQS) |
| TOPIC | maintenance.results | Durable pub/sub (Kafka); multiple consumers |
| WEBHOOK | https://maintenance.internal/callback | HTTP POST |
| STORE | Key in shared result store | Write to store; requester polls/subscribes |
| BROADCAST_CLAIM | Service name | Offer to all replicas; first to claim receives payload |
| NONE | — | Fire and forget |

Work can arrive at an ensemble from multiple sources simultaneously:

| Ingress | Description |
| --- | --- |
| WebSocket | Direct from another ensemble (real-time) |
| Queue | Pull from durable queue (Kafka, SQS, Redis Streams) |
| HTTP API | POST /api/work (external systems, scripts, CI pipelines) |
| Topic subscription | React to events from other ensembles |
| Schedule | Internal cron/interval (proactive tasks) |

All ingress sources normalize to the same WorkRequest envelope before entering the ensemble’s priority queue.
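A sketch of that normalization step, assuming a subset of the WorkRequest fields from this document and standing in a `Map<String, String>` for whatever the real HTTP or queue adapter parses (the record and class names here are hypothetical):

```java
import java.time.Duration;
import java.util.Map;
import java.util.UUID;

// A subset of the WorkRequest envelope, enough to show normalization.
record WorkRequestSketch(String requestId, String from, String task,
                         String context, String priority, Duration deadline) {}

final class Ingress {
    // Every ingress adapter maps its raw payload onto the same envelope,
    // filling defaults for fields the source did not provide.
    static WorkRequestSketch normalize(String source, Map<String, String> payload) {
        return new WorkRequestSketch(
            payload.getOrDefault("requestId", UUID.randomUUID().toString()),
            payload.getOrDefault("from", source),       // default sender to the ingress name
            payload.get("task"),
            payload.getOrDefault("context", ""),
            payload.getOrDefault("priority", "NORMAL"), // NORMAL unless stated
            Duration.parse(payload.getOrDefault("deadline", "PT1H"))); // assumed default SLA
    }
}
```

After this point the priority queue, idempotency check, and delivery logic never need to know which door the work came in through.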


| Mode | Trigger | Output destination |
| --- | --- | --- |
| Shared (reactive) | External WorkRequest | Response to requester via delivery spec |
| Scheduled (proactive) | Cron/interval | Broadcast to topic |
| Internal (private) | Part of the ensemble’s own workflow | Internal state |

| Mode | Behavior | Use case |
| --- | --- | --- |
| Await | Block until result (like current delegation) | Critical path: “Can’t continue without the parts” |
| Async | Submit and continue; result delivered later via callback | Non-critical: “Order towels when you get to it” |
| Await with deadline | Wait up to N; then continue with partial/no result | Balanced: “Wait 30 min, then proceed with what I know” |

7. “Bend, Don’t Break” — Capacity Management


LLM tasks are not real-time request/response. They take seconds to hours. Everyone expects latency. The default response to load is accept and queue, not reject:

Request arrives
  -> Is this ensemble alive?
       No  -> route to alternative (federation) or queue at network level
       Yes -> Accept into priority queue
              -> ACK with queue position + estimated completion time
              -> Process when capacity is available
              -> Deliver result when done

Rejection only happens at hard limits (queue itself is full — the hotel is physically out of rooms).

The limit is set by the requester, not the provider:

Request:

{
  "type": "task_request",
  "requestId": "maint-7721",
  "task": "purchase-parts",
  "deadline": "PT30M",
  "priority": "HIGH"
}

Acknowledgement:

{
  "type": "task_accepted",
  "requestId": "maint-7721",
  "queuePosition": 7,
  "estimatedCompletion": "PT45M"
}

When ETA exceeds the caller’s deadline, the caller decides: accept the longer wait, cancel and try another provider (federation), or continue without.

Requests are processed by priority (CRITICAL > HIGH > NORMAL > LOW). Within the same priority, FIFO. Low-priority items age over time to prevent starvation (configurable). Humans on the dashboard can re-prioritize individual requests.
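The ordering rule above can be sketched with a standard `PriorityQueue`. This is a minimal illustration, not the framework's queue: the aging rate (one priority level per 10 minutes of waiting) is an assumed configurable value, and for simplicity the comparator snapshots "now" at construction, where a real queue would recompute effective priority at poll time.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class WorkQueueSketch {
    enum Priority { LOW, NORMAL, HIGH, CRITICAL }

    record Queued(Priority base, long enqueuedAtMillis, long seq, String requestId) {
        // Effective priority climbs one level per 10 minutes of waiting,
        // capped at CRITICAL, so LOW items cannot starve forever.
        int effective(long nowMillis) {
            int aged = (int) ((nowMillis - enqueuedAtMillis) / 600_000);
            return Math.min(Priority.CRITICAL.ordinal(), base.ordinal() + aged);
        }
    }

    static PriorityQueue<Queued> queue(long nowMillis) {
        return new PriorityQueue<>(
            Comparator.<Queued>comparingInt(q -> q.effective(nowMillis)).reversed()
                      .thenComparingLong(Queued::seq)); // FIFO within equal priority
    }
}
```

With this rule, a NORMAL request that has waited 30 minutes polls ahead of a freshly enqueued HIGH request.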

NetworkProfile sportingEvent = NetworkProfile.builder()
    .name("sporting-event-weekend")
    .ensemble("front-desk", Capacity.replicas(4).maxConcurrent(50))
    .ensemble("kitchen", Capacity.replicas(3).maxConcurrent(100))
    .ensemble("room-service", Capacity.replicas(3).maxConcurrent(80))
    .ensemble("maintenance", Capacity.replicas(1).maxConcurrent(10))
    .preload("kitchen", "inventory", "Extra beer and ice stocked")
    .build();

network.applyProfile(sportingEvent);

Profiles define expected capacity needs for known events. They set K8s HPA targets, activate dormant ensembles, and pre-load shared memory. Profiles can be applied manually, on a schedule, or via rules.


The system runs autonomously. Humans connect when they want, observe, give direction, and disconnect. The system does not depend on them.

| Level | Hotel example | Behavior |
| --- | --- | --- |
| Autonomous | Housekeeping cleans after checkout | No human needed |
| Advisory | Manager says “prioritize VIP” | Human input welcomed but not required |
| Notifiable | “Water leak in 305” | Alert a human, proceed with best-effort |
| Approvable | Guest requests late checkout | Ask human if available, auto-approve on timeout |
| Gated | Opening the safe | Cannot proceed without human authorization |

Some processes require specific human authorization:

Task openSafe = Task.builder()
    .description("Open the hotel safe for cash reconciliation")
    .review(Review.builder()
        .prompt("Manager authorization required to open the safe")
        .requiredRole("manager")
        .timeout(Duration.ZERO) // no timeout -- wait until a human decides
        .build())
    .build();

When a gated review fires and no qualified human is connected:

  1. The review is queued
  2. An optional out-of-band notification is sent (Slack, email, webhook)
  3. The task waits
  4. When a qualified human connects, they see the pending review immediately
  5. They approve (or reject), and the task resumes
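The two waiting behaviors — Approvable (auto-approve on timeout) and Gated (wait indefinitely) — can be sketched with a `CompletableFuture`. This is an assumed illustration of the semantics, not the framework's review implementation; the class name is hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class ReviewGateSketch {
    final CompletableFuture<Boolean> decision = new CompletableFuture<>();

    void approve() { decision.complete(true); }
    void reject()  { decision.complete(false); }

    // Approvable: wait up to timeoutMillis for a human; then proceed with
    // the default (auto-approve) if nobody answered.
    boolean awaitApprovable(long timeoutMillis) {
        return decision.completeOnTimeout(true, timeoutMillis, TimeUnit.MILLISECONDS).join();
    }

    // Gated: block until a qualified human decides; there is no default.
    boolean awaitGated() { return decision.join(); }
}
```

The queued-review flow above is then just a registry of pending `ReviewGateSketch` instances that a connecting human can complete.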

Humans can inject guidance into any ensemble they have access to:

{
  "type": "directive",
  "to": "room-service",
  "from": "manager:human",
  "content": "Guest in 801 is VIP, prioritize all their requests"
}

Directives are non-blocking. They are injected as additional context for future task executions.

Humans (or automated policies) can send control plane directives to change ensemble behavior at runtime:

{
  "type": "directive",
  "to": "kitchen",
  "from": "cost-policy:automated",
  "action": "SET_MODEL_TIER",
  "value": "FALLBACK"
}

This switches the ensemble to a cheaper LLM model without restarting. The ensemble has configurable model tiers:

Ensemble.builder()
    .chatLanguageModel(gpt4)   // primary
    .fallbackModel(gpt4Mini)   // cheaper fallback
    .build();

The existing late-join snapshot mechanism (v2.1.0 hello + snapshotTrace) extends to the network level. When a human connects to the dashboard, they receive the current state of all ensembles they have access to. Live events start streaming immediately.


Each ensemble is a K8s Deployment + Service. They find each other by DNS name:

Namespace: hotel-downtown
+-- Service: kitchen (ws://kitchen:7329/ws)
+-- Service: room-service (ws://room-service:7329/ws)
+-- Service: maintenance (ws://maintenance:7329/ws)
+-- Service: front-desk (ws://front-desk:7329/ws)
+-- Service: dashboard (http://dashboard:7400)

K8s provides DNS, health checks, load balancing, and namespace-based compartmentalization. The framework does not build a custom service registry.

When an ensemble starts, it publishes its shared capabilities:

{
  "type": "ensemble_register",
  "name": "kitchen",
  "capabilities": {
    "sharedTasks": [
      { "name": "prepare-meal", "description": "Prepare a meal as specified" }
    ],
    "sharedTools": [
      { "name": "check-inventory", "description": "Check ingredient availability" },
      { "name": "dietary-check", "description": "Verify allergen safety" }
    ]
  }
}

Ensembles can discover capabilities by name or by semantic description:

// By name
NetworkTool.from("kitchen", "check-inventory")

// By capability query -- find whoever provides it
NetworkTool.discover("check-inventory")

// Dynamic catalog -- resolve at execution time
Task.builder()
    .tools(NetworkToolCatalog.all())          // all tools on the network
    .tools(NetworkToolCatalog.tagged("food")) // filtered by tag
    .build();

NetworkToolCatalog.all() resolves at task execution time, not build time. A new ensemble comes online and its tools are immediately available to every agent on the network.

Multiple K8s clusters (or namespaces) form a federation. Each is a realm — a trust and discovery boundary.

Federation: "Hotel Chain"
+-- Realm: hotel-downtown (K8s namespace)
+-- Realm: hotel-airport (K8s namespace, same or different cluster)
+-- Realm: hotel-beach (K8s namespace, different region)

Within a federation, ensembles can discover and use capabilities from other realms. This enables elastic capacity sharing: when Hotel A’s kitchen is at capacity during a conference, it can route overflow to Hotel B’s kitchen.

Capacity advertisement:

{
  "type": "capacity_update",
  "ensemble": "kitchen",
  "realm": "hotel-airport",
  "status": "available",
  "currentLoad": 0.2,
  "maxConcurrent": 10,
  "shareable": true
}

shareable: true means this ensemble’s spare capacity is available to other realms.


W3C Trace Context in the wire protocol (mandatory)


Every cross-ensemble message carries trace context:

{
  "type": "task_request",
  "requestId": "maint-7721",
  "traceContext": {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "tracestate": "agentensemble=maintenance"
  }
}

This is always present regardless of whether the user has OpenTelemetry configured.
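Propagating that header across a hop means keeping the trace-id and minting a fresh parent-id (span-id), per the W3C Trace Context format `version-traceid-parentid-flags`. A minimal sketch (the class name is hypothetical; a real implementation would also validate the header and use a cryptographically sourced span-id):

```java
import java.util.concurrent.ThreadLocalRandom;

final class TraceContextSketch {
    // Input shape: "00-{32 hex trace-id}-{16 hex parent-id}-{2 hex flags}".
    // The child header keeps trace-id and flags but carries a new span-id,
    // so the receiving ensemble's spans attach to this hop.
    static String childOf(String traceparent) {
        String[] parts = traceparent.split("-");
        long spanId = ThreadLocalRandom.current().nextLong() | 1; // never all-zero
        return parts[0] + "-" + parts[1] + "-"
                + String.format("%016x", spanId) + "-" + parts[3];
    }
}
```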

OpenTelemetry integration (optional module)


agentensemble-telemetry-opentelemetry creates OTel spans at key points:

| Span | When |
| --- | --- |
| ensemble.run | Root span for an ensemble execution |
| task.execute | Per-task child span |
| llm.call | Per-LLM-interaction child span (with token count attributes) |
| tool.execute | Per-tool-call child span |
| network.delegate | CLIENT span when calling another ensemble |
| network.handle | SERVER span when receiving a cross-ensemble request |

Spans carry AgentEnsemble-specific attributes:

agentensemble.ensemble.name = "maintenance"
agentensemble.task.description = "Fix boiler in building 2"
agentensemble.agent.role = "Senior Maintenance Engineer"
agentensemble.delegation.target = "procurement"

The user deploys their choice of backend: Jaeger, Grafana Tempo, Zipkin, Datadog. The framework is backend-agnostic.

The existing ExecutionTrace gains:

  • traceId field linking to the distributed trace
  • Cross-ensemble DelegationTrace (extends the existing agent-level DelegationTrace)
  • parentTraceId on receiving ensemble’s trace

agentensemble-viz can show: “This trace was part of a cross-ensemble delegation from maintenance” with a link to the full distributed trace in the external trace viewer.

Each ensemble exposes Prometheus/Micrometer metrics for K8s HPA:

agentensemble_active_tasks{ensemble="front-desk"} 8
agentensemble_queued_requests{ensemble="front-desk"} 12
agentensemble_max_concurrent{ensemble="front-desk"} 10
agentensemble_capacity_utilization{ensemble="front-desk"} 0.95

Token counts are attributes on OTel spans and in ExecutionTrace. Aggregate cost is available via Micrometer gauges. Hard budgets are not enforced by the framework in v3.0.0; cost control is achieved via control plane directives that switch LLM model tiers.


Framework handles semantics, infrastructure handles transport

| Concern | Owner |
| --- | --- |
| Timeouts (per cross-ensemble call) | Framework |
| Retry policies (transient vs business error distinction) | Framework |
| Circuit breakers (per remote ensemble) | Framework |
| Fallback strategies (alternative provider, degraded response) | Framework |
| TLS, connection pooling, load balancing | Infrastructure (K8s, Istio) |
| Health checks, auto-scaling | Infrastructure (K8s HPA) |

NetworkTask.from("procurement", "purchase-parts")
    .timeout(Duration.ofMinutes(30))
    .connectTimeout(Duration.ofSeconds(10))

NetworkTask.from("procurement", "purchase-parts")
    .retryPolicy(RetryPolicy.builder()
        .maxAttempts(3)
        .backoff(Duration.ofSeconds(5), Duration.ofMinutes(1))
        .retryOn(ConnectionFailure.class, TimeoutException.class)
        .noRetryOn(TaskFailureResponse.class)
        .build())

Retry transient failures (connection lost, timeout). Do not retry business errors (procurement says “no vendors available”).
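The retry rule can be sketched as follows — exception class names here are illustrative stand-ins, and the exponential backoff (doubling per attempt) mirrors the builder's `backoff(initial, max)` idea without the cap:

```java
import java.time.Duration;
import java.util.function.Supplier;

final class RetrySketch {
    static class TransientFailure extends RuntimeException {}
    static class BusinessError extends RuntimeException {}

    // Retry only TransientFailure, with exponential backoff between attempts.
    // BusinessError (and anything else) is not caught: it propagates immediately.
    static <T> T withRetry(Supplier<T> call, int maxAttempts, Duration initialBackoff) {
        Duration backoff = initialBackoff;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.get();
            } catch (TransientFailure e) {
                if (attempt >= maxAttempts) throw e;   // retries exhausted
                try {
                    Thread.sleep(backoff.toMillis());
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
                backoff = backoff.multipliedBy(2);     // exponential backoff
            }
        }
    }
}
```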

NetworkTask.from("procurement", "purchase-parts")
    .circuitBreaker(CircuitBreaker.builder()
        .failureThreshold(5)
        .windowDuration(Duration.ofMinutes(1))
        .halfOpenAfter(Duration.ofMinutes(5))
        .build())

NetworkTask.from("procurement", "purchase-parts")
    .onFailure(Fallback.delegateTo("procurement-backup", "purchase-parts"))
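The breaker semantics implied by that builder can be sketched as a small state machine — this is an assumed minimal model (consecutive-failure counting rather than a sliding window), not the framework's implementation:

```java
import java.time.Duration;
import java.time.Instant;

// CLOSED until failureThreshold consecutive failures, then OPEN (calls refused)
// until halfOpenAfter elapses, then HALF_OPEN allows a trial call through.
final class CircuitBreakerSketch {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration halfOpenAfter;
    private int failures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    CircuitBreakerSketch(int failureThreshold, Duration halfOpenAfter) {
        this.failureThreshold = failureThreshold;
        this.halfOpenAfter = halfOpenAfter;
    }

    synchronized boolean allowRequest(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(halfOpenAfter))) {
            state = State.HALF_OPEN;  // let one trial request through
        }
        return state != State.OPEN;
    }

    synchronized void recordSuccess() { failures = 0; state = State.CLOSED; }

    synchronized void recordFailure(Instant now) {
        if (state == State.HALF_OPEN || ++failures >= failureThreshold) {
            state = State.OPEN;
            openedAt = now;
            failures = 0;
        }
    }
}
```

When `allowRequest` returns false, the caller invokes its fallback (e.g. delegating to `procurement-backup`) instead of waiting on a known-dead peer.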

Natural language is the compatibility layer


The contract between ensembles is natural language: task descriptions and outputs. The LLM in each ensemble interprets the response regardless of exact wording. Minor changes in how an ensemble phrases its output do not break callers.

  • Shared task/tool contracts are defined by name + natural language description
  • If semantics change fundamentally, use a new task name (not a version bump)
  • Structured output types (Java records) are optional; when present, use @JsonIgnoreProperties(ignoreUnknown = true) for forward compatibility

          /\
         /  \    C. Chaos drills (rare, staging environment)
        /    \      Real ensembles, real humans, controlled failure injection
       /------\
      /        \   B. Simulation (regular, cheap)
     /          \     Full network, simulated LLMs, scenario-driven
    /------------\
   /              \  A. Component tests (frequent, fast)
  /                \    Stubs, contract tests, unit tests per ensemble
 /------------------\

Test each ensemble in isolation. Mock cross-ensemble calls:

NetworkTask procurementStub = NetworkTask.stub("procurement", "purchase-parts",
    "Ordered from SupplyCo, PO #4821, delivery Thursday");

Ensemble maintenance = Ensemble.builder()
    .task(Task.builder()
        .description("Fix boiler in building 2")
        .tools(procurementStub)
        .build())
    .build();

EnsembleOutput output = maintenance.run();

Contract tests verify both sides independently:

// Maintenance side: verify request format
NetworkTask recorder = NetworkTask.recording("procurement", "purchase-parts");
// ... run ensemble ...
assertThat(recorder.lastRequest()).contains("valve", "building 2");

// Procurement side: verify response format
EnsembleOutput output = procurementEnsemble.run(Map.of(
    "request", "Order replacement valve model X-420, urgent"));
assertThat(output.getRaw()).contains("PO #", "delivery");

B. Simulation — computer-model the evacuation

Simulation sim = Simulation.builder()
    .network(hotelNetwork)
    .scenario(Scenario.builder()
        .name("Conference peak load")
        .load("front-desk", LoadProfile.ramp(0, 200, Duration.ofMinutes(30)))
        .failure("kitchen", FailureProfile.downAt(Duration.ofMinutes(15),
            Duration.ofMinutes(5)))
        .latency("procurement", LatencyProfile.multiply(3.0))
        .build())
    .chatModel(SimulationChatModel.fast())
    .timeCompression(60)
    .build();

SimulationResult result = sim.run();
result.getBottlenecks();
result.getFailureCascades();
result.getCapacityReport();
result.getTokenEstimate();

C. Chaos engineering — run the drill with real people


Built into the framework, not bolted on:

ChaosExperiment experiment = ChaosExperiment.builder()
    .name("Kitchen outage during dinner rush")
    .against(hotelNetwork)
    .at(Duration.ofMinutes(5), Fault.kill("kitchen"))
    .at(Duration.ofMinutes(10), Fault.restore("kitchen"))
    .at(Duration.ofMinutes(3), Fault.latency("procurement", Duration.ofSeconds(30)))
    .expect(Assertion.circuitBreakerOpens("room-service", "kitchen",
        within(Duration.ofSeconds(30))))
    .expect(Assertion.fallbackActivated("room-service",
        within(Duration.ofMinutes(1))))
    .expect(Assertion.noDataLoss())
    .build();

ChaosReport report = experiment.run();
  • NetworkTask.stub(name, taskName, response) — canned response
  • NetworkTask.recording(name, taskName) — records requests for assertion
  • NetworkTool.stub(name, toolName, result) — same for tools
  • SimulationChatModel.fast() — generates realistic-shaped responses without real LLM calls
  • ChaosExperiment builder — fault injection with assertions

public enum AuditLevel {
    OFF,      // No audit trail (dev/test)
    MINIMAL,  // Cross-ensemble requests and responses only
    STANDARD, // + human decisions, review gates, priority changes
    FULL      // + LLM prompts/responses, tool I/O, memory reads/writes
}

Configurable per ensemble or at the network level:

EnsembleNetwork.builder()
    .auditLevel(AuditLevel.STANDARD)
    .ensemble("accounting", Ensemble.builder()
        .auditLevel(AuditLevel.FULL)
        .build())
    .build();

Audit level can escalate based on conditions:

AuditPolicy policy = AuditPolicy.builder()
    .defaultLevel(AuditLevel.MINIMAL)
    .rule(AuditRule.when("capacity_utilization > 0.8")
        .escalateTo(AuditLevel.STANDARD).on("kitchen"))
    .rule(AuditRule.when("task_failed")
        .escalateTo(AuditLevel.FULL).on("*")
        .duration(Duration.ofMinutes(10)))
    .rule(AuditRule.when("human_connected AND role == 'manager'")
        .escalateTo(AuditLevel.STANDARD).on("*"))
    .rule(AuditRule.schedule("18:00-22:00")
        .escalateTo(AuditLevel.STANDARD).on("kitchen", "room-service"))
    .build();

Trigger types: metric-driven, event-driven, time-based, human-triggered. Escalations are temporary and revert when the condition clears or the duration expires.

.auditSink(AuditSink.log()) // SLF4J structured logging
.auditSink(AuditSink.database(ds)) // JDBC -- immutable append-only
.auditSink(AuditSink.eventStream()) // Kafka for downstream consumers

All audit records are immutable, append-only, timestamped, and correlatable via trace ID.


Every cross-ensemble request carries a caller-generated requestId. If the receiver sees the same requestId twice, it returns the cached result instead of re-executing. The idempotency cache has a configurable TTL.
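The receiver-side check can be sketched with a TTL'd map — a minimal illustration only (the class name is hypothetical, results are strings, and a real implementation would evict expired entries and use an atomic `compute` to avoid duplicate concurrent executions):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotencySketch {
    private record Entry(String result, Instant expiresAt) {}
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Duration ttl;

    IdempotencySketch(Duration ttl) { this.ttl = ttl; }

    // A repeated requestId within the TTL returns the cached result
    // instead of re-executing the task.
    String execute(String requestId, Supplier<String> task) {
        Instant now = Instant.now();
        Entry hit = cache.get(requestId);
        if (hit != null && now.isBefore(hit.expiresAt())) {
            return hit.result();                 // duplicate: serve cached result
        }
        String result = task.get();              // first sighting: run the task
        cache.put(requestId, new Entry(result, now.plus(ttl)));
        return result;
    }
}
```

This is what makes at-least-once queue delivery safe: a redelivered WorkRequest hits the cache, not the LLM.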

Callers can opt into result caching with a cacheKey and maxAge:

{
  "requestId": "rs-8801",
  "task": "check-inventory",
  "cacheKey": "kitchen:inventory:2026-03-06",
  "cachePolicy": "USE_CACHED",
  "maxAge": "PT1H"
}

The cache is backed by a pluggable shared store (SPI):

.resultCache(ResultCache.inMemory()) // dev/test
.resultCache(ResultCache.redis(client)) // production

Within a single WebSocket connection, TCP guarantees ordering. Across reconnections or federation, ordering is not guaranteed. For most use cases, each cross-ensemble request is independent. When ordering matters, the caller uses the async mode’s onComplete callback to chain dependent requests.
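That chaining pattern can be sketched with `CompletableFuture` — here `submit` is a stand-in for an async NetworkTask call, and the task names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class OrderingSketch {
    // Stand-in for an async cross-ensemble submission; records when it ran.
    static CompletableFuture<String> submit(String task, List<String> log) {
        return CompletableFuture.supplyAsync(() -> { log.add(task); return task + ":done"; });
    }

    // When request B depends on request A, chain B in A's completion callback
    // instead of relying on the transport to preserve ordering.
    static List<String> chained() {
        List<String> log = new ArrayList<>();
        submit("order-parts", log)
            .thenCompose(r -> submit("schedule-repair", log)) // B starts only after A completes
            .join();
        return log;
    }
}
```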


network.sharedMemory("guest-preferences", SharedMemory.builder()
    .store(MemoryStore.embeddings(embeddingModel, store))
    .consistency(Consistency.EVENTUAL)
    .build());

network.sharedMemory("room-assignments", SharedMemory.builder()
    .store(MemoryStore.redis(client))
    .consistency(Consistency.LOCKED)
    .lockProvider(LockProvider.redis(client))
    .build());

network.sharedMemory("inventory-count", SharedMemory.builder()
    .store(MemoryStore.redis(client))
    .consistency(Consistency.OPTIMISTIC)
    .build());
| Model | Behavior | Use case |
| --- | --- | --- |
| EVENTUAL | Last-write-wins, no coordination | Context, preferences, notes |
| LOCKED | Distributed lock before write | Room assignments, exclusive access |
| OPTIMISTIC | Compare-and-swap, retry on conflict | Counters, inventory |
| EXTERNAL | Framework does not manage; user’s tools handle it | Database-backed state |

The LockProvider is an SPI: Redis, ZooKeeper, database advisory locks.
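The OPTIMISTIC model's read-modify-CAS-retry loop, sketched locally with an `AtomicInteger` standing in for the shared store's compare-and-swap primitive (the class name is hypothetical):

```java
import java.util.concurrent.atomic.AtomicInteger;

final class OptimisticCounterSketch {
    private final AtomicInteger count;

    OptimisticCounterSketch(int initial) { count = new AtomicInteger(initial); }

    // Read the current value, compute the update, compare-and-swap;
    // if another writer got there first, re-read and retry.
    int decrementIfAvailable() {
        while (true) {
            int current = count.get();
            if (current <= 0) return current;   // nothing left to reserve
            if (count.compareAndSet(current, current - 1)) {
                return current - 1;             // our write won
            }
            // CAS failed: conflicting write; loop and retry
        }
    }
}
```

Against Redis, the same shape is typically expressed with WATCH/MULTI or a Lua script; the retry-on-conflict structure is unchanged.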


STARTING -> READY -> DRAINING -> STOPPED

| State | Behavior | K8s readiness probe |
| --- | --- | --- |
| STARTING | Connecting to network, registering capabilities | false |
| READY | Accepting and processing work | true |
| DRAINING | Stop pulling new work; finish in-flight; deliver results | false |
| STOPPED | All in-flight complete; de-registered; connections closed | N/A (pod exits) |

When draining (SIGTERM, human directive, or profile switch):

  1. K8s readiness probe flips to false (Service stops routing new connections)
  2. Ensemble stops pulling new work from the durable queue
  3. In-flight tasks continue until completion or drain timeout
  4. Scheduled tasks stop running
  5. Results for completed in-flight work are delivered
  6. Unprocessed work remains in the durable queue for other replicas

Ensemble.builder()
    .drainTimeout(Duration.ofMinutes(5))
    .build();

K8s terminationGracePeriodSeconds should match drainTimeout.

The existing WebSocketServer gains health and lifecycle endpoints:

| Endpoint | Purpose |
| --- | --- |
| GET /api/health/live | Liveness probe: is the process alive? |
| GET /api/health/ready | Readiness probe: is the ensemble accepting work? |
| POST /api/lifecycle/drain | Trigger DRAINING state (used by K8s preStop hook) |
| GET /api/status | Status endpoint (existing, extended with lifecycle state) |

Simple mode (dev, single JVM):

  • In-process queues (ConcurrentLinkedQueue)
  • Direct WebSocket for request/response
  • No external infrastructure
  • Default for local development

Durable mode (production, K8s):

  • External request queue: Redis Streams, SQS, Kafka (pluggable)
  • External result store: Redis (pluggable)
  • WebSocket used for streaming events and human interaction
  • Survives pod restarts, supports horizontal scaling

EnsembleNetwork.builder()
    .transport(Transport.websocket())   // simple mode (default)
    .transport(Transport.durable(       // production mode
        RequestQueue.redis(redisClient),
        ResultStore.redis(redisClient)))
    .build();

In durable mode, the request and response paths are decoupled:

  • Request path: WorkRequest goes to a durable queue, any pod picks it up
  • Response path: Result is written to the shared result store keyed by requestId, or delivered via the caller-specified delivery method

This means the pod that processes the request may be different from the pod that received it, and the pod that delivers the result may be different from the pod that processed it. The requestId is the correlation key that holds it together.


| Module | Purpose | Required |
| --- | --- | --- |
| agentensemble-core | Task, Ensemble, Agent, workflow engine (unchanged) | Yes |
| agentensemble-network | New: EnsembleNetwork, NetworkTask, NetworkTool, WorkRequest, capability sharing, discovery, priority queue, lifecycle | Optional |
| agentensemble-web | Extended: human portal, role-based scoping, directives, multi-ensemble dashboard | Optional |
| agentensemble-telemetry-opentelemetry | New: OTel span integration, W3C trace context propagation | Optional |
| agentensemble-chaos | New: ChaosExperiment, fault injection, simulation | Optional (test scope) |
| agentensemble-transport-redis | New: Redis-backed queue, result store, lock provider, result cache | Optional |
| agentensemble-transport-kafka | New: Kafka-backed queue and topic delivery | Optional |
| agentensemble-memory | Existing: MemoryStore SPI (extended with consistency models) | Optional |
| agentensemble-review | Existing: ReviewHandler SPI (extended with requiredRole) | Optional |
| agentensemble-viz | Extended: /network route, multi-ensemble dashboard, simulation view | Optional |
| agentensemble-metrics-micrometer | Existing: Micrometer metrics (extended with network metrics) | Optional |
| agentensemble-devtools | Existing: DagExporter (extended with network topology) | Optional |

Building on the v2.1.0 protocol (Section 4 of design doc 16), the following message types are added for network communication:

| Message | Purpose |
| --- | --- |
| ensemble_register | Register capabilities (shared tasks/tools) on the network |
| ensemble_deregister | De-register when shutting down |
| capacity_update | Periodic capacity advertisement (load, queue depth, shareable) |

| Message | Purpose |
| --- | --- |
| task_request | WorkRequest: request execution of a shared task |
| task_accepted | ACK with queue position and ETA |
| task_progress | Optional streaming progress update during execution |
| task_response | Work result (completed, failed, or rejected) |
| tool_request | Request execution of a shared tool |
| tool_response | Tool result |

| Message | Purpose |
| --- | --- |
| directive | Non-blocking guidance or control-plane command |
| query | Request information from an ensemble |
| query_response | Response to a query |
| review_decision | Existing: human approves/edits/exits a review gate |

| Message | Purpose |
| --- | --- |
| notification | Alert sent to a qualified human |
| review_requested | Existing: review gate waiting for human decision |

| Message | Purpose |
| --- | --- |
| profile_applied | Operational profile change notification |
| capability_query | Discovery: who provides this capability? |
| capability_response | Discovery: list of providers |
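As a sketch of how a receiver might route these message types (the envelope record and its fields beyond requestId are illustrative assumptions, not the framework's actual wire format):

```java
import java.util.Map;

// Hypothetical sketch: a minimal message envelope and a dispatcher over the
// message types listed above. Field names other than requestId are assumptions.
public class WireSketch {
    record Message(String type, String requestId, Map<String, String> body) {}

    static String dispatch(Message m) {
        return switch (m.type()) {
            case "task_request"     -> "queue work " + m.requestId();
            case "tool_request"     -> "invoke tool " + m.body().get("tool");
            case "directive"        -> "apply guidance (non-blocking)";
            case "capability_query" -> "answer with local capabilities";
            default                 -> "unknown type: " + m.type();
        };
    }

    public static void main(String[] args) {
        Message req = new Message("task_request", "req-7",
                Map.of("task", "prepare-meal", "payload", "pasta for room 12"));
        System.out.println(dispatch(req)); // queue work req-7
    }
}
```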

The framework defines an SPI for authentication and authorization but does not implement specific mechanisms in v3.0.0. The expected production deployment uses K8s infrastructure:

  • mTLS via service mesh (Istio, Linkerd) for ensemble-to-ensemble authentication
  • K8s Network Policies for namespace-level access control
  • RBAC for human roles (integrated with the organization’s identity provider)

The requiredRole field on review gates is enforced by the framework; the mapping from authenticated user to role is provided by the deployment's auth integration.

For local development, the existing localhost-only binding from v2.1.0 WebDashboard is retained. Non-localhost binding logs a warning.

Different ensembles may use different LLM API keys. The framework does not manage secrets; K8s Secrets and environment variables are the expected mechanism. The wire protocol must never include API keys or credentials.


```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kitchen
  namespace: hotel-downtown
spec:
  replicas: 2
  selector:            # required by apps/v1; must match the template labels
    matchLabels:
      app: kitchen
  template:
    metadata:
      labels:
        app: kitchen
    spec:
      terminationGracePeriodSeconds: 300
      containers:
        - name: kitchen
          image: hotel/kitchen-ensemble:latest
          ports:
            - containerPort: 7329
          env:
            - name: ENSEMBLE_NAME
              value: "kitchen"
            - name: REDIS_URL
              value: "redis://redis:6379"
          livenessProbe:
            httpGet:
              path: /api/health/live
              port: 7329
          readinessProbe:
            httpGet:
              path: /api/health/ready
              port: 7329
          lifecycle:
            preStop:
              httpGet:
                path: /api/lifecycle/drain
                port: 7329
---
apiVersion: v1
kind: Service
metadata:
  name: kitchen
  namespace: hotel-downtown
spec:
  selector:
    app: kitchen
  ports:
    - port: 7329
      targetPort: 7329
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kitchen-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kitchen
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: agentensemble_queued_requests
        target:
          type: AverageValue
          averageValue: 5
```

Phase 1 covers the core network surface:
  • Long-running ensemble mode (Ensemble.start())
  • shareTask() / shareTool() on Ensemble builder
  • NetworkTask / NetworkTool implementations (WebSocket transport)
  • WorkRequest envelope and wire protocol extensions
  • Capability handshake on WebSocket connect
  • Simple mode transport (in-process, direct WebSocket)
  • K8s health and lifecycle endpoints
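The builder surface named in the list above (shareTask() / shareTool(), long-running start()) might look like the following sketch; everything beyond those three names is illustrative placeholder plumbing, not the framework's actual API:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch of the builder surface. Only shareTask/shareTool and
// start() come from the feature list; the rest is placeholder plumbing.
public class EnsembleSketch {
    private final Set<String> sharedTasks = new LinkedHashSet<>();
    private final Set<String> sharedTools = new LinkedHashSet<>();
    private volatile boolean running;

    public EnsembleSketch shareTask(String name) { sharedTasks.add(name); return this; }
    public EnsembleSketch shareTool(String name) { sharedTools.add(name); return this; }

    // Long-running mode: in the real system this would send ensemble_register,
    // perform the capability handshake, and serve work until drained.
    public void start() { running = true; }

    public boolean isRunning() { return running; }

    public Set<String> capabilities() {
        Set<String> all = new LinkedHashSet<>(sharedTasks);
        all.addAll(sharedTools);
        return all;
    }

    public static void main(String[] args) {
        EnsembleSketch kitchen = new EnsembleSketch()
                .shareTask("prepare-meal")
                .shareTool("inventory-lookup");
        kitchen.start();
        System.out.println(kitchen.capabilities()); // [prepare-meal, inventory-lookup]
    }
}
```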

Phase 2: Durable Transport and Delivery (v3.0.0-beta)

  • Durable queue integration (Redis Streams, Kafka)
  • Pluggable delivery methods (Queue, Topic, Webhook, Store, BroadcastClaim)
  • Pluggable ingress methods (Queue, HTTP API, Topic subscription)
  • Result caching with shared store
  • Idempotency with TTL-based cache
  • Priority queuing with aging
  • Scheduled/proactive tasks
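One item from this list, idempotency with a TTL-based cache, can be sketched as follows (class and method names are illustrative; the clock is passed explicitly so the behavior is deterministic):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of TTL-based idempotency: a repeated requestId within the TTL window
// returns the cached result instead of re-executing the work.
public class IdempotencyCache {
    private record Entry(String result, long expiresAtMillis) {}
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public IdempotencyCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    public String execute(String requestId, long nowMillis, Supplier<String> work) {
        Entry e = cache.get(requestId);
        if (e != null && nowMillis < e.expiresAtMillis()) {
            return e.result();                 // duplicate delivery: serve cached result
        }
        String result = work.get();            // first (or expired) delivery: run the work
        cache.put(requestId, new Entry(result, nowMillis + ttlMillis));
        return result;
    }

    public static void main(String[] args) {
        IdempotencyCache cache = new IdempotencyCache(60_000);
        String a = cache.execute("req-1", 0, () -> "cooked");
        String b = cache.execute("req-1", 1_000, () -> "cooked-again"); // within TTL
        System.out.println(a + " / " + b); // cooked / cooked
    }
}
```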

Phase 3: Human Participation and Observability (v3.0.0-rc)

  • Multi-ensemble dashboard (/network route in viz)
  • Human directives (non-blocking guidance)
  • Role-based gated reviews (requiredRole)
  • OpenTelemetry integration module
  • Leveled audit trail with dynamic rules
  • Control plane directives (model tier switching, profile application)
  • Shared memory with configurable consistency (EVENTUAL/LOCKED/OPTIMISTIC)
  • Federation (cross-namespace, cross-cluster capability sharing)
  • Operational profiles with scheduled switching
  • Capability-based discovery and NetworkToolCatalog
  • Simulation mode (SimulationChatModel, time compression, scenario builder)
  • Chaos engineering (ChaosExperiment, fault injection, assertions)
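The OPTIMISTIC consistency model in the list above can be sketched as version-checked commits: a writer reads a versioned value, computes, and commits only if no other ensemble has committed in the meantime (names here are illustrative, not the MemoryStore SPI):

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of OPTIMISTIC shared memory: commit succeeds only if the version
// observed at read time is still current; otherwise the caller re-reads and retries.
public class OptimisticMemory {
    record Versioned(long version, String value) {}
    private final AtomicReference<Versioned> cell =
            new AtomicReference<>(new Versioned(0, ""));

    public Versioned read() { return cell.get(); }

    // Returns false if another writer committed first.
    public boolean commit(long expectedVersion, String newValue) {
        Versioned cur = cell.get();
        if (cur.version() != expectedVersion) return false;
        return cell.compareAndSet(cur, new Versioned(expectedVersion + 1, newValue));
    }

    public static void main(String[] args) {
        OptimisticMemory mem = new OptimisticMemory();
        Versioned snapshot = mem.read();
        boolean ok = mem.commit(snapshot.version(), "pantry restocked");
        boolean stale = mem.commit(snapshot.version(), "lost update"); // version moved on
        System.out.println(ok + " / " + stale); // true / false
    }
}
```

LOCKED would instead take a distributed lock before writing, and EVENTUAL would skip version checks entirely.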

The initial design explored a top-down conductor that orchestrates a DAG of ensembles. This was rejected in favor of the peer-to-peer mesh model because:

  • Real systems are decentralized (room service talks to maintenance directly)
  • A central conductor is a single point of failure
  • The mesh model scales naturally with K8s Service discovery
  • Humans are participants, not controllers

WebSocket provides bidirectional communication over a single connection. The existing v2.1.0 agentensemble-web module already uses WebSocket for streaming events and review decisions, so extending it to ensemble-to-ensemble communication is natural.

WebSocket connections are ephemeral. Pod restarts, network blips, and horizontal scaling all break connections. For reliable work delivery, the durable queue (Redis Streams, Kafka) is the backbone. WebSocket is used for real-time events and human interaction — not for critical work delivery.

Why natural language contracts instead of typed schemas?

LLM-based agents interpret natural language. The output of a cross-ensemble task is natural language (optionally with structured output). Schema versioning is unnecessary because the LLM is the compatibility layer. This dramatically simplifies inter-ensemble contracts.

LLM tasks are inherently async. A task taking 5 minutes is normal. Rejecting requests under load is worse than queuing them. The caller-side SLA (deadline) puts the routing intelligence where it belongs — with the requester, who knows their own constraints.
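Caller-side SLA routing can be sketched as follows (record and method names are illustrative; the real WorkRequest and capacity advertisement are not shown): the requester compares each provider's advertised ETA against its own deadline and picks accordingly.

```java
import java.util.List;
import java.util.Optional;

// Sketch of caller-side SLA routing: the requester knows its own deadline
// and filters providers by their advertised ETA.
public class DeadlineRouting {
    record Provider(String name, long etaSeconds) {}

    // Pick the fastest provider that can still meet the caller's deadline.
    static Optional<Provider> route(List<Provider> providers, long deadlineSeconds) {
        return providers.stream()
                .filter(p -> p.etaSeconds() <= deadlineSeconds)
                .min((a, b) -> Long.compare(a.etaSeconds(), b.etaSeconds()));
    }

    public static void main(String[] args) {
        List<Provider> kitchens = List.of(
                new Provider("kitchen-1", 300),    // 5 minutes of queued work
                new Provider("kitchen-2", 1_200)); // 20 minutes of queued work
        // A caller with a 10-minute deadline accepts only kitchen-1.
        System.out.println(route(kitchens, 600).map(Provider::name).orElse("none"));
        // kitchen-1
    }
}
```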

External chaos tools (Chaos Monkey, Litmus) operate at the infrastructure level. They cannot inject application-level faults (drop specific message types, simulate LLM timeout, degrade specific capabilities). The framework owns the network layer and can inject precise, semantic faults.
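A semantic fault of this kind might be injected by decorating the transport, as in this sketch (ChaosExperiment appears in the feature list; the decorator and its names here are illustrative):

```java
import java.util.Set;
import java.util.function.Function;

// Sketch of application-level fault injection: a transport decorator that
// drops configured message types while leaving all others untouched.
public class ChaosTransport {
    private final Function<String, String> inner; // the real send(type) -> ack
    private final Set<String> droppedTypes;       // the experiment's fault spec

    public ChaosTransport(Function<String, String> inner, Set<String> droppedTypes) {
        this.inner = inner;
        this.droppedTypes = droppedTypes;
    }

    public String send(String messageType) {
        if (droppedTypes.contains(messageType)) {
            return "DROPPED"; // semantic fault: only this message type is affected
        }
        return inner.apply(messageType);
    }

    public static void main(String[] args) {
        ChaosTransport t = new ChaosTransport(type -> "ACK:" + type,
                Set.of("task_progress"));
        System.out.println(t.send("task_request"));  // ACK:task_request
        System.out.println(t.send("task_progress")); // DROPPED
    }
}
```

An infrastructure-level tool could kill the pod or partition the network, but only a decorator like this can drop exactly one message type and nothing else.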