Long-Running Ensembles

AgentEnsemble v3.0 introduces long-running mode: an ensemble that starts, listens for work, and runs continuously until explicitly stopped. This is the foundation for the Ensemble Network — distributed multi-ensemble systems where autonomous ensembles communicate peer-to-peer.

| Mode | Description | Example |
| --- | --- | --- |
| One-shot (run()) | Execute tasks, return output, done. | Research + report generation |
| Long-running (start()) | Bind a port, accept work, run until stopped. | Kitchen service in a hotel |

The existing Ensemble.run() API is completely unchanged.

A long-running ensemble transitions through four states:

STARTING -> READY -> DRAINING -> STOPPED

| State | Behavior | Accepting work? |
| --- | --- | --- |
| STARTING | Binding server port, registering capabilities | No |
| READY | Running, accepting and processing work | Yes |
| DRAINING | Finishing in-flight work, rejecting new requests | No |
| STOPPED | Shutdown complete, connections closed | No |
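
The state table above can be sketched as a small enum. This is only an illustration: `LifecycleState`, `isAcceptingWork()`, and `next()` are hypothetical names chosen for this example and are not confirmed to match the library's own types.

```java
// Hypothetical enum mirroring the documented states; not the library's actual type.
enum LifecycleState {
    STARTING(false),
    READY(true),
    DRAINING(false),
    STOPPED(false);

    private final boolean acceptingWork;

    LifecycleState(boolean acceptingWork) {
        this.acceptingWork = acceptingWork;
    }

    /** True only for READY, matching the "Accepting work?" column. */
    boolean isAcceptingWork() {
        return acceptingWork;
    }

    /** Returns the next state in the linear lifecycle; STOPPED is terminal. */
    LifecycleState next() {
        return this == STOPPED ? STOPPED : values()[ordinal() + 1];
    }
}

class LifecycleStateDemo {
    public static void main(String[] args) {
        // Print each state with its accepting-work flag and successor.
        for (LifecycleState s : LifecycleState.values()) {
            System.out.println(s + " acceptingWork=" + s.isAcceptingWork() + " next=" + s.next());
        }
    }
}
```

Keeping the "accepting work" flag on the state itself means request handlers can reject work with a single check instead of comparing against several states.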

Long-running mode requires a dashboard for WebSocket connectivity. Configure one via .webDashboard(...) before calling start():

// 1. Create the WebDashboard bound to the desired port
WebDashboard dashboard = WebDashboard.builder().port(7329).build();

// 2. Build the ensemble with the dashboard wired in
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTask("prepare-meal", mealTask)
    .shareTool("check-inventory", inventoryTool)
    .webDashboard(dashboard) // required; also starts the server
    .build();

// 3. Transition to READY state and register the shutdown hook
kitchen.start(7329); // port is advisory for error messages / logs

// ... ensemble runs until stopped ...

kitchen.stop(); // DRAINING -> STOPPED

  • Calling start() on an already-started ensemble is a no-op.
  • Calling stop() on an already-stopped or never-started ensemble is a no-op.
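
The no-op behavior of repeated start() and stop() calls can be modelled with an atomic compare-and-set on the lifecycle state. The sketch below is a simplified stand-in, not the library's implementation: the class `IdempotentLifecycle` and its `NEW` state are hypothetical, and it assumes a stopped ensemble is never restarted, which the documentation does not state.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical lifecycle holder; names are illustrative, not AgentEnsemble API.
class IdempotentLifecycle {
    enum State { NEW, READY, DRAINING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    /** Returns true only for the call that actually performed the start. */
    boolean start() {
        // compareAndSet succeeds for exactly one caller; later calls are no-ops.
        return state.compareAndSet(State.NEW, State.READY);
    }

    /** Returns true only for the call that actually performed the stop. */
    boolean stop() {
        // Only a READY instance can begin draining; never-started or
        // already-stopped instances fall through as no-ops.
        if (!state.compareAndSet(State.READY, State.DRAINING)) {
            return false;
        }
        // A real implementation would shut down its server here.
        state.set(State.STOPPED);
        return true;
    }

    State state() {
        return state.get();
    }
}
```

Because each transition is a single atomic operation, concurrent callers (for example a shutdown hook racing an explicit stop()) cannot both run the shutdown path.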

When stop() is called, the ensemble transitions to DRAINING, stops the WebSocket server (if this ensemble owns the dashboard lifecycle), and then transitions to STOPPED.

The drainTimeout field is available for configuration and will be used by a future implementation that waits for in-flight tasks to complete before stopping.

A JVM shutdown hook is automatically registered so that SIGTERM triggers graceful shutdown.

Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .drainTimeout(Duration.ofMinutes(2)) // Configurable; default: 5 minutes
    .build();
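
For readers unfamiliar with JVM shutdown hooks, the following sketch shows the general mechanism behind the automatic SIGTERM handling mentioned above. It uses only the standard `Runtime.addShutdownHook` API; the helper `registerStopHook` and the thread name are hypothetical, and the library's own hook may differ.

```java
// Hypothetical helper showing how a JVM shutdown hook can trigger a stop callback.
class ShutdownHookSketch {
    /** Registers a hook that runs stopAction when the JVM begins shutting down. */
    static Thread registerStopHook(Runnable stopAction) {
        Thread hook = new Thread(stopAction, "ensemble-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // In the guide's terms, the action would be kitchen::stop.
        registerStopHook(() -> System.out.println("stop() invoked from shutdown hook"));
        System.out.println("Running; the hook fires on SIGTERM or normal JVM exit.");
    }
}
```

Shutdown hooks run on SIGTERM and on normal exit, but not on SIGKILL, so the process must finish stopping before the orchestrator's grace period expires.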

Long-running ensembles can share two kinds of capabilities with the network: tasks and tools.

A shared task is a full task that other ensembles can delegate work to:

Task mealTask = Task.builder()
    .description("Prepare a meal as specified")
    .expectedOutput("Confirmation with preparation details and timing")
    .build();

Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTask("prepare-meal", mealTask)
    .build();

A shared tool is a single tool that other ensembles’ agents can invoke remotely:

Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTool("check-inventory", inventoryTool)
    .shareTool("dietary-check", allergyCheckTool)
    .build();

  • Shared capability names must be unique within an ensemble.
  • Names must not be null or blank.
  • Task/tool references must not be null.
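
The three validation rules above could be enforced along the following lines. This is a minimal sketch under stated assumptions: `SharedCapabilityRegistry` and `share` are hypothetical names, the exception types are a guess, and it assumes tasks and tools share one namespace, which the rules suggest but do not spell out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical registry illustrating the validation rules; not the library's code.
class SharedCapabilityRegistry {
    private final Map<String, Object> capabilities = new LinkedHashMap<>();

    /** Registers a capability, enforcing non-blank unique names and non-null references. */
    void share(String name, Object capability) {
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("Capability name must not be null or blank");
        }
        Objects.requireNonNull(capability, "Capability reference must not be null");
        // putIfAbsent returns the existing value when the name is already taken.
        if (capabilities.putIfAbsent(name, capability) != null) {
            throw new IllegalArgumentException("Duplicate capability name: " + name);
        }
    }

    int size() {
        return capabilities.size();
    }
}
```

Validating at registration time means a misconfigured ensemble fails when it is built rather than when a remote peer first tries to use the capability.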

When a client connects to a long-running ensemble via WebSocket, the server sends a hello message that includes the ensemble’s shared capabilities. Because HelloMessage uses @JsonInclude(NON_NULL), null fields are omitted from the wire payload:

{
  "type": "hello",
  "ensembleId": "run-abc123",
  "sharedCapabilities": [
    {"name": "prepare-meal", "description": "Prepare a meal as specified", "type": "TASK"},
    {"name": "check-inventory", "description": "Check ingredient availability", "type": "TOOL"}
  ]
}

This is backward compatible with v2.x clients because MessageSerializer configures Jackson with FAIL_ON_UNKNOWN_PROPERTIES = false, so older clients simply ignore the new sharedCapabilities field.

Long-running ensembles expose HTTP endpoints for Kubernetes health probes and lifecycle management:

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/health/live | GET | Liveness probe: returns 200 when the process is alive |
| /api/health/ready | GET | Readiness probe: returns 200 only in READY state; 503 otherwise |
| /api/lifecycle/drain | POST | Triggers transition to DRAINING state |
| /api/status | GET | Extended status including lifecycleState field |

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kitchen
spec:
  replicas: 2
  selector: # required by apps/v1; the label value is illustrative
    matchLabels:
      app: kitchen
  template:
    metadata:
      labels:
        app: kitchen
    spec:
      terminationGracePeriodSeconds: 300 # Match drainTimeout
      containers:
        - name: kitchen
          image: hotel/kitchen-ensemble:latest
          ports:
            - containerPort: 7329
          livenessProbe:
            httpGet:
              path: /api/health/live
              port: 7329
          readinessProbe:
            httpGet:
              path: /api/health/ready
              port: 7329
          lifecycle:
            preStop:
              httpGet:
                path: /api/lifecycle/drain
                port: 7329

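Set terminationGracePeriodSeconds to match the ensemble’s drainTimeout (300 seconds corresponds to the 5-minute default) so that Kubernetes waits long enough for in-flight work to complete. Note that a Kubernetes httpGet lifecycle hook sends a GET request, whereas /api/lifecycle/drain is listed as a POST endpoint; if the server rejects GET on that path, use an exec hook that issues the POST instead.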
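
To make the probe semantics concrete, here is a small stand-alone sketch of the liveness and readiness endpoints listed above, built on the JDK's `com.sun.net.httpserver.HttpServer`. AgentEnsemble serves these endpoints itself; the class `HealthProbeSketch` and its `ready` flag are hypothetical and exist only to show the 200/503 behavior.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical probe server; it only mimics the documented status codes.
class HealthProbeSketch {
    final AtomicBoolean ready = new AtomicBoolean(false);
    private final HttpServer server;

    HealthProbeSketch(int port) throws IOException {
        // Port 0 asks the OS for any free port, which is handy in tests.
        server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        // Liveness: the process is up, so always answer 200 with no body.
        server.createContext("/api/health/live", exchange -> {
            exchange.sendResponseHeaders(200, -1);
            exchange.close();
        });
        // Readiness: 200 only while READY, 503 in every other state.
        server.createContext("/api/health/ready", exchange -> {
            exchange.sendResponseHeaders(ready.get() ? 200 : 503, -1);
            exchange.close();
        });
        server.start();
    }

    int port() {
        return server.getAddress().getPort();
    }

    void close() {
        server.stop(0);
    }
}
```

With this split, Kubernetes stops routing traffic to a pod as soon as readiness returns 503 during DRAINING, while liveness keeps returning 200 so the pod is not restarted mid-drain.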

Other ensembles can use shared tasks and tools via NetworkTask and NetworkTool:

NetworkConfig config = NetworkConfig.builder()
    .ensemble("kitchen", "ws://kitchen:7329/ws")
    .build();

try (NetworkClientRegistry registry = new NetworkClientRegistry(config)) {
    EnsembleOutput result = Ensemble.builder()
        .chatLanguageModel(model)
        .task(Task.builder()
            .description("Handle room service request")
            .tools(
                NetworkTask.from("kitchen", "prepare-meal", registry),
                NetworkTool.from("kitchen", "check-inventory", registry))
            .build())
        .build()
        .run();
}

See the Cross-Ensemble Delegation guide for details.