Skip to content
AgentEnsemble AgentEnsemble
Get Started

Durable Transport for Agent Networks: Moving from In-Process Queues to Kafka

In-process queues are fine for development. They are fast, deterministic, and require zero infrastructure. But they have a property that becomes a liability in production: when the process dies, the queue contents disappear.

For agent networks that run as long-lived services — handling work requests over hours or days — losing queued requests on restart is not acceptable. The transport layer needs durability, and that means moving from in-process data structures to something that survives process failures.

An agent ensemble network has three communication patterns that need durable backing:

  1. Work request delivery — a request from one ensemble to another should not be lost if the receiving ensemble is temporarily unavailable
  2. Response routing — when an ensemble completes a request, the response needs to reach the original caller even if the caller restarted
  3. Capability advertisement — shared tasks and tools should remain discoverable across process restarts

The agentensemble-transport-kafka module implements the transport SPIs against Apache Kafka:

KafkaTransportConfig config = KafkaTransportConfig.builder()
.bootstrapServers("kafka:9092")
.consumerGroupId("kitchen-ensemble")
.topicPrefix("agentensemble.")
.build();

KafkaRequestQueue produces work requests to a Kafka topic and consumes them with manual offset commits. If the ensemble crashes mid-processing, the request will be redelivered on restart.

For workloads where some requests are more urgent than others, PriorityRequestQueue adds priority levels with aging to prevent starvation. Requests that wait longer than the aging interval are promoted to the next higher priority level.

Moving from in-process to Kafka transport changes the operational profile:

  • Startup behavior — with Kafka, ensembles may start with a backlog of unprocessed requests
  • Failure modes — infrastructure-level errors (broker unavailable) rather than process-fatal
  • Monitoring — consumer lag, partition health, broker connectivity
  • Ordering — per-partition ordering, not strict FIFO

The ensemble does not know it is using Kafka — it interacts with transport SPIs. The Kafka-specific configuration lives in the infrastructure layer:

// Infrastructure layer
KafkaRequestQueue queue = KafkaRequestQueue.builder()
.config(kafkaConfig)
.ensembleName("kitchen")
.build();
// Application layer (transport-agnostic)
Ensemble kitchen = Ensemble.builder()
.chatLanguageModel(model)
.task(Task.of("Manage kitchen operations"))
.requestQueue(queue)
.build();

Same ensemble code works in development (in-process queues) and production (Kafka) without changes.

At-least-once delivery. A request may be processed twice if the ensemble crashes after completing work but before committing the offset. For most agent workloads (non-deterministic anyway), this is acceptable.

Operational complexity. Kafka needs to be provisioned, monitored, and maintained. For small deployments, the overhead may not be justified.

Latency. Kafka adds millisecond-scale latency. For agent workloads where execution takes seconds or minutes, this is negligible.


The Kafka transport module is part of AgentEnsemble. The durable transport guide covers the full configuration and operational details.