Durable Transport for Agent Networks: Moving from In-Process Queues to Kafka
In-process queues are fine for development. They are fast, deterministic, and require zero infrastructure. But they have a property that becomes a liability in production: when the process dies, the queue contents disappear.
For agent networks that run as long-lived services — handling work requests over hours or days — losing queued requests on restart is not acceptable. The transport layer needs durability, and that means moving from in-process data structures to something that survives process failures.
What durability means for agent networks
Section titled “What durability means for agent networks”An agent ensemble network has three communication patterns that need durable backing:
- Work request delivery — a request from one ensemble to another should not be lost if the receiving ensemble is temporarily unavailable
- Response routing — when an ensemble completes a request, the response needs to reach the original caller even if the caller restarted
- Capability advertisement — shared tasks and tools should remain discoverable across process restarts
Kafka as the transport backing
Section titled “Kafka as the transport backing”The agentensemble-transport-kafka module implements the transport SPIs against Apache Kafka:
KafkaTransportConfig config = KafkaTransportConfig.builder() .bootstrapServers("kafka:9092") .consumerGroupId("kitchen-ensemble") .topicPrefix("agentensemble.") .build();Request queues
Section titled “Request queues”KafkaRequestQueue produces work requests to a Kafka topic and consumes them with manual offset commits. If the ensemble crashes mid-processing, the request will be redelivered on restart.
Priority queues with aging
Section titled “Priority queues with aging”For workloads where some requests are more urgent than others, PriorityRequestQueue adds priority levels with aging to prevent starvation. Requests that wait longer than the aging interval are promoted to the next higher priority level.
What changes operationally
Section titled “What changes operationally”Moving from in-process to Kafka transport changes the operational profile:
- Startup behavior — with Kafka, ensembles may start with a backlog of unprocessed requests
- Failure modes — infrastructure-level errors (broker unavailable) rather than process-fatal
- Monitoring — consumer lag, partition health, broker connectivity
- Ordering — per-partition ordering, not strict FIFO
The configuration boundary
Section titled “The configuration boundary”The ensemble does not know it is using Kafka — it interacts with transport SPIs. The Kafka-specific configuration lives in the infrastructure layer:
// Infrastructure layerKafkaRequestQueue queue = KafkaRequestQueue.builder() .config(kafkaConfig) .ensembleName("kitchen") .build();
// Application layer (transport-agnostic)Ensemble kitchen = Ensemble.builder() .chatLanguageModel(model) .task(Task.of("Manage kitchen operations")) .requestQueue(queue) .build();Same ensemble code works in development (in-process queues) and production (Kafka) without changes.
Tradeoffs
Section titled “Tradeoffs”At-least-once delivery. A request may be processed twice if the ensemble crashes after completing work but before committing the offset. For most agent workloads (non-deterministic anyway), this is acceptable.
Operational complexity. Kafka needs to be provisioned, monitored, and maintained. For small deployments, the overhead may not be justified.
Latency. Kafka adds millisecond-scale latency. For agent workloads where execution takes seconds or minutes, this is negligible.
The Kafka transport module is part of AgentEnsemble. The durable transport guide covers the full configuration and operational details.