Durable Transport for Agent Networks: Moving from In-Process Queues to Kafka

May 9, 2026

In-process queues are fine for development. They are fast, deterministic, and require zero infrastructure. But they have a property that becomes a liability in production: when the process dies, the queue contents disappear.

For agent networks that run as long-lived services — handling work requests over hours or days — losing queued requests on restart is not acceptable. The transport layer needs durability, and that means moving from in-process data structures to something that survives process failures.

What durability means for agent networks

An agent ensemble network has three communication patterns that need durable backing:

Work request delivery — a request from one ensemble to another should not be lost if the receiving ensemble is temporarily unavailable
Response routing — when an ensemble completes a request, the response needs to reach the original caller even if the caller restarted
Capability advertisement — shared tasks and tools should remain discoverable across process restarts

Kafka as the transport backing

The agentensemble-transport-kafka module implements the transport SPIs against Apache Kafka:

KafkaTransportConfig config = KafkaTransportConfig.builder()
    .bootstrapServers("kafka:9092")
    .consumerGroupId("kitchen-ensemble")
    .topicPrefix("agentensemble.")
    .build();

Request queues

KafkaRequestQueue produces work requests to a Kafka topic and consumes them with manual offset commits. If the ensemble crashes mid-processing, the request will be redelivered on restart.

Priority queues with aging

For workloads where some requests are more urgent than others, PriorityRequestQueue adds priority levels with aging to prevent starvation. Requests that wait longer than the aging interval are promoted to the next higher priority level.

What changes operationally

Moving from in-process to Kafka transport changes the operational profile:

Startup behavior — with Kafka, ensembles may start with a backlog of unprocessed requests
Failure modes — infrastructure-level errors (broker unavailable) rather than process-fatal
Monitoring — consumer lag, partition health, broker connectivity
Ordering — per-partition ordering, not strict FIFO

The configuration boundary

The ensemble does not know it is using Kafka — it interacts with transport SPIs. The Kafka-specific configuration lives in the infrastructure layer:

// Infrastructure layer
KafkaRequestQueue queue = KafkaRequestQueue.builder()
    .config(kafkaConfig)
    .ensembleName("kitchen")
    .build();

// Application layer (transport-agnostic)
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .requestQueue(queue)
    .build();

Same ensemble code works in development (in-process queues) and production (Kafka) without changes.

Tradeoffs

At-least-once delivery. A request may be processed twice if the ensemble crashes after completing work but before committing the offset. For most agent workloads (non-deterministic anyway), this is acceptable.

Operational complexity. Kafka needs to be provisioned, monitored, and maintained. For small deployments, the overhead may not be justified.

Latency. Kafka adds millisecond-scale latency. For agent workloads where execution takes seconds or minutes, this is negligible.

The Kafka transport module is part of AgentEnsemble. The durable transport guide covers the full configuration and operational details.