Rate Limiting

Rate limiting caps the number of LLM API requests per time window. It is especially important in parallel workflows, where multiple agents share the same API key and their concurrent requests could exceed the provider's quota.


Rate limiting uses a token-bucket algorithm: a token is added to the bucket at a fixed interval (e.g. every 500 ms for 2 req/sec). Each LLM request consumes one token. When no token is available, the calling thread blocks until one is ready or a wait timeout expires.
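The behaviour described above can be sketched in a few lines of plain Java. This is a minimal illustration only, not the library's actual implementation: the class name TokenBucketSketch, the capacity parameter, the choice to start with a full bucket, and returning false on timeout (instead of throwing) are all assumptions made for the example.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a blocking token bucket; not the library's code.
final class TokenBucketSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final long capacity;            // assumption: max tokens the bucket holds
    private final long refillIntervalNanos; // one token is added per interval
    private long tokens;
    private long lastRefillNanos;

    TokenBucketSketch(long capacity, Duration refillInterval) {
        if (capacity < 1 || refillInterval.isZero() || refillInterval.isNegative()) {
            throw new IllegalArgumentException("capacity and interval must be positive");
        }
        this.capacity = capacity;
        this.refillIntervalNanos = refillInterval.toNanos();
        this.tokens = capacity;             // assumption: the bucket starts full
        this.lastRefillNanos = System.nanoTime();
    }

    // Takes one token, waiting up to `timeout`. Returns false on timeout or interrupt.
    boolean tryAcquire(Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            long untilNextToken;
            lock.lock();
            try {
                refill();
                if (tokens > 0) {
                    tokens--;
                    return true;
                }
                untilNextToken = refillIntervalNanos - (System.nanoTime() - lastRefillNanos);
            } finally {
                lock.unlock();
            }
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                return false; // wait timeout expired
            }
            try {
                // Sleep outside the lock until the next token is due or the deadline passes.
                TimeUnit.NANOSECONDS.sleep(Math.max(1L, Math.min(untilNextToken, remaining)));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Adds the tokens that have accrued since the last refill; caller holds the lock.
    private void refill() {
        long elapsed = System.nanoTime() - lastRefillNanos;
        long newTokens = elapsed / refillIntervalNanos;
        if (newTokens > 0) {
            tokens = Math.min(capacity, tokens + newTokens);
            lastRefillNanos += newTokens * refillIntervalNanos;
        }
    }
}
```

With a capacity of 1 and a 500 ms interval, this sketch admits at most 2 requests per second: a caller that finds the bucket empty sleeps until the next token is due, and gives up once its timeout has elapsed.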

RateLimit describes the bucket refill rate. RateLimitedChatModel is a decorator that wraps any ChatModel with a bucket.


// Factory methods
RateLimit.of(60, Duration.ofMinutes(1)) // 60 requests per minute
RateLimit.perMinute(60) // convenience alias
RateLimit.perSecond(2) // 2 requests per second
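To make the arithmetic behind these factories concrete, the sketch below converts a "requests per window" pair into the interval between token additions. The helper name refillInterval is hypothetical and is not part of the RateLimit API; it only illustrates the calculation, using java.time.Duration.

```java
import java.time.Duration;

// Hypothetical helper illustrating the rate-to-interval arithmetic.
final class RefillIntervalSketch {
    // Interval between token additions for `requests` allowed per `window`.
    static Duration refillInterval(int requests, Duration window) {
        if (requests <= 0) {
            throw new IllegalArgumentException("requests must be positive");
        }
        return window.dividedBy(requests);
    }
}
```

Under this calculation, 2 requests per second gives one token every 500 ms, and 60 requests per minute gives one token every second.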

Ensemble level (task-first, shared bucket)

The most common usage. All synthesized agents that inherit the ensemble model share one bucket.

EnsembleOutput result = Ensemble.builder()
    .chatLanguageModel(openAiModel)
    .rateLimit(RateLimit.perMinute(60)) // wraps chatLanguageModel at run time
    .task(Task.of("Research AI trends"))
    .task(Task.of("Write a summary report"))
    .build()
    .run();

All tasks that inherit the ensemble model share the same token bucket, which is created once per ensemble.run() call. Tasks with their own chatLanguageModel or rateLimit are not affected.

A rate limit can also be applied to a specific task’s LLM. There are two sub-cases:

Task has its own chatLanguageModel — the model is wrapped at build time:

Task task = Task.builder()
    .description("Research AI trends")
    .expectedOutput("A report")
    .chatLanguageModel(openAiModel)
    .rateLimit(RateLimit.perMinute(30)) // wraps chatLanguageModel
    .build();

Task inherits the ensemble model — the rate limit is stored on the task and applied to the inherited model (creating a separate bucket from the ensemble-level limit):

Task task = Task.builder()
    .description("Research AI trends")
    .expectedOutput("A report")
    .rateLimit(RateLimit.perMinute(30)) // applied when ensemble assigns a model
    .build();

For explicit agents, use Agent.builder().rateLimit():

Agent researcher = Agent.builder()
    .role("Researcher")
    .goal("Find the latest AI developments")
    .llm(openAiModel)
    .rateLimit(RateLimit.perMinute(60)) // wraps llm at build time
    .build();

Understanding which pattern gives you a shared bucket vs an independent one is key to getting the behaviour you expect.

Ensemble .rateLimit() — shared across all tasks

Ensemble.builder().rateLimit(limit) creates a single RateLimitedChatModel per run() call and gives it to every synthesized agent that inherits the ensemble model. All those agents share the same token bucket.

// One bucket: all tasks compete for the same 60 req/min allowance
Ensemble.builder()
    .chatLanguageModel(openAiModel)
    .rateLimit(RateLimit.perMinute(60)) // shared bucket for the whole ensemble
    .task(Task.of("Research AI trends"))
    .task(Task.of("Analyse the findings"))
    .task(Task.of("Write an executive summary"))
    .build()
    .run();

This is the right choice when you have one API key and want to enforce a global request cap across an entire run.

Task or Agent .rateLimit() — independent bucket per task/agent

Each .rateLimit() on a Task or Agent builder creates a new, separate token bucket for that task or agent. Two tasks both configured with perMinute(30) each get their own 30 req/min allowance — they do not share.

// TWO independent buckets: task1 has 30 req/min, task2 has 30 req/min (separate)
var task1 = Task.builder()
    .description("Research AI trends")
    .chatLanguageModel(openAiModel)
    .rateLimit(RateLimit.perMinute(30)) // bucket A
    .build();
var task2 = Task.builder()
    .description("Write a summary")
    .chatLanguageModel(openAiModel)
    .rateLimit(RateLimit.perMinute(30)) // bucket B (independent from A)
    .build();

Use this when different tasks or agents have different quotas (e.g. a fast model with a higher cap and a slow model with a lower one).

Explicit shared instance — share across selected tasks/agents

To share one bucket across a subset of tasks or agents, create one RateLimitedChatModel instance and pass it explicitly wherever you want it:

// One bucket shared by researcher and writer; analyst has its own
var shared = RateLimitedChatModel.of(openAiModel, RateLimit.perMinute(60));
var separate = RateLimitedChatModel.of(openAiModel, RateLimit.perMinute(20));
var researcher = Agent.builder().role("Researcher").goal("Research").llm(shared).build();
var writer = Agent.builder().role("Writer").goal("Write").llm(shared).build();
var analyst = Agent.builder().role("Analyst").goal("Analyse").llm(separate).build();

This works for explicit agents. For task-first (agentless) tasks, pass the shared model as chatLanguageModel:

var shared = RateLimitedChatModel.of(openAiModel, RateLimit.perMinute(60));
var task1 = Task.builder()
    .description("Research")
    .chatLanguageModel(shared) // shares bucket with task2
    .build();
var task2 = Task.builder()
    .description("Write")
    .chatLanguageModel(shared) // same bucket
    .build();

| Approach | Bucket sharing |
| --- | --- |
| Ensemble.builder().rateLimit() | Shared across all synthesized agents that use the ensemble model |
| Task.builder().rateLimit() | Independent per task |
| Agent.builder().rateLimit() | Independent per agent |
| RateLimitedChatModel.of(model, limit) passed to multiple agents/tasks | Shared (same object instance = same bucket) |
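The "same object instance = same bucket" rule follows directly from the decorator pattern, and a toy version shows why. The sketch below is a simplification built on stated assumptions: Model and Limited are hypothetical stand-ins for ChatModel and RateLimitedChatModel, and a fixed, non-refilling budget stands in for the token bucket.

```java
// Hypothetical stand-ins; a fixed budget replaces the real refilling bucket.
final class SharedBucketSketch {
    // Stand-in for a chat model interface.
    interface Model {
        String chat(String prompt);
    }

    // Stand-in decorator: each instance owns exactly one budget.
    static final class Limited implements Model {
        private final Model delegate;
        private int remaining;

        Limited(Model delegate, int budget) {
            this.delegate = delegate;
            this.remaining = budget;
        }

        @Override
        public synchronized String chat(String prompt) {
            if (remaining == 0) {
                throw new IllegalStateException("budget exhausted");
            }
            remaining--; // every caller holding this instance draws from the same budget
            return delegate.chat(prompt);
        }

        synchronized int remaining() {
            return remaining;
        }
    }
}
```

Two references to one Limited instance drain the same budget, whereas two separately constructed instances that wrap the same underlying model keep independent budgets. That mirrors the difference between passing one wrapped model to several agents and calling .rateLimit() on each builder.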

By default, threads wait up to 30 seconds for a token. If no token is available within the timeout, RateLimitTimeoutException is thrown.

Customise the timeout with the three-argument factory:

var model = RateLimitedChatModel.of(
    openAiModel,
    RateLimit.perMinute(60),
    Duration.ofSeconds(60) // wait up to 60 seconds before timing out
);

When the timeout is exceeded, RateLimitTimeoutException propagates up as a TaskExecutionException. Either handle it or increase the timeout.
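One way to handle the failure is to inspect the exception's cause chain and fall back only when a rate-limit timeout is found. The sketch below rests on assumptions: the two exception classes are local stand-ins declared so that the example compiles on its own (the real ones come from the library), the helper names are hypothetical, and it is assumed that the original RateLimitTimeoutException is reachable through getCause().

```java
import java.util.function.Supplier;

// Hypothetical sketch; the nested exception classes are stand-ins for the library's.
final class TimeoutHandlingSketch {
    static class RateLimitTimeoutException extends RuntimeException {
        RateLimitTimeoutException(String message) {
            super(message);
        }
    }

    static class TaskExecutionException extends RuntimeException {
        TaskExecutionException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Walks the cause chain looking for a rate-limit timeout (assumed to be preserved as a cause).
    static boolean isRateLimitTimeout(Throwable error) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (t instanceof RateLimitTimeoutException) {
                return true;
            }
        }
        return false;
    }

    // Runs the action; returns the fallback only for rate-limit timeouts, otherwise rethrows.
    static String runOrFallback(Supplier<String> action, String fallback) {
        try {
            return action.get();
        } catch (TaskExecutionException e) {
            if (isRateLimitTimeout(e)) {
                return fallback;
            }
            throw e;
        }
    }
}
```

Checking the cause rather than catching every TaskExecutionException keeps unrelated task failures visible instead of silently replacing them with the fallback.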


RateLimitedChatModel is thread-safe. Multiple threads (for example, the virtual threads of a parallel workflow) can call chat() concurrently; access to the shared token bucket is guarded by a ReentrantLock.


The token-bucket implementation uses only java.util.concurrent.locks. No third-party rate-limiting library is required.