
Self-Optimizing Agent Tasks: Persistent Reflection Loops in Java

Task definitions are fixed in code. You describe what the task should do, wire up a model, and run. The prompt stays the same unless you go back and edit it.

In practice, you often discover after a few runs that the instructions could be more precise. The LLM misses an edge case you didn’t anticipate. The output format drifts in ways you didn’t specify. You revise the description, redeploy, and try again.

The harder version of this problem is: what if the instructions could improve themselves?

Task reflection is a persistent, automated feedback loop built into the task execution lifecycle. After a task completes successfully — output accepted, guardrails passed, reviews approved — an LLM-backed analysis step reviews whether the task’s instructions could be improved. Improvements are stored in a ReflectionStore and injected into the task’s prompt on subsequent runs. The original task definition is never modified.

This post covers how reflection works, what the API looks like, and where the tradeoffs sit.


Phase review and task reflection are often confused because both involve quality analysis. The distinction is in scope and timing:

| | Phase Review | Task Reflection |
|---|---|---|
| Trigger | After phase completes | After task output accepted |
| Scope | Within a single Ensemble.run() | Across multiple Ensemble.run() calls |
| Purpose | Fix inadequate output this run | Improve instructions for future runs |
| Persistence | Transient (lost after run) | Persistent (stored between runs) |
| Initiated by | External reviewer | Automated LLM analysis |

Phase review fixes output within a run. Task reflection improves instructions across runs. They compose: a task can have both.


Enable reflection with .reflect(true) on a task, and configure a ReflectionStore on the ensemble:

ReflectionStore store = new InMemoryReflectionStore();

Task research = Task.builder()
    .description("research the top 5 trends in cloud-native Java for 2026")
    .reflect(true)
    .chatModel(model)
    .build();

// Run 1: no prior reflections; task executes normally
EnsembleOutput run1 = Ensemble.builder()
    .tasks(List.of(research))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();

// Reflection fires after run 1 completes; improvements stored in `store`

// Run 2: prior reflections injected into the prompt automatically
EnsembleOutput run2 = Ensemble.builder()
    .tasks(List.of(research))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();

The store is the key. Pass the same store instance across run() calls and the accumulated reflections persist. The task definition — the Task object — is the same every run. The difference is what the reflection store contributes to the prompt.


Reflection fires at the end of the task lifecycle, after all other post-processing:

1. Task executes (LLM call)
2. Guardrails evaluate output
3. Review gate runs (if configured)
4. Memory scopes write
5. [Reflection] LLM analyzes output; generates improvement; stores in ReflectionStore

If the task fails, guardrails reject the output, or the review gate retries, reflection does not fire. Reflection only fires on a fully accepted output.

On the next run of the same task:

1. ReflectionStore loads prior reflections for this task identity
2. Reflections injected into prompt
3. Task executes with improved instructions
4. [Reflection] New analysis fires; improvement stored

The task is identified by a TaskIdentity derived from its description. Two tasks with the same description share the same reflection history.
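Description-keyed identity can be sketched as a hash of the description string. This is an illustration only: the method name and the choice of SHA-256 are assumptions, not the framework's actual derivation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch only: key reflections by a hash of the task description, so
// equal descriptions map to the same reflection history. SHA-256 is an
// assumption; AgentEnsemble's actual hash may differ.
public class IdentitySketch {
    public static String identityKey(String description) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(description.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Under this scheme, two Task objects built with the same description string resolve to the same key and therefore share stored reflections.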


The reflection store contributes an additional section to the task’s prompt:

[original task description]
## Instruction Refinements
Based on previous runs, the following refinements have been found to improve output quality:
- Be specific about the time range: results should cover events within the last 12 months only.
- Structure the output as a numbered list with a one-sentence summary per trend.
- For each trend, cite at least one concrete project or company as evidence.

The injection is additive. Original instructions are preserved. Reflections narrow, clarify, or extend them based on what previous outputs revealed.
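The additive assembly reduces to string concatenation. A sketch, with the section wording taken from the example above (the framework's internal API for this step is not shown here):

```java
import java.util.List;

// Sketch of additive prompt injection: the original description is kept
// verbatim and a refinement section is appended after it.
public class PromptSketch {
    public static String assemble(String description, List<String> reflections) {
        if (reflections.isEmpty()) {
            return description; // no prior runs: prompt is unchanged
        }
        StringBuilder sb = new StringBuilder(description)
                .append("\n\n## Instruction Refinements\n")
                .append("Based on previous runs, the following refinements ")
                .append("have been found to improve output quality:\n");
        for (String r : reflections) {
            sb.append("- ").append(r).append("\n");
        }
        return sb.toString();
    }
}
```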


By default, the framework uses all stored reflections for a task. You can bound the number injected via ReflectionConfig:

Task analysis = Task.builder()
    .description("analyze customer sentiment from support tickets")
    .reflect(true)
    .reflectionConfig(ReflectionConfig.builder()
        .maxReflections(5)
        .build())
    .chatModel(model)
    .build();

With maxReflections(5), only the 5 most recent reflections are injected. Older reflections remain in the store but are not included in the prompt. This prevents prompt bloat as the number of runs grows.
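The trimming behavior amounts to keeping the N most recent entries at injection time. A sketch, using a local TaskReflection stand-in (the real record lives in the framework and its field names may differ):

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

// Local stand-in for the framework's TaskReflection record.
record TaskReflection(String content, Instant createdAt) {}

public class TrimSketch {
    // Keep only the `max` most recent reflections, as maxReflections does
    // at injection time; older entries stay in the store, untouched.
    public static List<TaskReflection> mostRecent(List<TaskReflection> all, int max) {
        return all.stream()
                .sorted(Comparator.comparing(TaskReflection::createdAt).reversed())
                .limit(max)
                .toList();
    }
}
```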

The default strategy uses an LLM call to analyze the task output and generate an improvement. You can substitute a custom ReflectionStrategy:

public class DomainReflectionStrategy implements ReflectionStrategy {
    @Override
    public Optional<String> reflect(ReflectionInput input) {
        String output = input.taskOutput();
        // custom analysis: check for required sections, format, length
        if (!output.contains("## Summary")) {
            return Optional.of("Always include a ## Summary section as the first heading");
        }
        // no improvement identified this time
        return Optional.empty();
    }
}

Task.builder()
    .description("write a technical design document")
    .reflect(true)
    .reflectionConfig(ReflectionConfig.builder()
        .strategy(new DomainReflectionStrategy())
        .build())
    .chatModel(model)
    .build();

A custom strategy can use deterministic rules, call a different model, or apply domain-specific analysis. Returning Optional.empty() skips storage for that run.


ReflectionStore is an interface with two methods:

public interface ReflectionStore {
    List<TaskReflection> load(TaskIdentity identity);
    void store(TaskIdentity identity, TaskReflection reflection);
}

InMemoryReflectionStore is included for development and testing. It holds reflections in a ConcurrentHashMap and loses state when the process stops.
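A minimal version of that shape looks like the following, with local stand-ins for the framework types (record fields and the map layout are assumptions for illustration, not InMemoryReflectionStore's actual source):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Local stand-ins for the framework types; names and fields are assumptions.
record TaskIdentity(String key) {}
record TaskReflection(String content, Instant createdAt) {}

interface ReflectionStore {
    List<TaskReflection> load(TaskIdentity identity);
    void store(TaskIdentity identity, TaskReflection reflection);
}

// Sketch in the spirit of InMemoryReflectionStore: a ConcurrentHashMap
// keyed by identity; all state is lost when the process stops.
class InMemoryStoreSketch implements ReflectionStore {
    private final Map<TaskIdentity, List<TaskReflection>> reflections = new ConcurrentHashMap<>();

    @Override
    public List<TaskReflection> load(TaskIdentity identity) {
        return List.copyOf(reflections.getOrDefault(identity, List.of()));
    }

    @Override
    public void store(TaskIdentity identity, TaskReflection reflection) {
        reflections.computeIfAbsent(identity, k -> Collections.synchronizedList(new ArrayList<>()))
                   .add(reflection);
    }
}
```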

For production, implement ReflectionStore against whatever persistence layer makes sense for your system — a relational database, a document store, or a key-value store:

public class JdbcReflectionStore implements ReflectionStore {
    private final DataSource dataSource;

    @Override
    public List<TaskReflection> load(TaskIdentity identity) {
        // SELECT content FROM task_reflections WHERE task_id = ?
        // ORDER BY created_at DESC LIMIT maxReflections
    }

    @Override
    public void store(TaskIdentity identity, TaskReflection reflection) {
        // INSERT INTO task_reflections (task_id, content, created_at) VALUES (?, ?, ?)
    }
}

The TaskReflection record holds the improvement text and a timestamp. TaskIdentity includes the task description hash used for keying.


Reflection is opt-in per task. Tasks without .reflect(true) are unaffected even if a ReflectionStore is configured on the ensemble. You can enable reflection for high-value tasks and leave it off for tasks where the instructions are stable or where the cost of an extra LLM call isn’t justified.

Ensemble.builder()
    .tasks(List.of(
        stableDataFetchTask,   // no reflection
        evolvingAnalysisTask,  // .reflect(true)
        stableFormattingTask   // no reflection
    ))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();

The store is queried and written only for tasks with reflection enabled.


Reflection adds an LLM call per reflective task per run. For tasks that run thousands of times, this adds up. The cost is bounded if reflections converge — if the task’s instructions become stable after a few runs, reflections may produce no new improvements and Optional.empty() returns more often.

Reflections can drift. If a task’s purpose changes — the description is updated, the downstream context changes, the data it processes shifts — earlier reflections may no longer apply. maxReflections helps here by aging out old improvements. For significant task changes, clearing the stored reflections for that task is reasonable.

Reflection is not a substitute for good initial instructions. A task with fundamentally unclear instructions will accumulate reflections that patch around the ambiguity. The better use is to start with reasonable instructions and use reflection to sharpen them in response to real outputs over time.

The original task definition is never modified. All improvements live in the store. This is a deliberate choice: the source of truth for what a task does remains in code, not in a mutable prompt that silently drifts over time.


Guide: Task Reflection | Design: Task Reflection | GitHub

AgentEnsemble is open-source under the MIT license.