# Self-Optimizing Agent Tasks: Persistent Reflection Loops in Java
Task definitions are written at compile time. You describe what the task should do, wire up a model, and run. The prompt stays fixed unless you go back and edit it.
In practice, you often discover after a few runs that the instructions could be more precise. The LLM misses an edge case you didn’t anticipate. The output format drifts in ways you didn’t specify. You revise the description, redeploy, and try again.
The harder version of this problem is: what if the instructions could improve themselves?
Task reflection is a persistent, automated feedback loop built into the task execution lifecycle. After a task completes successfully — output accepted, guardrails passed, reviews approved — an LLM-backed analysis step reviews whether the task’s instructions could be improved. Improvements are stored in a ReflectionStore and injected into the task’s prompt on subsequent runs. The original task definition is never modified.
This post covers how reflection works, what the API looks like, and where the tradeoffs sit.
## Reflection vs Phase Review

These two mechanisms are often confused because both involve quality analysis. The distinction is in scope and timing:
| | Phase Review | Task Reflection |
|---|---|---|
| Trigger | After phase completes | After task output accepted |
| Scope | Within a single Ensemble.run() | Across multiple Ensemble.run() calls |
| Purpose | Fix inadequate output this run | Improve instructions for future runs |
| Persistence | Transient — lost after run | Persistent — stored between runs |
| Initiated by | External reviewer | Automated LLM analysis |
Phase review fixes output within a run. Task reflection improves instructions across runs. They compose: a task can have both.
## Quick Start

Enable reflection with .reflect(true) on a task, and configure a ReflectionStore on the ensemble:
```java
ReflectionStore store = new InMemoryReflectionStore();

Task research = Task.builder()
    .description("research the top 5 trends in cloud-native Java for 2026")
    .reflect(true)
    .chatModel(model)
    .build();

// Run 1: no prior reflections; task executes normally
EnsembleOutput run1 = Ensemble.builder()
    .tasks(List.of(research))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();

// Reflection fires after run 1 completes; improvements stored in `store`

// Run 2: prior reflections injected into the prompt automatically
EnsembleOutput run2 = Ensemble.builder()
    .tasks(List.of(research))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();
```

The store is the key. Pass the same store instance across run() calls and the accumulated reflections persist. The task definition — the Task object — is the same every run. The difference is what the reflection store contributes to the prompt.
## The Execution Lifecycle

Reflection fires at the end of the task lifecycle, after all other post-processing:

1. Task executes (LLM call)
2. Guardrails evaluate output
3. Review gate runs (if configured)
4. Memory scopes write
5. [Reflection] LLM analyzes output; generates improvement; stores in ReflectionStore

If the task fails, guardrails reject the output, or the review gate retries, reflection does not fire. Reflection only fires on a fully accepted output.
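The acceptance rule in step 5 can be sketched as a simple predicate. The names here are illustrative, not the framework's actual API: reflection fires only when every earlier stage accepted the output.

```java
public class ReflectionGate {

    /**
     * Hypothetical sketch of the gating rule: reflection fires only on a
     * fully accepted output — task succeeded, guardrails passed, and any
     * configured review gate approved.
     */
    public static boolean shouldReflect(boolean taskSucceeded,
                                        boolean guardrailsPassed,
                                        boolean reviewApproved) {
        return taskSucceeded && guardrailsPassed && reviewApproved;
    }
}
```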
On the next run of the same task:
1. ReflectionStore loads prior reflections for this task identity
2. Reflections injected into prompt
3. Task executes with improved instructions
4. [Reflection] New analysis fires; improvement stored

The task is identified by a TaskIdentity derived from its description. Two tasks with the same description share the same reflection history.
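One plausible way to derive a stable identity from a description is to hash it, so that identical descriptions map to the same key. This is a sketch under that assumption; the framework's actual hashing scheme and class shape may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class TaskIdentitySketch {

    private final String hash;

    private TaskIdentitySketch(String hash) {
        this.hash = hash;
    }

    // Same description => same identity => same reflection history.
    public static TaskIdentitySketch of(String description) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(description.getBytes(StandardCharsets.UTF_8));
            return new TaskIdentitySketch(HexFormat.of().formatHex(digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public String hash() {
        return hash;
    }
}
```

A consequence worth knowing: renaming or rewording a task's description gives it a fresh identity and an empty reflection history.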
## What Gets Injected

The reflection store contributes an additional section to the task's prompt:

```
[original task description]

## Instruction Refinements

Based on previous runs, the following refinements have been found to improve output quality:

- Be specific about the time range: results should cover events within the last 12 months only.
- Structure the output as a numbered list with a one-sentence summary per trend.
- For each trend, cite at least one concrete project or company as evidence.
```

The injection is additive. Original instructions are preserved. Reflections narrow, clarify, or extend them based on what previous outputs revealed.
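The assembly step amounts to appending a refinements section when any reflections exist. A minimal sketch, assuming reflections arrive as plain strings (the class and method names here are illustrative, not the framework's API):

```java
import java.util.List;

public class PromptAssemblerSketch {

    /**
     * Appends stored refinements under an "## Instruction Refinements"
     * section. Returns the original description unchanged when there
     * are no reflections, so the injection is strictly additive.
     */
    public static String assemble(String description, List<String> reflections) {
        if (reflections.isEmpty()) {
            return description;
        }
        StringBuilder sb = new StringBuilder(description);
        sb.append("\n\n## Instruction Refinements\n\n")
          .append("Based on previous runs, the following refinements ")
          .append("have been found to improve output quality:\n");
        for (String reflection : reflections) {
            sb.append("- ").append(reflection).append("\n");
        }
        return sb.toString();
    }
}
```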
## Reflection Configuration

### Bounding the History

By default, the framework uses all stored reflections for a task. You can bound the number injected via ReflectionConfig:
```java
Task analysis = Task.builder()
    .description("analyze customer sentiment from support tickets")
    .reflect(true)
    .reflectionConfig(ReflectionConfig.builder()
        .maxReflections(5)
        .build())
    .chatModel(model)
    .build();
```

With maxReflections(5), only the 5 most recent reflections are injected. Older reflections remain in the store but are not included in the prompt. This prevents prompt bloat as the number of runs grows.
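The bounding itself reduces to taking the most recent N entries. A sketch, assuming the stored list is ordered oldest-first:

```java
import java.util.List;

public class ReflectionWindowSketch {

    /**
     * Keeps only the last maxReflections entries of an oldest-first
     * list. Returns the whole list when it is already within bounds.
     */
    public static List<String> mostRecent(List<String> reflections, int maxReflections) {
        int from = Math.max(0, reflections.size() - maxReflections);
        return reflections.subList(from, reflections.size());
    }
}
```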
### Custom Reflection Strategy

The default strategy uses an LLM call to analyze the task output and generate an improvement. You can substitute a custom ReflectionStrategy:
```java
public class DomainReflectionStrategy implements ReflectionStrategy {

    @Override
    public Optional<String> reflect(ReflectionInput input) {
        String output = input.taskOutput();
        // custom analysis: check for required sections, format, length
        if (!output.contains("## Summary")) {
            return Optional.of("Always include a ## Summary section as the first heading");
        }
        // no improvement identified this time
        return Optional.empty();
    }
}
```

```java
Task.builder()
    .description("write a technical design document")
    .reflect(true)
    .reflectionConfig(ReflectionConfig.builder()
        .strategy(new DomainReflectionStrategy())
        .build())
    .chatModel(model)
    .build();
```

A custom strategy can use deterministic rules, call a different model, or apply domain-specific analysis. Returning Optional.empty() skips storage for that run.
## The Reflection Store

ReflectionStore is an interface with two methods:
```java
public interface ReflectionStore {
    List<TaskReflection> load(TaskIdentity identity);
    void store(TaskIdentity identity, TaskReflection reflection);
}
```

InMemoryReflectionStore is included for development and testing. It holds reflections in a ConcurrentHashMap and loses state when the process stops.
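An in-memory implementation along those lines might look like the following sketch. The record definitions are simplified stand-ins for the framework's types so the example compiles on its own; the real InMemoryReflectionStore's internals may differ.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Simplified stand-ins for the framework's types.
record TaskIdentity(String key) {}
record TaskReflection(String content, Instant createdAt) {}

interface ReflectionStore {
    List<TaskReflection> load(TaskIdentity identity);
    void store(TaskIdentity identity, TaskReflection reflection);
}

class InMemoryReflectionStore implements ReflectionStore {

    // One append-only list of reflections per task identity.
    private final Map<TaskIdentity, List<TaskReflection>> byIdentity =
            new ConcurrentHashMap<>();

    @Override
    public List<TaskReflection> load(TaskIdentity identity) {
        // Defensive copy: callers cannot mutate the stored history.
        return List.copyOf(byIdentity.getOrDefault(identity, List.of()));
    }

    @Override
    public void store(TaskIdentity identity, TaskReflection reflection) {
        byIdentity.computeIfAbsent(identity, k -> new CopyOnWriteArrayList<>())
                  .add(reflection);
    }
}
```

Because the map keys on TaskIdentity, two tasks that resolve to the same identity read and append to the same history, matching the sharing behavior described earlier.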
For production, implement ReflectionStore against whatever persistence layer makes sense for your system — a relational database, a document store, or a key-value store:
```java
public class JdbcReflectionStore implements ReflectionStore {

    private final DataSource dataSource;

    public JdbcReflectionStore(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public List<TaskReflection> load(TaskIdentity identity) {
        // SELECT content FROM task_reflections WHERE task_id = ?
        // ORDER BY created_at DESC LIMIT maxReflections
    }

    @Override
    public void store(TaskIdentity identity, TaskReflection reflection) {
        // INSERT INTO task_reflections (task_id, content, created_at) VALUES (?, ?, ?)
    }
}
```

The TaskReflection record holds the improvement text and a timestamp. TaskIdentity includes the task description hash used for keying.
## Disabling Reflection Selectively

Reflection is opt-in per task. Tasks without .reflect(true) are unaffected even if a ReflectionStore is configured on the ensemble. You can enable reflection for high-value tasks and leave it off for tasks where the instructions are stable or where the cost of an extra LLM call isn't justified.
```java
Ensemble.builder()
    .tasks(List.of(
        stableDataFetchTask,   // no reflection
        evolvingAnalysisTask,  // .reflect(true)
        stableFormattingTask   // no reflection
    ))
    .reflectionStore(store)
    .chatModel(model)
    .build()
    .run();
```

The store is queried and written only for tasks with reflection enabled.
## Tradeoffs

Reflection adds an LLM call per reflective task per run. For tasks that run thousands of times, this adds up. The cost is bounded if reflections converge — if the task's instructions become stable after a few runs, reflections may produce no new improvements and Optional.empty() returns more often.
Reflections can drift. If a task’s purpose changes — the description is updated, the downstream context changes, the data it processes shifts — earlier reflections may no longer apply. maxReflections helps here by aging out old improvements. For significant task changes, clearing the stored reflections for that task is reasonable.
Reflection is not a substitute for good initial instructions. A task with fundamentally unclear instructions will accumulate reflections that patch around the ambiguity. The better use is to start with reasonable instructions and use reflection to sharpen them in response to real outputs over time.
The original task definition is never modified. All improvements live in the store. This is a deliberate choice: the source of truth for what a task does remains in code, not in a mutable prompt that silently drifts over time.
AgentEnsemble is open-source under the MIT license.