
Human-in-the-Loop Agent Systems in Java

Fully autonomous agents make great demos. In production, someone on your team will eventually ask: “Can a human check this before it goes out?”

The answer should be yes, and it shouldn’t require bolting on a custom approval system. Human-in-the-loop isn’t a limitation of agent systems — it’s a feature. The best agent architectures make it easy to insert human judgment at exactly the right points, without breaking the execution flow.

This post covers how AgentEnsemble handles human review: the review handler API, review policies, pre-flight validation, and the patterns that make this work in practice.

Why put a human in the loop at all? Three reasons show up repeatedly in production agent deployments:

  1. Quality assurance. LLMs produce plausible-sounding output that’s sometimes wrong. A human reviewer catches factual errors, hallucinations, and tone problems that automated checks miss.

  2. Compliance. Regulated industries (finance, healthcare, legal) often require human approval before AI-generated content is used in customer-facing contexts. It’s not optional.

  3. Calibration. When you deploy a new agent workflow, you want to review the first few outputs to verify the agents are behaving as expected before letting them run autonomously.

The core abstraction is reviewHandler() — a function that receives a task’s output and returns a Review decision:

// assumes a Scanner defined in the enclosing scope:
// Scanner scanner = new Scanner(System.in);
Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("=== REVIEW REQUIRED ===");
        System.out.println("Task: " + taskOutput.getTaskDescription());
        System.out.println("Agent: " + taskOutput.getAgentRole());
        System.out.println("Output:\n" + taskOutput.getRaw());
        System.out.println("========================");
        System.out.print("Decision (approve/reject/edit): ");
        String decision = scanner.nextLine().trim().toLowerCase();
        return switch (decision) {
            case "approve" -> Review.approve();
            case "reject" -> {
                System.out.print("Reason: ");
                yield Review.reject(scanner.nextLine());
            }
            case "edit" -> {
                System.out.print("Enter corrected output: ");
                yield Review.edit(scanner.nextLine());
            }
            default -> Review.approve();
        };
    })
    .build()
    .run();

Three possible outcomes:

  • Review.approve() — accept the output as-is. The ensemble continues with the next task.
  • Review.reject(reason) — reject the output. The task is re-executed with the rejection reason fed back to the agent as additional context.
  • Review.edit(correctedOutput) — replace the output with a human-provided correction. The ensemble continues with the edited version.

The review handler is a plain Java function. It can be a console prompt (as above), a REST call to an approval service, a Slack message that blocks until someone responds, or a database write that pauses the ensemble until a flag is flipped.
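For instance, a database-backed handler simply polls until a reviewer has written a decision. A minimal, framework-agnostic sketch of the blocking poll (the `ReviewPoller` name and `fetchDecision` supplier are illustrative, not AgentEnsemble API):

```java
import java.util.Optional;
import java.util.function.Supplier;

public class ReviewPoller {
    /**
     * Polls the supplier until it yields a decision or the attempt
     * budget is exhausted. The decision is whatever your store holds,
     * e.g. "approve" or "reject:<reason>".
     */
    public static String awaitDecision(Supplier<Optional<String>> fetchDecision,
                                       long pollIntervalMillis,
                                       int maxAttempts) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Optional<String> decision = fetchDecision.get();
            if (decision.isPresent()) {
                return decision.get();
            }
            Thread.sleep(pollIntervalMillis);
        }
        throw new IllegalStateException("No review decision within budget");
    }
}
```

Inside `reviewHandler()` you would map the returned string onto `Review.approve()`, `Review.reject(reason)`, or `Review.edit(corrected)`.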

Not every task needs human review. Review policies control which tasks trigger the handler:

Ensemble.builder()
    .agents(researcher, writer, editor)
    .tasks(researchTask, writeTask, editTask)
    .chatLanguageModel(model)
    .reviewHandler(this::handleReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Available policies:

  • REVIEW_ALL — Every task output goes through review. Use for high-stakes workflows.
  • REVIEW_FAILED — Only tasks that failed and were retried, or hit max iterations.
  • FIRST_TASK_ONLY — Review the first task output to calibrate. If approved, the rest run without review.

FIRST_TASK_ONLY is particularly useful during the deployment phase. You review the first output to verify the agents are producing what you expect, then let the pipeline run autonomously for the remaining tasks.
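If you ever need the same calibration behavior inside a custom handler rather than via the built-in policy, the logic reduces to a one-shot gate. A framework-agnostic sketch (the class name is ours, not AgentEnsemble API):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class FirstTaskOnlyGate {
    private final AtomicBoolean firstSeen = new AtomicBoolean(false);

    /**
     * Returns true exactly once -- for the first task output --
     * mirroring FIRST_TASK_ONLY semantics. Thread-safe, so it also
     * works if tasks complete concurrently.
     */
    public boolean needsReview() {
        return firstSeen.compareAndSet(false, true);
    }
}
```

A handler would call `needsReview()` and either prompt the human or return `Review.approve()` immediately.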

Sometimes you want an automated quality check before a human sees the output. The beforeReview() hook runs first:

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .beforeReview(taskOutput -> {
        String raw = taskOutput.getRaw();
        // Automated checks
        if (raw == null || raw.isBlank()) {
            return Review.reject("Empty output");
        }
        if (raw.length() < 200) {
            return Review.reject("Output too short -- minimum 200 characters");
        }
        if (raw.contains("I don't know") || raw.contains("I cannot")) {
            return Review.reject("Agent declined the task");
        }
        // Passed automated checks -- proceed to human review
        return Review.skip();
    })
    .reviewHandler(this::humanReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

The flow is:

  1. Task completes.
  2. beforeReview() runs automated checks.
    • If it returns Review.reject(), the task re-executes. No human is bothered.
    • If it returns Review.skip(), the output passes to the human reviewHandler().
    • If it returns Review.approve(), the output is accepted without human review.
  3. reviewHandler() presents the output to a human (if beforeReview didn’t already decide).

This pattern keeps humans focused on judgment calls, not on catching obvious failures that a simple check could handle.
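As the number of automated checks grows, it helps to keep them as a list rather than a chain of ifs. A framework-agnostic sketch (the `PreflightChecks` helper and `Check` interface are illustrative names, not AgentEnsemble API):

```java
import java.util.List;
import java.util.Optional;

public class PreflightChecks {
    /** A single automated check: returns a rejection reason, or empty on pass. */
    public interface Check {
        Optional<String> apply(String output);
    }

    /** Runs checks in order and returns the first failure, if any. */
    public static Optional<String> firstFailure(String output, List<Check> checks) {
        for (Check check : checks) {
            Optional<String> failure = check.apply(output);
            if (failure.isPresent()) {
                return failure;
            }
        }
        return Optional.empty();
    }
}
```

In `beforeReview()` you would then map a present failure to `Review.reject(reason)` and an empty result to `Review.skip()`.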

When a review is rejected — whether by beforeReview or the human reviewer — the framework re-executes the task with the rejection reason injected as additional context. The agent sees:

Previous attempt was rejected. Reason: “Output too short — minimum 200 characters”

This gives the agent a chance to correct its approach. It’s not just a retry — the agent has feedback on what went wrong.

You can limit the number of review cycles to prevent infinite loops:

Task criticalTask = Task.builder()
    .description("Write the executive summary")
    .expectedOutput("A concise, accurate summary")
    .agent(writer)
    .maxOutputRetries(3) // max 3 re-executions after rejection
    .build();

If the output is still rejected after 3 attempts, the task fails with a clear error.
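The retry-with-feedback loop described above can be sketched in plain Java, independent of the framework (the `RejectionRetryLoop` and `Reviewer` names are ours, and the exact context-injection wording is assumed from the example above):

```java
import java.util.Optional;
import java.util.function.Function;

public class RejectionRetryLoop {
    /** A review: empty means approved, present means a rejection reason. */
    public interface Reviewer {
        Optional<String> review(String output);
    }

    /**
     * Runs the task, and on rejection re-runs it with the rejection
     * reason fed back as extra context, up to maxOutputRetries times.
     */
    public static String runWithReview(Function<String, String> task,
                                       Reviewer reviewer,
                                       int maxOutputRetries) {
        String feedback = "";
        for (int attempt = 0; attempt <= maxOutputRetries; attempt++) {
            String output = task.apply(feedback);
            Optional<String> rejection = reviewer.review(output);
            if (rejection.isEmpty()) {
                return output; // approved
            }
            feedback = "Previous attempt was rejected. Reason: \""
                    + rejection.get() + "\"";
        }
        throw new IllegalStateException(
                "Output rejected after " + maxOutputRetries + " retries");
    }
}
```

Note the bound is on re-executions: with `maxOutputRetries(3)` the task runs at most four times in total before failing.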

A simple console handler is good for testing and debugging:

.reviewHandler(taskOutput -> {
    System.out.println(taskOutput.getRaw());
    System.out.print("Approve? (y/n): ");
    return scanner.nextLine().equals("y")
            ? Review.approve()
            : Review.reject("Rejected by developer");
})

Block the ensemble until an external approval system responds:

.reviewHandler(taskOutput -> {
    // Submit for review
    String reviewId = approvalService.submitForReview(
            taskOutput.getTaskDescription(),
            taskOutput.getRaw());
    // Poll until a decision is made
    ReviewDecision decision = approvalService.awaitDecision(reviewId);
    return switch (decision.status()) {
        case APPROVED -> Review.approve();
        case REJECTED -> Review.reject(decision.reason());
        case EDITED -> Review.edit(decision.correctedOutput());
    };
})

Send a Slack message and wait for a reaction:

.reviewHandler(taskOutput -> {
    String messageId = slack.postMessage(
            "#agent-reviews",
            formatForSlack(taskOutput));
    // Block until a thumbs-up or thumbs-down reaction
    SlackReaction reaction = slack.awaitReaction(messageId,
            Duration.ofMinutes(30));
    return reaction.isPositive()
            ? Review.approve()
            : Review.reject("Rejected via Slack");
})

Skip the human entirely — use beforeReview for automated quality gates:

.beforeReview(taskOutput -> {
    QualityScore score = qualityChecker.evaluate(taskOutput.getRaw());
    if (score.overall() >= 0.8) {
        return Review.approve(); // good enough, no human needed
    } else if (score.overall() >= 0.5) {
        return Review.skip(); // borderline, send to human
    } else {
        return Review.reject("Quality score too low: " + score.overall());
    }
})
.reviewHandler(this::humanReviewForBorderlineCases)

Use task-level configuration to vary review intensity:

// Critical task -- always reviewed
Task customerEmail = Task.builder()
    .description("Draft a response to the customer complaint")
    .expectedOutput("Professional, empathetic email response")
    .agent(writer)
    .build();

// Internal task -- skip review
Task internalSummary = Task.builder()
    .description("Summarize the complaint for internal tracking")
    .expectedOutput("Brief internal summary")
    .agent(writer)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(customerEmail, internalSummary)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        // Only review customer-facing tasks
        if (taskOutput.getTaskDescription().contains("customer")) {
            return humanReview(taskOutput);
        }
        return Review.approve(); // skip internal tasks
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();

Combining Review with Other Production Features

Review gates compose naturally with other AgentEnsemble features:

Guardrails catch invalid content at the agent level. Review catches quality issues at the workflow level.

Agent writer = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing copy")
    .outputGuardrail(output -> {
        if (containsPII(output)) {
            return GuardrailResult.reject("Output contains PII");
        }
        return GuardrailResult.accept();
    })
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .beforeReview(this::automatedQualityCheck)
    .reviewHandler(this::humanReview)
    .build()
    .run();

The execution flow is: Agent runs -> Guardrail validates -> Pre-flight check -> Human review. Each layer catches different classes of problems.

Track review decisions alongside other execution events:

Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        Review decision = humanReview(taskOutput);
        auditLog.record(taskOutput.getTaskDescription(),
                decision.getType(), decision.getReason());
        return decision;
    })
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            metrics.recordTaskCompletion(e);
        }
    })
    .build()
    .run();

Review typed output, not raw strings:

record ProposalDraft(
    String title,
    String executiveSummary,
    List<String> keyPoints,
    double estimatedBudget
) {}

Task proposalTask = Task.builder()
    .description("Draft a project proposal")
    .expectedOutput("Structured proposal")
    .agent(writer)
    .outputType(ProposalDraft.class)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(proposalTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        ProposalDraft draft = taskOutput
                .getStructuredOutput(ProposalDraft.class);
        // Review specific fields
        if (draft.estimatedBudget() > 100_000) {
            return Review.reject("Budget exceeds approval threshold");
        }
        if (draft.keyPoints().size() < 3) {
            return Review.reject("Need at least 3 key points");
        }
        return Review.approve();
    })
    .build()
    .run();

Human-in-the-loop isn’t an escape hatch for when agents fail. It’s a first-class architectural decision. The best agent systems are designed with human review points from the start, not retrofitted when something goes wrong in production.

AgentEnsemble makes this easy by treating review as a builder method, not a separate system. Same API, same execution flow, same observability. A human reviewer is just another step in the pipeline.


AgentEnsemble is MIT-licensed and available on GitHub.