Human-in-the-Loop Agent Systems in Java
Fully autonomous agents make great demos. In production, someone on your team will eventually ask: “Can a human check this before it goes out?”
The answer should be yes, and it shouldn’t require bolting on a custom approval system. Human-in-the-loop isn’t a limitation of agent systems — it’s a feature. The best agent architectures make it easy to insert human judgment at exactly the right points, without breaking the execution flow.
This post covers how AgentEnsemble handles human review: the review handler API, review policies, pre-flight validation, and the patterns that make this work in practice.
Why Human-in-the-Loop?
Three reasons show up repeatedly in production agent deployments:
- Quality assurance. LLMs produce plausible-sounding output that’s sometimes wrong. A human reviewer catches factual errors, hallucinations, and tone problems that automated checks miss.
- Compliance. Regulated industries (finance, healthcare, legal) often require human approval before AI-generated content is used in customer-facing contexts. It’s not optional.
- Calibration. When you deploy a new agent workflow, you want to review the first few outputs to verify the agents are behaving as expected before letting them run autonomously.
The Review Handler
The core abstraction is `reviewHandler()` — a function that receives a task’s output and returns a Review decision:
```java
Scanner scanner = new Scanner(System.in);  // java.util.Scanner

Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        System.out.println("=== REVIEW REQUIRED ===");
        System.out.println("Task: " + taskOutput.getTaskDescription());
        System.out.println("Agent: " + taskOutput.getAgentRole());
        System.out.println("Output:\n" + taskOutput.getRaw());
        System.out.println("========================");
        System.out.print("Decision (approve/reject/edit): ");

        String decision = scanner.nextLine().trim().toLowerCase();

        return switch (decision) {
            case "approve" -> Review.approve();
            case "reject" -> {
                System.out.print("Reason: ");
                yield Review.reject(scanner.nextLine());
            }
            case "edit" -> {
                System.out.print("Enter corrected output: ");
                yield Review.edit(scanner.nextLine());
            }
            default -> Review.approve();
        };
    })
    .build()
    .run();
```

Three possible outcomes:
- `Review.approve()` — accept the output as-is. The ensemble continues with the next task.
- `Review.reject(reason)` — reject the output. The task is re-executed with the rejection reason fed back to the agent as additional context.
- `Review.edit(correctedOutput)` — replace the output with a human-provided correction. The ensemble continues with the edited version.
The review handler is a plain Java function. It can be a console prompt (as above), a REST call to an approval service, a Slack message that blocks until someone responds, or a database write that pauses the ensemble until a flag is flipped.
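For the blocking variants (a REST poll, a database flag), the waiting logic is plain Java. Here is a minimal, framework-free sketch of a poller that blocks until an external decision appears or a timeout elapses. `DecisionPoller` and its signature are illustrative, not AgentEnsemble API:

```java
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative helper: blocks the calling thread until the supplier yields
// a decision or the timeout elapses. Inside a review handler, the supplier
// would read an approval flag from your database or approval service.
class DecisionPoller {

    /** Polls until a value is present, or returns empty on timeout. */
    static <T> Optional<T> awaitDecision(Supplier<Optional<T>> source,
                                         Duration timeout,
                                         Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (System.nanoTime() < deadline) {
            Optional<T> value = source.get();
            if (value.isPresent()) {
                return value;
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            }
        }
        return Optional.empty();
    }
}
```

A handler built on this would map the empty (timed-out) case to a rejection or an escalation path, so the ensemble never hangs indefinitely.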
Review Policies
Not every task needs human review. Review policies control which tasks trigger the handler:
```java
Ensemble.builder()
    .agents(researcher, writer, editor)
    .tasks(researchTask, writeTask, editTask)
    .chatLanguageModel(model)
    .reviewHandler(this::handleReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();
```

Available policies:
| Policy | Behavior |
|---|---|
| `REVIEW_ALL` | Every task output goes through review. Use for high-stakes workflows. |
| `REVIEW_FAILED` | Only tasks that failed and were retried, or hit max iterations. |
| `FIRST_TASK_ONLY` | Review the first task output to calibrate. If approved, the rest run without review. |
FIRST_TASK_ONLY is particularly useful during the deployment phase. You review the first output to verify the agents are producing what you expect, then let the pipeline run autonomously for the remaining tasks.
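The calibration semantics can be sketched framework-free as a stateful wrapper: the first output goes through real review, and every later one is auto-approved. `FirstOnlyGate` and its generic decision type are illustrative only (and simplified — it stops reviewing after the first decision regardless of outcome), not AgentEnsemble API:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Function;

// Illustrative sketch of FIRST_TASK_ONLY-style calibration:
// review the first output, auto-approve the rest.
class FirstOnlyGate<I, O> {
    private final AtomicBoolean firstHandled = new AtomicBoolean(false);
    private final Function<I, O> review;  // the "real" review step
    private final O autoApprove;          // decision used after calibration

    FirstOnlyGate(Function<I, O> review, O autoApprove) {
        this.review = review;
        this.autoApprove = autoApprove;
    }

    O decide(I output) {
        // compareAndSet ensures exactly one output reaches the reviewer,
        // even if tasks complete concurrently.
        if (firstHandled.compareAndSet(false, true)) {
            return review.apply(output);
        }
        return autoApprove;
    }
}
```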
Pre-Flight Validation
Sometimes you want an automated quality check before a human sees the output. The `beforeReview()` hook runs first:
```java
Ensemble.builder()
    .agents(researcher, writer)
    .tasks(researchTask, writeTask)
    .chatLanguageModel(model)
    .beforeReview(taskOutput -> {
        String raw = taskOutput.getRaw();

        // Automated checks
        if (raw == null || raw.isBlank()) {
            return Review.reject("Empty output");
        }
        if (raw.length() < 200) {
            return Review.reject("Output too short -- minimum 200 characters");
        }
        if (raw.contains("I don't know") || raw.contains("I cannot")) {
            return Review.reject("Agent declined the task");
        }

        // Passed automated checks -- proceed to human review
        return Review.skip();
    })
    .reviewHandler(this::humanReview)
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();
```

The flow is:
- Task completes.
- `beforeReview()` runs automated checks.
  - If it returns `Review.reject()`, the task re-executes. No human is bothered.
  - If it returns `Review.skip()`, the output passes to the human `reviewHandler()`.
  - If it returns `Review.approve()`, the output is accepted without human review.
- `reviewHandler()` presents the output to a human (if `beforeReview` didn’t already decide).
This pattern keeps humans focused on judgment calls, not on catching obvious failures that a simple check could handle.
Rejection and Re-Execution
When a review is rejected — whether by `beforeReview` or the human reviewer — the framework re-executes the task with the rejection reason injected as additional context. The agent sees:
> Previous attempt was rejected. Reason: "Output too short -- minimum 200 characters"
This gives the agent a chance to correct its approach. It’s not just a retry — the agent has feedback on what went wrong.
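As a rough sketch of what that feedback injection might look like, here is a hypothetical helper that folds a rejection reason into the retry prompt. `RejectionFeedback.withFeedback` and the exact wording are illustrative; AgentEnsemble's internal prompt format is not shown in this post:

```java
// Illustrative only: combine the original task description with the
// rejection reason so the retry attempt knows what to fix.
class RejectionFeedback {

    static String withFeedback(String originalTask, String rejectionReason) {
        return originalTask
            + "\n\nPrevious attempt was rejected. Reason: \"" + rejectionReason + "\""
            + "\nRevise your output to address this feedback.";
    }
}
```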
You can limit the number of review cycles to prevent infinite loops:
```java
Task criticalTask = Task.builder()
    .description("Write the executive summary")
    .expectedOutput("A concise, accurate summary")
    .agent(writer)
    .maxOutputRetries(3)  // max 3 re-executions after rejection
    .build();
```

If the output is still rejected after 3 attempts, the task fails with a clear error.
Patterns for Production Review Workflows
Pattern 1: Console Review (Development)
Good for testing and debugging:
```java
.reviewHandler(taskOutput -> {
    System.out.println(taskOutput.getRaw());
    System.out.print("Approve? (y/n): ");
    return scanner.nextLine().equals("y")
        ? Review.approve()
        : Review.reject("Rejected by developer");
})
```

Pattern 2: REST API Review (Production)
Block the ensemble until an external approval system responds:
```java
.reviewHandler(taskOutput -> {
    // Submit for review
    String reviewId = approvalService.submitForReview(
        taskOutput.getTaskDescription(),
        taskOutput.getRaw()
    );

    // Poll until a decision is made
    ReviewDecision decision = approvalService.awaitDecision(reviewId);

    return switch (decision.status()) {
        case APPROVED -> Review.approve();
        case REJECTED -> Review.reject(decision.reason());
        case EDITED -> Review.edit(decision.correctedOutput());
    };
})
```

Pattern 3: Slack/Teams Notification
Send a message and wait for a reaction:
```java
.reviewHandler(taskOutput -> {
    String messageId = slack.postMessage(
        "#agent-reviews",
        formatForSlack(taskOutput)
    );

    // Block until thumbs-up or thumbs-down reaction
    SlackReaction reaction = slack.awaitReaction(messageId, Duration.ofMinutes(30));

    return reaction.isPositive()
        ? Review.approve()
        : Review.reject("Rejected via Slack");
})
```

Pattern 4: Automated-Only Review
Skip the human entirely — use `beforeReview` for automated quality gates:
```java
.beforeReview(taskOutput -> {
    QualityScore score = qualityChecker.evaluate(taskOutput.getRaw());

    if (score.overall() >= 0.8) {
        return Review.approve();  // good enough, no human needed
    } else if (score.overall() >= 0.5) {
        return Review.skip();     // borderline, send to human
    } else {
        return Review.reject("Quality score too low: " + score.overall());
    }
})
.reviewHandler(this::humanReviewForBorderlineCases)
```

Pattern 5: Tiered Review by Task
Vary review intensity per task by inspecting the task inside the handler:
```java
// Critical task -- always reviewed
Task customerEmail = Task.builder()
    .description("Draft a response to the customer complaint")
    .expectedOutput("Professional, empathetic email response")
    .agent(writer)
    .build();

// Internal task -- skip review
Task internalSummary = Task.builder()
    .description("Summarize the complaint for internal tracking")
    .expectedOutput("Brief internal summary")
    .agent(writer)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(customerEmail, internalSummary)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        // Only review customer-facing tasks
        if (taskOutput.getTaskDescription().contains("customer")) {
            return humanReview(taskOutput);
        }
        return Review.approve();  // skip internal tasks
    })
    .reviewPolicy(ReviewPolicy.REVIEW_ALL)
    .build()
    .run();
```

Combining Review with Other Production Features
Review gates compose naturally with other AgentEnsemble features:
Review + Guardrails
Guardrails catch invalid content at the agent level. Review catches quality issues at the workflow level.
```java
Agent writer = Agent.builder()
    .role("Content Writer")
    .goal("Write marketing copy")
    .outputGuardrail(output -> {
        if (containsPII(output)) {
            return GuardrailResult.reject("Output contains PII");
        }
        return GuardrailResult.accept();
    })
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .beforeReview(this::automatedQualityCheck)
    .reviewHandler(this::humanReview)
    .build()
    .run();
```

The execution flow is: agent runs -> guardrail validates -> pre-flight check -> human review. Each layer catches different classes of problems.
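The layering above can be sketched framework-free as an ordered chain of checks, where the first failing layer short-circuits. `ValidationChain` is illustrative only, not AgentEnsemble API:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative only: each validation layer either passes (empty Optional)
// or returns a rejection reason; the first rejection short-circuits.
class ValidationChain {
    private final List<Function<String, Optional<String>>> layers;

    ValidationChain(List<Function<String, Optional<String>>> layers) {
        this.layers = layers;
    }

    /** Returns the first rejection reason, or empty if every layer passes. */
    Optional<String> validate(String output) {
        for (Function<String, Optional<String>> layer : layers) {
            Optional<String> rejection = layer.apply(output);
            if (rejection.isPresent()) {
                return rejection;
            }
        }
        return Optional.empty();
    }
}
```

In this framing, a guardrail, a pre-flight check, and a human reviewer are all just layers; ordering them from cheapest to most expensive keeps human attention for the cases that survive the automated checks.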
Review + Callbacks
Track review decisions alongside other execution events:
```java
Ensemble.builder()
    .agents(writer)
    .tasks(writeTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        Review decision = humanReview(taskOutput);
        auditLog.record(
            taskOutput.getTaskDescription(),
            decision.getType(),
            decision.getReason()
        );
        return decision;
    })
    .listener(event -> {
        if (event instanceof TaskCompleteEvent e) {
            metrics.recordTaskCompletion(e);
        }
    })
    .build()
    .run();
```

Review + Structured Output
Review typed output, not raw strings:
```java
record ProposalDraft(
    String title,
    String executiveSummary,
    List<String> keyPoints,
    double estimatedBudget
) {}

Task proposalTask = Task.builder()
    .description("Draft a project proposal")
    .expectedOutput("Structured proposal")
    .agent(writer)
    .outputType(ProposalDraft.class)
    .build();

Ensemble.builder()
    .agents(writer)
    .tasks(proposalTask)
    .chatLanguageModel(model)
    .reviewHandler(taskOutput -> {
        ProposalDraft draft = taskOutput
            .getStructuredOutput(ProposalDraft.class);

        // Review specific fields
        if (draft.estimatedBudget() > 100_000) {
            return Review.reject("Budget exceeds approval threshold");
        }
        if (draft.keyPoints().size() < 3) {
            return Review.reject("Need at least 3 key points");
        }
        return Review.approve();
    })
    .build()
    .run();
```

The Design Philosophy
Human-in-the-loop isn’t an escape hatch for when agents fail. It’s a first-class architectural decision. The best agent systems are designed with human review points from the start, not retrofitted when something goes wrong in production.
AgentEnsemble makes this easy by treating review as a builder method, not a separate system. Same API, same execution flow, same observability. A human reviewer is just another step in the pipeline.
Get started:
- Documentation — guides, examples, and API reference
- Review Guide — full API reference for human-in-the-loop
- Getting Started — up and running in 5 minutes
- GitHub — source, issues, and contributions
AgentEnsemble is MIT-licensed and available on GitHub.