Evaluate LLM Output Quality
Score AI-generated answers on helpfulness, relevance, faithfulness, clarity, and safety. Detect hallucinations, flag policy risks, and recommend human review when needed.
LLM Output Evaluator & Hallucination Checker
This workflow rigorously evaluates the quality and reliability of AI-generated answers. It checks faithfulness, helpfulness, clarity, relevance, and safety while spotting hallucinated or unsupported claims and assessing policy risks.
It does five things:
- Takes an original user prompt, optional context documents, and an AI-generated answer as inputs.
- Identifies factual claims within the answer and verifies them against context documents or common knowledge.
- Scores the answer on helpfulness, relevance, grounding (faithfulness), clarity, and confidence.
- Flags suspected hallucinations, compiles unsupported claims, and assesses policy risk.
- Determines whether the output needs human review based on strict safety and accuracy criteria.
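For orientation, here is a minimal sketch of what a single input item might look like. The field names are illustrative assumptions, not Needle's actual schema:

```typescript
// Illustrative input shape only; real field names depend on your setup.
interface EvaluationInput {
  prompt: string;       // the original user prompt
  context?: string[];   // optional reference documents
  answer: string;       // the AI-generated answer to evaluate
}

const example: EvaluationInput = {
  prompt: "What is our refund window?",
  context: ["Refunds are accepted within 30 days of purchase."],
  answer: "You can request a refund within 30 days.",
};
```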
What You Need
- A Needle platform account with access to the AI model connector.
- Optional context documents or a retrieval system to provide reference information.
- The ability to enable web search and browsing tools, if you want fact-check mode to verify claims online.
- (Optional) A Google Sheets connector if you want to log evaluation results.
- (Optional) A Slack connector if you want alerts on FAIL or REVIEW verdicts.
Toggleable Features
All features below are controlled via workflow variables and default to OFF for token efficiency:
| Variable | Purpose |
|---|---|
| ENABLE_BATCH_MODE | Process arrays of evaluations in parallel |
| ENABLE_FACT_CHECK | Use web search to verify suspicious factual claims |
| ENABLE_MULTI_MODEL | Run dual-model consensus evaluation for higher confidence |
| ENABLE_SHEETS_LOG | Log evaluation results to Google Sheets |
| ENABLE_SLACK_ALERT | Send Slack alerts on FAIL or REVIEW verdicts |
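If you wire these up in a code step, the flag block might look like the following sketch. The variable names come from the table above; everything else is illustrative:

```typescript
// Hypothetical flag block mirroring the workflow variables above.
// All default to false (OFF) to keep token usage low.
const flags = {
  ENABLE_BATCH_MODE: false,   // accept an array of items instead of one
  ENABLE_FACT_CHECK: false,   // allow web search during evaluation
  ENABLE_MULTI_MODEL: false,  // add a second evaluator for consensus
  ENABLE_SHEETS_LOG: false,   // append results to Google Sheets
  ENABLE_SLACK_ALERT: false,  // notify on FAIL or REVIEW verdicts
};
```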
How the Flow Works
- Input Processing — Parses the incoming data (prompt, context, and AI-generated answer) and reads all feature flags. Normalizes input to an array when batch mode is enabled and passes flags downstream with each item (a normalization sketch follows this list).
- Primary AI Evaluator — Acts as a strict output evaluator (not a chatbot). It identifies every factual claim, verifies claims against context or the web (when fact-check is enabled), scores the answer on five dimensions, and flags hallucinations and policy risks. Returns structured evaluation data.
- Secondary AI Evaluator (Optional) — When multi-model mode is enabled, a second AI model independently evaluates the same answer. When disabled, this step is skipped to save tokens.
- Score Aggregation & Finalization — Merges scores from one or both evaluators using weighted averages. Computes a composite score, determines a PASS/REVIEW/FAIL verdict, takes the union of hallucinated claims from the evaluators, and calculates model agreement deltas when multi-model is active.
- Google Sheets Logging (Optional) — When enabled, flattens the evaluation into a spreadsheet row and upserts it to Google Sheets.
- Slack Alerting (Optional) — When enabled and the verdict is FAIL or REVIEW, sends a formatted alert message to your configured Slack channel.
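The normalization in step 1 can be sketched as follows. This is a hedged illustration assuming a generic payload shape, not the workflow's exact variables:

```typescript
// Sketch of the input-processing step. All names are illustrative.
type Flags = Record<string, boolean>;

// Wrap a single item in a one-element array unless batch mode already
// supplies an array, then attach the feature flags to every item so
// downstream steps can read them.
function normalizeInput(payload: unknown, flags: Flags): object[] {
  const items =
    flags.ENABLE_BATCH_MODE && Array.isArray(payload) ? payload : [payload];
  return items.map((item) => ({ ...(item as object), flags }));
}
```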
Scoring Dimensions
| Dimension | What It Measures | Scale |
|---|---|---|
| Helpfulness | Is the answer useful, complete, and actionable? | 1–5 |
| Relevance | Does the answer directly address what was asked? | 1–5 |
| Grounding | How faithful is the answer to the provided context? | 1–5 |
| Clarity | Is the answer well-structured and unambiguous? | 1–5 |
| Confidence | How confident is the evaluator in its own judgment? | 1–5 |
Scores are combined using weighted averages (grounding weighted highest at 35%) to produce a composite score.
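As a concrete sketch: only the 35% grounding weight is documented here, so the remaining weights below are illustrative assumptions chosen to sum to 1.0.

```typescript
// Weighted composite sketch. Only the grounding weight is documented;
// the other four weights are assumptions for illustration.
const weights = {
  grounding: 0.35,   // documented: weighted highest
  helpfulness: 0.2,  // assumed
  relevance: 0.2,    // assumed
  clarity: 0.15,     // assumed
  confidence: 0.1,   // assumed
};

type Scores = Record<keyof typeof weights, number>; // each on a 1–5 scale

function composite(scores: Scores): number {
  return (Object.keys(weights) as (keyof typeof weights)[]).reduce(
    (sum, dim) => sum + weights[dim] * scores[dim],
    0,
  );
}
```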
Verdict Thresholds
- PASS — Composite score ≥ 4.0 with no hallucinations or policy risks.
- REVIEW — Hallucinations detected, any score ≤ 2, or composite below 4.0.
- FAIL — Severe policy risk, or hallucinations combined with grounding score ≤ 2.
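A sketch of how these rules might translate into code follows. Two assumptions are worth flagging: FAIL is checked before REVIEW (the only ordering under which the rules are mutually consistent), and a mild policy risk falls to REVIEW by elimination, since it blocks PASS but is not severe enough to FAIL.

```typescript
type Verdict = "PASS" | "REVIEW" | "FAIL";
type PolicyRisk = "none" | "mild" | "severe";

function decideVerdict(
  scores: number[],        // all five dimension scores, each 1 to 5
  composite: number,       // weighted composite score
  hallucination: boolean,  // true if any unsupported claim was found
  grounding: number,       // the grounding dimension score
  policyRisk: PolicyRisk,
): Verdict {
  // FAIL first: severe policy risk, or hallucinations with weak grounding.
  if (policyRisk === "severe" || (hallucination && grounding <= 2)) {
    return "FAIL";
  }
  // REVIEW: any remaining hallucination, any very low score, a composite
  // below 4.0, or a mild policy risk (assumed: mild risk blocks PASS).
  if (
    hallucination ||
    scores.some((s) => s <= 2) ||
    composite < 4.0 ||
    policyRisk !== "none"
  ) {
    return "REVIEW";
  }
  return "PASS";
}
```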
Output
At the end, you get a detailed structured evaluation including:
- Numerical scores (1–5) for helpfulness, relevance, grounding, clarity, and confidence.
- A weighted composite score.
- A verdict of PASS, REVIEW, or FAIL.
- A boolean flag indicating if hallucination is suspected.
- An array listing each hallucinated or unsupported claim.
- Policy risk level (none, mild, or severe).
- A flag indicating if human review is recommended.
- A concise overall reasoning paragraph explaining the evaluation.
- Metadata about the evaluator model(s) used and scoring weights.
- Model agreement deltas (when multi-model mode is enabled).
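Put together, a single evaluation result might look like the illustrative object below. Field names are assumptions based on the list above, not the exact schema:

```typescript
// Example result for a passing answer. All names are illustrative.
const exampleResult = {
  scores: { helpfulness: 4, relevance: 5, grounding: 4, clarity: 4, confidence: 4 },
  composite: 4.2,                 // weighted average of the five scores
  verdict: "PASS",
  hallucinationSuspected: false,
  unsupportedClaims: [] as string[],
  policyRisk: "none",
  humanReviewRecommended: false,
  reasoning: "The answer is grounded in the provided context and directly addresses the question.",
  meta: { evaluators: ["primary-model"], weights: { grounding: 0.35 } },
};
```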
Notes
- When fact-check mode is enabled, the evaluator uses web search and browsing tools to verify suspicious factual claims beyond the provided context.
- Hallucination detection is strict: even one unsupported claim triggers the hallucination flag.
- Low scores or potential policy risks automatically trigger a human-review recommendation, so questionable outputs are never approved unattended.
- The workflow supports batch processing and multi-model evaluation for robust, scalable quality control.
- Customize the evaluation rubric and thresholds by modifying the evaluator prompt or score aggregation code.
Want to showcase your own workflows?
Become a Needle workflow partner and turn your expertise into recurring revenue.
