Evaluate LLM Output Quality
Score AI-generated answers on helpfulness, relevance, faithfulness, clarity, and safety. Detect hallucinations, flag policy risks, and recommend human review when needed.
LLM Output Evaluator & Hallucination Checker
This workflow rigorously evaluates the quality and reliability of AI-generated answers. It checks faithfulness, helpfulness, clarity, relevance, and safety while spotting hallucinated or unsupported claims and assessing policy risks.
It does five things:
- Takes an original user prompt, optional context documents, and an AI-generated answer as inputs.
- Identifies factual claims within the answer and verifies them against context documents or common knowledge.
- Scores the answer on helpfulness, relevance, grounding (faithfulness), clarity, and confidence.
- Flags suspected hallucinations, compiles unsupported claims, and assesses policy risk.
- Determines whether the output needs human review based on strict safety and accuracy criteria.
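For orientation, here is a minimal sketch of what a single input item might look like. The field names are illustrative assumptions, not Needle's actual schema:

```typescript
// Illustrative input shape only; real field names depend on your setup.
interface EvaluationInput {
  prompt: string;       // the original user prompt
  context?: string[];   // optional reference documents
  answer: string;       // the AI-generated answer to evaluate
}

const example: EvaluationInput = {
  prompt: "What is our refund window?",
  context: ["Refunds are accepted within 30 days of purchase."],
  answer: "You can request a refund within 30 days.",
};
```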
What You Need
- A Needle platform account with access to the AI model connector.
- Optional context documents or a retrieval system to provide reference information.
- The ability to enable web search and browsing tools, if you want fact-check mode to verify claims online.
- (Optional) A Google Sheets connector if you want to log evaluation results.
- (Optional) A Slack connector if you want alerts on FAIL or REVIEW verdicts.
Toggleable Features
All features below are controlled via workflow variables and default to OFF for token efficiency:
| Variable | Purpose |
|---|---|
| ENABLE_BATCH_MODE | Process arrays of evaluations in parallel |
| ENABLE_FACT_CHECK | Use web search to verify suspicious factual claims |
| ENABLE_MULTI_MODEL | Run dual-model consensus evaluation for higher confidence |
| ENABLE_SHEETS_LOG | Log evaluation results to Google Sheets |
| ENABLE_SLACK_ALERT | Send Slack alerts on FAIL or REVIEW verdicts |
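If you wire these up in a code step, the flag block might look like the following sketch. The variable names come from the table above; everything else is illustrative:

```typescript
// Hypothetical flag block mirroring the workflow variables above.
// All default to false (OFF) to keep token usage low.
const flags = {
  ENABLE_BATCH_MODE: false,   // accept an array of items instead of one
  ENABLE_FACT_CHECK: false,   // allow web search during evaluation
  ENABLE_MULTI_MODEL: false,  // add a second evaluator for consensus
  ENABLE_SHEETS_LOG: false,   // append results to Google Sheets
  ENABLE_SLACK_ALERT: false,  // notify on FAIL or REVIEW verdicts
};
```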
How the Flow Works
- Input Processing — Parses the incoming data (prompt, context, and AI-generated answer) and reads all feature flags. Normalizes input to an array when batch mode is enabled and passes flags downstream with each item (a normalization sketch follows this list).
- Primary AI Evaluator — Acts as a strict output evaluator (not a chatbot). It identifies every factual claim, verifies claims against context or the web (when fact-check is enabled), scores the answer on five dimensions, and flags hallucinations and policy risks. Returns structured evaluation data.
- Secondary AI Evaluator (Optional) — When multi-model mode is enabled, a second AI model independently evaluates the same answer. When disabled, this step is skipped to save tokens.
- Score Aggregation & Finalization — Merges scores from one or both evaluators using weighted averages. Computes a composite score, determines a PASS/REVIEW/FAIL verdict, takes the union of hallucinated claims from the evaluators, and calculates model agreement deltas when multi-model is active.
- Google Sheets Logging (Optional) — When enabled, flattens the evaluation into a spreadsheet row and upserts it to Google Sheets.
- Slack Alerting (Optional) — When enabled and the verdict is FAIL or REVIEW, sends a formatted alert message to your configured Slack channel.
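The normalization in step 1 can be sketched as follows. This is a hedged illustration assuming a generic payload shape, not the workflow's exact variables:

```typescript
// Sketch of the input-processing step. All names are illustrative.
type Flags = Record<string, boolean>;

// Wrap a single item in a one-element array unless batch mode already
// supplies an array, then attach the feature flags to every item so
// downstream steps can read them.
function normalizeInput(payload: unknown, flags: Flags): object[] {
  const items =
    flags.ENABLE_BATCH_MODE && Array.isArray(payload) ? payload : [payload];
  return items.map((item) => ({ ...(item as object), flags }));
}
```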
Scoring Dimensions
| Dimension | What It Measures | Scale |
|---|---|---|
| Helpfulness | Is the answer useful, complete, and actionable? | 1–5 |
| Relevance | Does the answer directly address what was asked? | 1–5 |
| Grounding | How faithful is the answer to the provided context? | 1–5 |
| Clarity | Is the answer well-structured and unambiguous? | 1–5 |
| Confidence | How confident is the evaluator in its own judgment? | 1–5 |
Scores are combined using weighted averages (grounding weighted highest at 35%) to produce a composite score.
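As a concrete sketch: only the 35% grounding weight is documented here, so the remaining weights below are illustrative assumptions chosen to sum to 1.0.

```typescript
// Weighted composite sketch. Only the grounding weight is documented;
// the other four weights are assumptions for illustration.
const weights = {
  grounding: 0.35,   // documented: weighted highest
  helpfulness: 0.2,  // assumed
  relevance: 0.2,    // assumed
  clarity: 0.15,     // assumed
  confidence: 0.1,   // assumed
};

type Scores = Record<keyof typeof weights, number>; // each on a 1–5 scale

function composite(scores: Scores): number {
  return (Object.keys(weights) as (keyof typeof weights)[]).reduce(
    (sum, dim) => sum + weights[dim] * scores[dim],
    0,
  );
}
```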
Verdict Thresholds
- PASS — Composite score ≥ 4.0 with no hallucinations or policy risks.
- REVIEW — Hallucinations detected, any score ≤ 2, or composite below 4.0.
- FAIL — Severe policy risk, or hallucinations combined with grounding score ≤ 2.
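A sketch of how these rules might translate into code follows. Two assumptions are worth flagging: FAIL is checked before REVIEW (the only ordering under which the rules are mutually consistent), and a mild policy risk falls to REVIEW by elimination, since it blocks PASS but is not severe enough to FAIL.

```typescript
type Verdict = "PASS" | "REVIEW" | "FAIL";
type PolicyRisk = "none" | "mild" | "severe";

function decideVerdict(
  scores: number[],        // all five dimension scores, each 1 to 5
  composite: number,       // weighted composite score
  hallucination: boolean,  // true if any unsupported claim was found
  grounding: number,       // the grounding dimension score
  policyRisk: PolicyRisk,
): Verdict {
  // FAIL first: severe policy risk, or hallucinations with weak grounding.
  if (policyRisk === "severe" || (hallucination && grounding <= 2)) {
    return "FAIL";
  }
  // REVIEW: any remaining hallucination, any very low score, a composite
  // below 4.0, or a mild policy risk (assumed: mild risk blocks PASS).
  if (
    hallucination ||
    scores.some((s) => s <= 2) ||
    composite < 4.0 ||
    policyRisk !== "none"
  ) {
    return "REVIEW";
  }
  return "PASS";
}
```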
Output
At the end, you get a detailed structured evaluation including:
- Numerical scores (1–5) for helpfulness, relevance, grounding, clarity, and confidence.
- A weighted composite score.
- A verdict of PASS, REVIEW, or FAIL.
- A boolean flag indicating if hallucination is suspected.
- An array listing each hallucinated or unsupported claim.
- Policy risk level (none, mild, or severe).
- A flag indicating if human review is recommended.
- A concise overall reasoning paragraph explaining the evaluation.
- Metadata about the evaluator model(s) used and scoring weights.
- Model agreement deltas (when multi-model mode is enabled).
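Put together, a single evaluation result might look like the illustrative object below. Field names are assumptions based on the list above, not the exact schema:

```typescript
// Example result for a passing answer. All names are illustrative.
const exampleResult = {
  scores: { helpfulness: 4, relevance: 5, grounding: 4, clarity: 4, confidence: 4 },
  composite: 4.2,                 // weighted average of the five scores
  verdict: "PASS",
  hallucinationSuspected: false,
  unsupportedClaims: [] as string[],
  policyRisk: "none",
  humanReviewRecommended: false,
  reasoning: "The answer is grounded in the provided context and directly addresses the question.",
  meta: { evaluators: ["primary-model"], weights: { grounding: 0.35 } },
};
```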
Notes
- When fact-check mode is enabled, the evaluator uses web search and browsing tools to verify suspicious factual claims beyond the provided context.
- Hallucination detection is strict: even one unsupported claim triggers the hallucination flag.
- Low scores or potential policy risks automatically trigger a human-review recommendation, so questionable outputs are never approved unattended.
- The workflow supports batch processing and multi-model evaluation for robust, scalable quality control.
- Customize the evaluation rubric and thresholds by modifying the evaluator prompt or score aggregation code.
Want to showcase your own workflows?
Become a Needle workflow partner and turn your expertise into recurring revenue.
