Evaluate LLM Output Quality

Integrations: OpenAI, Claude, Google Sheets, Slack

Score AI-generated answers on helpfulness, relevance, faithfulness, clarity, and safety. Detect hallucinations, flag policy risks, and recommend human review when needed.

Tags: Hallucination Detection, AI Governance, Quality Assurance, Content Evaluation

LLM Output Evaluator & Hallucination Checker

This workflow rigorously evaluates the quality and reliability of AI-generated answers. It checks faithfulness, helpfulness, clarity, relevance, and safety while spotting hallucinated or unsupported claims and assessing policy risks.

It does five things:

  1. Takes an original user prompt, optional context documents, and an AI-generated answer as inputs.
  2. Identifies factual claims within the answer and verifies them against context documents or common knowledge.
  3. Scores the answer on helpfulness, relevance, grounding (faithfulness), clarity, and confidence.
  4. Flags suspected hallucinations, compiles unsupported claims, and assesses policy risk.
  5. Determines whether the output needs human review based on strict safety and accuracy criteria.

What You Need

  • A Needle platform account with access to the AI model connector.
  • Optional context documents or a retrieval system to provide reference information.
  • The ability to enable web search and browsing tools, if you want fact-check mode to verify claims online.
  • (Optional) A Google Sheets connector if you want to log evaluation results.
  • (Optional) A Slack connector if you want alerts on FAIL or REVIEW verdicts.

Toggleable Features

All features below are controlled via workflow variables and default to OFF for token efficiency:

Variable             Purpose
ENABLE_BATCH_MODE    Process arrays of evaluations in parallel
ENABLE_FACT_CHECK    Use web search to verify suspicious factual claims
ENABLE_MULTI_MODEL   Run dual-model consensus evaluation for higher confidence
ENABLE_SHEETS_LOG    Log evaluation results to Google Sheets
ENABLE_SLACK_ALERT   Send Slack alerts on FAIL or REVIEW verdicts
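For illustration, the flags could be modeled as a typed config object in custom workflow code. This is a sketch only: the variable names come from the table above, but how they are read from Needle workflow variables depends on your setup and is an assumption here.

```ts
// Sketch of the feature flags as a typed config. All default to false (OFF),
// matching the token-efficiency default described above. The mechanism for
// reading these from Needle workflow variables is assumed, not platform API.
interface FeatureFlags {
  ENABLE_BATCH_MODE: boolean;
  ENABLE_FACT_CHECK: boolean;
  ENABLE_MULTI_MODEL: boolean;
  ENABLE_SHEETS_LOG: boolean;
  ENABLE_SLACK_ALERT: boolean;
}

const DEFAULT_FLAGS: FeatureFlags = {
  ENABLE_BATCH_MODE: false,
  ENABLE_FACT_CHECK: false,
  ENABLE_MULTI_MODEL: false,
  ENABLE_SHEETS_LOG: false,
  ENABLE_SLACK_ALERT: false,
};
```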

How the Flow Works

  1. Input Processing — Parses the incoming data (prompt, context, and AI-generated answer) and reads all feature flags. Normalizes input to an array when batch mode is enabled and passes flags downstream with each item (see the normalization sketch after this list).
  2. Primary AI Evaluator — Acts as a strict output evaluator (not a chatbot). It identifies every factual claim, verifies claims against context or the web (when fact-check is enabled), scores the answer on five dimensions, and flags hallucinations and policy risks. Returns structured evaluation data.
  3. Secondary AI Evaluator (Optional) — When multi-model mode is enabled, a second AI model independently evaluates the same answer. When disabled, this step is skipped to save tokens.
  4. Score Aggregation & Finalization — Merges scores from one or both evaluators using weighted averages. Computes a composite score, determines a PASS/REVIEW/FAIL verdict, unions hallucinated claims, and calculates model agreement deltas when multi-model is active.
  5. Google Sheets Logging (Optional) — When enabled, flattens the evaluation into a spreadsheet row and upserts it to Google Sheets.
  6. Slack Alerting (Optional) — When enabled and the verdict is FAIL or REVIEW, sends a formatted alert message to your configured Slack channel.
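As a rough sketch of step 1, the normalization could look like the following. The payload field names (`prompt`, `context`, `answer`) are assumptions for illustration, not the workflow's actual schema.

```ts
// One evaluation request: the original prompt, optional context documents,
// and the AI-generated answer to be judged. Field names are illustrative.
interface EvalInput {
  prompt: string;
  context?: string[];
  answer: string;
}

// Step 1 sketch: always hand downstream steps an array, so each item can be
// evaluated the same way whether batch mode is on or off.
function normalizeInput(payload: EvalInput | EvalInput[], batchMode: boolean): EvalInput[] {
  if (batchMode) return Array.isArray(payload) ? payload : [payload];
  if (Array.isArray(payload)) throw new Error("Array input requires ENABLE_BATCH_MODE");
  return [payload];
}
```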

Scoring Dimensions

Dimension     What It Measures                                       Scale
Helpfulness   Is the answer useful, complete, and actionable?        1–5
Relevance     Does the answer directly address what was asked?       1–5
Grounding     How faithful is the answer to the provided context?    1–5
Clarity       Is the answer well-structured and unambiguous?         1–5
Confidence    How confident is the evaluator in its own judgment?    1–5

Scores are combined using weighted averages (grounding weighted highest at 35%) to produce a composite score.
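Only the grounding weight (35%) is fixed above; the split of the remaining 65% in the sketch below is an assumption made purely to keep the example concrete.

```ts
// Grounding at 0.35 is stated above; the other weights are assumed here for
// illustration. Weights sum to 1.0, so the composite stays on the 1–5 scale.
const WEIGHTS = {
  helpfulness: 0.20,
  relevance: 0.20,
  grounding: 0.35,
  clarity: 0.15,
  confidence: 0.10,
} as const;

type Dimension = keyof typeof WEIGHTS;

// Weighted average of the five 1–5 dimension scores.
function compositeScore(scores: Record<Dimension, number>): number {
  return (Object.keys(WEIGHTS) as Dimension[])
    .reduce((sum, dim) => sum + WEIGHTS[dim] * scores[dim], 0);
}
```

Under these assumed weights, scores of 4 (helpfulness), 5 (relevance), 3 (grounding), 4 (clarity), and 4 (confidence) would give 0.20·4 + 0.20·5 + 0.35·3 + 0.15·4 + 0.10·4 = 3.85, landing in REVIEW territory under the thresholds below.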

Verdict Thresholds

  • PASS — Composite score ≥ 4.0 with no hallucinations or policy risks.
  • REVIEW — Hallucinations detected, any score ≤ 2, or composite below 4.0.
  • FAIL — Severe policy risk, or hallucinations combined with grounding score ≤ 2.
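Translated to code, the thresholds might read as follows, with the FAIL conditions checked first so they take precedence over the overlapping REVIEW conditions. This is a sketch; the field names are illustrative.

```ts
type Verdict = "PASS" | "REVIEW" | "FAIL";
type PolicyRisk = "none" | "mild" | "severe";

function decideVerdict(e: {
  composite: number;              // weighted composite score, 1–5
  scores: number[];               // the five dimension scores
  hallucinationSuspected: boolean;
  policyRisk: PolicyRisk;
  groundingScore: number;
}): Verdict {
  // FAIL: severe policy risk, or hallucinations with grounding <= 2.
  if (e.policyRisk === "severe" || (e.hallucinationSuspected && e.groundingScore <= 2)) {
    return "FAIL";
  }
  // REVIEW: hallucinations, any dimension <= 2, composite below 4.0, or any
  // policy risk at all (PASS requires no policy risks).
  if (
    e.hallucinationSuspected ||
    e.scores.some((s) => s <= 2) ||
    e.composite < 4.0 ||
    e.policyRisk !== "none"
  ) {
    return "REVIEW";
  }
  // PASS: composite >= 4.0 with no hallucinations or policy risks.
  return "PASS";
}
```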

Output

At the end, you get a detailed structured evaluation including:

  • Numerical scores (1–5) for helpfulness, relevance, grounding, clarity, and confidence.
  • A weighted composite score.
  • A verdict of PASS, REVIEW, or FAIL.
  • A boolean flag indicating if hallucination is suspected.
  • An array listing each hallucinated or unsupported claim.
  • Policy risk level (none, mild, or severe).
  • A flag indicating if human review is recommended.
  • A concise overall reasoning paragraph explaining the evaluation.
  • Metadata about the evaluator model(s) used and scoring weights.
  • Model agreement deltas (when multi-model mode is enabled).
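Taken together, the record might map to a shape like this (a sketch under assumed field names, not the workflow's exact schema):

```ts
interface EvaluationRecord {
  scores: {
    helpfulness: number;  // 1–5
    relevance: number;    // 1–5
    grounding: number;    // 1–5
    clarity: number;      // 1–5
    confidence: number;   // 1–5
  };
  composite: number;                         // weighted composite score
  verdict: "PASS" | "REVIEW" | "FAIL";
  hallucinationSuspected: boolean;
  unsupportedClaims: string[];               // each hallucinated or unsupported claim
  policyRisk: "none" | "mild" | "severe";
  humanReviewRecommended: boolean;
  reasoning: string;                         // concise overall explanation
  meta: {
    evaluatorModels: string[];               // evaluator model(s) used
    weights: Record<string, number>;         // scoring weights
  };
  agreementDeltas?: Record<string, number>;  // only when multi-model mode is enabled
}
```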

Notes

  • When fact-check mode is enabled, the evaluator uses web search and browsing tools to verify suspicious factual claims beyond the provided context.
  • Hallucination detection is strict: even one unsupported claim triggers the hallucination flag.
  • Low scores or potential risks automatically trigger a human-review recommendation, so unreliable outputs are not acted on unattended.
  • The workflow supports batch processing and multi-model evaluation for robust, scalable quality control.
  • Customize the evaluation rubric and thresholds by modifying the evaluator prompt or score aggregation code.
