
How We Built a Production Hallucination Detection Pipeline

April 14, 2026 · ImmovableTech Team

LLM
LangGraph
Hallucination Detection
Production AI

The Problem: Content at Scale Needs Fact-Checking

When an enterprise content platform processes tens of thousands of articles, social posts, and marketing assets every day, small inaccuracies compound fast. A single hallucinated statistic in a published piece erodes reader trust. Multiply that across 30,000+ pieces daily and you have a brand-risk problem that manual review cannot solve.

Our client needed an automated verification layer that could sit between content generation and publication, flag factual inconsistencies, and do it without slowing down the editorial pipeline.

Why We Chose a Multi-Agent Architecture

The obvious first approach — run each piece through a single LLM prompt and ask “is this factually correct?” — fails in practice for three reasons:

Monolithic prompts hallucinate about hallucinations. A single model evaluating its own outputs (or outputs from a similar model) tends to rubber-stamp plausible-sounding claims.
Different verification tasks need different strategies. Checking a statistical claim requires source retrieval. Checking logical consistency requires structural reasoning. Checking attribution requires entity resolution. One prompt cannot do all three well.
Latency budgets vary by content type. A 500-word social post needs sub-second verification. A 5,000-word report can tolerate 10 seconds. A monolithic pipeline cannot adapt.

We chose LangGraph because it gave us explicit control over the agent execution graph — which stages run in parallel, which are conditional, and where human review gets injected.

Where Agents Are in 2026

We built this pipeline before MCP (Model Context Protocol) became a de facto standard for tool-calling agents; it now has over 97 million SDK downloads and sits under Linux Foundation governance. If we were rebuilding today, we’d structure the retrieval and scoring agents as MCP tool servers rather than tightly coupled LangGraph nodes. That would let any MCP-compatible client (Claude Desktop, a custom orchestrator, a third-party integration) invoke our verification tools without writing model-specific function-calling code. We’d also look seriously at the A2A (Agent2Agent) protocol, originally developed by Google, for the inter-agent communication layer: the claim extraction agent handing off to the retrieval agent is exactly the kind of structured agent-to-agent handoff that A2A was designed for. The core architecture (decomposed agents with explicit routing) holds up well. The wiring between them is what we’d modernise.

The Pipeline: Three Stages, Five Agents

Stage 1: Claim Extraction

The first agent parses incoming content into discrete, verifiable claims. A 2,000-word article might yield 15-40 claims ranging from “Company X reported $2.3B in Q3 revenue” to “This approach reduces latency by 40%.”

We used structured output parsing with a strict JSON schema so every downstream agent receives claims in a consistent format — claim text, claim type (statistical, causal, attributive), confidence that the claim is actually a factual assertion (vs. opinion), and the source sentence.
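
The claim format described above could be expressed as a small typed record plus a validator. This is a minimal sketch using only the standard library. The field names (`text`, `claim_type`, `assertion_confidence`, `source_sentence`) and the `parse_claim` helper are illustrative assumptions, not the production schema.

```python
import json
from dataclasses import dataclass

# Claim types named in the post; the exact string values are assumed.
CLAIM_TYPES = {"statistical", "causal", "attributive"}


@dataclass(frozen=True)
class Claim:
    text: str                    # the claim itself
    claim_type: str              # one of CLAIM_TYPES
    assertion_confidence: float  # 0..1: factual assertion rather than opinion
    source_sentence: str         # sentence the claim was extracted from


def parse_claim(raw: str) -> Claim:
    """Parse one JSON object emitted by the extraction model and validate it."""
    data = json.loads(raw)
    claim = Claim(
        text=str(data["text"]),
        claim_type=str(data["claim_type"]),
        assertion_confidence=float(data["assertion_confidence"]),
        source_sentence=str(data["source_sentence"]),
    )
    if claim.claim_type not in CLAIM_TYPES:
        raise ValueError(f"unknown claim_type: {claim.claim_type!r}")
    if not 0.0 <= claim.assertion_confidence <= 1.0:
        raise ValueError("assertion_confidence must be between 0 and 1")
    return claim
```

Rejecting malformed claims at this boundary means the retrieval and scoring agents never have to defend against missing fields.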

Stage 2: Source Retrieval and Cross-Reference

For each extracted claim, a retrieval agent searches a curated knowledge base and, where permitted, the open web. We built three retrieval strategies:

Vector similarity search against an internal knowledge base (Pinecone) for domain-specific facts
Structured database lookups for financial figures, dates, and named entities
Web search fallback for claims that reference recent events or external data

The retrieval agent returns ranked evidence passages with relevance scores. Claims with no retrievable evidence get flagged as “unverifiable” rather than “false” — an important distinction that reduced false-positive rates by 34%.
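
One way to picture the retrieval step is as an ordered cascade that ends in "unverifiable" when nothing is found. The sketch below is an assumption-heavy illustration: the `Evidence` record, the `retrieve` function, the `min_relevance` cutoff, and the "first strategy with usable hits wins" policy are all hypothetical, and the real strategies (Pinecone, database lookups, web search) are represented by plain callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Evidence:
    passage: str
    relevance: float  # higher means more relevant
    source: str       # which strategy produced the passage


# A retriever takes the claim text and returns zero or more evidence passages.
Retriever = Callable[[str], List[Evidence]]


def retrieve(
    claim_text: str,
    retrievers: Sequence[Retriever],
    min_relevance: float = 0.5,  # hypothetical cutoff
) -> dict:
    """Try each strategy in order and stop at the first with usable evidence."""
    for retriever in retrievers:
        hits = [e for e in retriever(claim_text) if e.relevance >= min_relevance]
        if hits:
            ranked = sorted(hits, key=lambda e: e.relevance, reverse=True)
            return {"status": "evidence_found", "evidence": ranked}
    # No strategy produced evidence: report "unverifiable", never "false".
    return {"status": "unverifiable", "evidence": []}
```

Keeping the "unverifiable" outcome inside the retrieval result means the scoring stage cannot mistake missing evidence for contradicting evidence.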

Stage 3: Factuality Scoring

The scoring agent takes each claim paired with its retrieved evidence and produces a factuality verdict: confirmed, contradicted, partially supported, or unverifiable. It also generates a human-readable explanation citing the specific evidence that supports or contradicts the claim.
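
The four verdicts and the accompanying explanation could be modelled roughly as follows. The class and field names are hypothetical, and the rule that every verdict other than "unverifiable" must cite evidence is an assumption inferred from the description, not a documented constraint.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Verdict(str, Enum):
    CONFIRMED = "confirmed"
    CONTRADICTED = "contradicted"
    PARTIALLY_SUPPORTED = "partially_supported"
    UNVERIFIABLE = "unverifiable"


@dataclass
class ScoredClaim:
    claim_text: str
    verdict: Verdict
    explanation: str  # human-readable reasoning shown to editors
    cited_evidence: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed rule: a verdict that takes a position must point at evidence.
        if self.verdict is not Verdict.UNVERIFIABLE and not self.cited_evidence:
            raise ValueError("verdict requires at least one cited evidence passage")
```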

We fine-tuned this stage on 8,000 manually labelled claim-evidence pairs from the client’s domain. The fine-tuned model outperformed zero-shot GPT-4 by 11 percentage points on our evaluation set.

Key Engineering Decisions

Batching over streaming. Content arrives in bursts. We batch claims into groups of 50 for retrieval and scoring, which reduced per-claim latency from 3.2 seconds to 0.8 seconds (a 4x improvement) by amortising embedding and API call overhead.
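
A minimal sketch of the batching idea, assuming a hypothetical `score_batch` callable that handles a whole group of claims in one round trip. Only the batch size of 50 comes from the post; the helper names are illustrative.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int = 50) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def process_in_batches(
    claims: Sequence[T],
    score_batch: Callable[[Sequence[T]], List[R]],  # hypothetical batch scorer
    size: int = 50,
) -> List[R]:
    """Call the scorer once per batch instead of once per claim."""
    results: List[R] = []
    for batch in batched(claims, size):
        results.extend(score_batch(batch))
    return results
```

With 120 claims this makes three calls (50, 50, and 20 claims) rather than 120, which is where the amortised overhead comes from.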

Async pipeline with Redis queues. Each stage runs independently. If the retrieval service is slow, claim extraction continues filling the queue. This decoupling let us scale each stage independently and eliminated cascading timeouts.
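
The decoupling can be illustrated with in-process queues. Note the substitution: this sketch uses `asyncio.Queue` as a stand-in for Redis so that it runs without external services, and the `extract` and `retrieve` handlers are trivial placeholders rather than the real agents.

```python
import asyncio
from typing import Any, Awaitable, Callable, List

_STOP = object()  # sentinel that tells a worker to shut down

Handler = Callable[[Any], Awaitable[Any]]


async def stage_worker(inbox: asyncio.Queue, outbox: asyncio.Queue, handler: Handler) -> None:
    """Pull items from the inbox, process them, and push results downstream."""
    while True:
        item = await inbox.get()
        if item is _STOP:
            await outbox.put(_STOP)  # propagate shutdown to the next stage
            return
        await outbox.put(await handler(item))


async def run_pipeline(contents: List[str]) -> List[dict]:
    extract_q: asyncio.Queue = asyncio.Queue()
    retrieve_q: asyncio.Queue = asyncio.Queue()
    done_q: asyncio.Queue = asyncio.Queue()

    async def extract(content: str) -> dict:
        return {"content": content, "claims": [content]}  # placeholder extraction

    async def retrieve(item: dict) -> dict:
        await asyncio.sleep(0.01)  # simulate a slow retrieval service
        return {**item, "evidence": []}

    workers = [
        asyncio.create_task(stage_worker(extract_q, retrieve_q, extract)),
        asyncio.create_task(stage_worker(retrieve_q, done_q, retrieve)),
    ]
    for content in contents:
        await extract_q.put(content)  # enqueueing never waits on retrieval
    await extract_q.put(_STOP)
    await asyncio.gather(*workers)

    results: List[dict] = []
    while (item := await done_q.get()) is not _STOP:
        results.append(item)
    return results
```

Because extraction only writes to a queue, a slow retrieval stage shows up as queue depth rather than as a timeout upstream.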

Confidence thresholds, not binary gates. Rather than blocking all content below a threshold, we route to three paths: auto-publish (high confidence), editor review (medium confidence), and auto-hold (low confidence). This preserved editorial velocity — 78% of content passes through without human intervention.
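
The three-way routing might look like the sketch below. The threshold values are hypothetical (the post does not state the production cutoffs), and how per-claim verdicts are combined into a single confidence score per piece is left out because it is not described.

```python
# Hypothetical cutoffs; the production values are not given in the post.
AUTO_PUBLISH_MIN = 0.90
EDITOR_REVIEW_MIN = 0.60


def route(confidence: float) -> str:
    """Map a piece-level confidence score to one of the three paths."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= AUTO_PUBLISH_MIN:
        return "auto_publish"
    if confidence >= EDITOR_REVIEW_MIN:
        return "editor_review"
    return "auto_hold"
```

The return values match the route names used in the verification graph, so a function of this shape could serve as the router for the conditional edges.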

Continuous calibration. Editor corrections feed back into the training set. Every month we re-evaluate the scoring model against the latest corrections and retrain if accuracy drops below 90%.
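
The monthly check reduces to comparing model verdicts against editor corrections and testing the result against the 90% floor. A minimal sketch, with hypothetical function names and verdicts represented as plain strings:

```python
from typing import Sequence


def accuracy(model_verdicts: Sequence[str], editor_verdicts: Sequence[str]) -> float:
    """Fraction of claims where the model agreed with the editor's final verdict."""
    if len(model_verdicts) != len(editor_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not model_verdicts:
        raise ValueError("need at least one labelled claim")
    matches = sum(m == e for m, e in zip(model_verdicts, editor_verdicts))
    return matches / len(model_verdicts)


def needs_retraining(
    model_verdicts: Sequence[str],
    editor_verdicts: Sequence[str],
    floor: float = 0.90,  # retrain when accuracy drops below this
) -> bool:
    return accuracy(model_verdicts, editor_verdicts) < floor
```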

Under the Hood: LangGraph Agent Definition

Here’s a simplified version of how the verification graph is wired together:

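from typing import List, TypedDict

from langgraph.graph import StateGraph, END

class VerificationState(TypedDict):
    # Shared state; each node returns a partial update that LangGraph merges in.
    content: str
    claims: List[dict]
    evidence: List[dict]
    verdicts: List[dict]

# Placeholder agents so the snippet runs on its own. The production agents call
# models and retrieval services; these stubs only illustrate the node contract.
def extract_claims_agent(state: VerificationState) -> dict:
    return {"claims": []}

def retrieve_evidence_agent(state: VerificationState) -> dict:
    return {"evidence": []}

def score_factuality_agent(state: VerificationState) -> dict:
    return {"verdicts": []}

def route_decision_agent(state: VerificationState) -> dict:
    return {"verdicts": state["verdicts"]}

def decide_next(state: VerificationState) -> str:
    # Placeholder router; the real one applies confidence thresholds.
    return "editor_review"

graph = StateGraph(VerificationState)
graph.add_node("extract_claims", extract_claims_agent)
graph.add_node("retrieve_evidence", retrieve_evidence_agent)
graph.add_node("score_factuality", score_factuality_agent)
graph.add_node("route_decision", route_decision_agent)

# Linear flow: extract -> retrieve -> score -> route.
graph.set_entry_point("extract_claims")
graph.add_edge("extract_claims", "retrieve_evidence")
graph.add_edge("retrieve_evidence", "score_factuality")
graph.add_edge("score_factuality", "route_decision")
# All three outcomes end the graph here; the caller acts on the chosen route.
graph.add_conditional_edges("route_decision", decide_next, {
    "auto_publish": END,
    "editor_review": END,
    "auto_hold": END,
})

pipeline = graph.compile()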

The key thing to notice: the control flow is explicit. The order in which nodes run is fixed by the graph, even though the model calls inside each node are not deterministic. Unlike prompt-chained agent systems, there’s no ambiguity about execution order, and when something fails, we know exactly where.

Results

Six weeks from kickoff to production, the pipeline delivers:

30,000+ content pieces verified daily without editorial bottleneck
92% factuality accuracy on our held-out evaluation set
78% auto-publish rate — most content passes through without manual review
34% reduction in false positives compared to the single-prompt baseline we benchmarked against

Lessons Learned

Start with evaluation, not architecture. We spent the first week building the labelled evaluation dataset before writing a single line of pipeline code. Every architecture decision was tested against this dataset, which prevented us from shipping a system that “felt right” but scored poorly.

Multi-agent is not always better. We initially had seven agents. We merged two and eliminated one when analysis showed they added latency without improving accuracy. The final five-agent graph was the result of pruning, not additive design.

The “unverifiable” category is your best friend. Forcing binary true/false verdicts on every claim creates noise. Allowing “unverifiable” as a legitimate outcome made the system more trustworthy to editors, which drove adoption.

What We’d Do Differently in 2026

Use different model families for extraction vs. verification. We’d use GPT-4o for claim extraction and Claude 3.5 Sonnet as the cross-model verifier. Using different model families reduces the chance of correlated hallucinations — if both models hallucinate the same way on the same claim, your verification is worthless. Diverse model families give you genuine independent checks.
Add LangSmith traces from day one. We retrofitted observability in week 4 and immediately found the scoring agent was making 3x more API calls than necessary on long-form content — it was re-embedding evidence passages it had already seen. That’s the kind of waste you only catch with end-to-end tracing, and we lost three weeks of debugging visibility by not instrumenting earlier.
Expose the pipeline as an MCP server. Right now the verification pipeline is only accessible through our custom frontend and internal API. If we rebuilt it, we’d expose it as an MCP tool server so any MCP-compatible client could call it — editors using Claude Desktop, external content platforms, third-party CMS integrations. The pipeline is the product; the interface shouldn’t be the bottleneck.
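
The redundant re-embedding that tracing surfaced is the kind of waste a content-addressed cache removes. The sketch below is illustrative only: `EmbeddingCache` and the `embed_fn` callable are hypothetical names, and the post does not say how the issue was actually fixed.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Memoise embeddings by a hash of the passage text."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn  # hypothetical call to an embedding API
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the underlying API was called

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, text: str) -> List[float]:
        key = self._key(text)
        if key not in self._store:
            self._store[key] = self._embed_fn(text)
            self.misses += 1
        return self._store[key]
```

Exposing a counter such as `misses` gives a trace something concrete to report, so a regression in call volume becomes visible rather than silent.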

This project is part of our AI & Machine Learning Engineering practice.