Hallucination Detection & Content Verification

LLM Safety at Scale.

30K+ pieces of LLM-generated content a day, each one checked by a frontier extractor and a reasoning-model verifier before it reaches a reader. 92.4% factuality F1, sub-350 ms p95 per claim, Ragas + Promptfoo regressions on every PR.

Industry: Enterprise AI
Timeline: 6 weeks
Key result: 92.4% factuality F1
Tech stack: LangGraph Platform, reasoning-model verifier, GraphRAG / HippoRAG, Cohere Embed v4 + Rerank 3.5, pgvectorscale, LiteLLM gateway, Langfuse, Ragas, Python, FastAPI, vLLM, Redis Streams

The enterprise content team was drowning in AI-drafted copy with no way to catch factual drift until a reader complained. We built a pipeline that does that work before the copy ever ships.

GPT-4o pulls claims out of each draft, Claude 3.5 Sonnet verifies them against retrieved sources, and a consensus scorer decides what’s safe to auto-publish and what needs a human. 30K+ pieces a day, 92.4% factuality F1, under 350 ms per piece.
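
The routing step can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: it assumes the consensus scorer aggregates several verifier votes per claim, and the `Verdict` type, the `route_piece` function and the 0.8 agreement threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Verdict:
    """One verifier's judgement on one claim (hypothetical shape)."""
    claim: str
    supported: bool


def agreement(votes: List[Verdict]) -> float:
    """Fraction of verifier votes that mark the claim as supported."""
    if not votes:
        # No votes means no evidence of support.
        return 0.0
    return sum(v.supported for v in votes) / len(votes)


def route_piece(votes_per_claim: List[List[Verdict]], threshold: float = 0.8) -> str:
    """Return 'auto_publish' only if every claim clears the threshold.

    The threshold is an assumed value, not one taken from the project.
    """
    for votes in votes_per_claim:
        if agreement(votes) < threshold:
            return "human_review"
    return "auto_publish"
```

In this sketch a single weakly supported claim is enough to send the whole piece to a human, which matches the idea that auto-publish fires only when the scorer agrees.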

AI Delivery Approach

Reasoning-model verifier on a LangGraph Platform graph: a persistent LangGraph workflow in which a frontier extractor produces atomic claims and a reasoning model (routed by a LiteLLM AI gateway across o-series, Claude Sonnet 4 extended-thinking and DeepSeek-R1) verifies each one against GraphRAG- and HippoRAG-grounded sources. Auto-publish only fires when the consensus scorer agrees.
Eval harness on commit: Ragas, Promptfoo and Inspect AI run a labelled regression set, drawn from HaluEval and an internal claim corpus, on every prompt or graph change, so we catch a regression before it reaches staging.
Throughput without latency: async batching, streaming retrieval, and careful cache warm-up sustain 30K+ pieces a day while holding p95 latency under 350 ms.
Feedback from the editors: every editor correction flows through Langfuse traces back into the training set. Six months in, the model is catching classes of error it couldn’t on day one, and the AI gateway shows which model is paying its way.
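
As a rough illustration of the async batching idea, the sketch below caps the number of in-flight verifier calls with a semaphore and computes a nearest-rank p95 over recorded latencies. The `verify_all` and `p95` names, the `verify` callable and the `max_in_flight` default are assumptions for the example, not details of the deployed system.

```python
import asyncio
from typing import Awaitable, Callable, List


async def verify_all(
    claims: List[str],
    verify: Callable[[str], Awaitable[bool]],
    max_in_flight: int = 8,
) -> List[bool]:
    """Verify claims concurrently, capping in-flight calls with a semaphore."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(claim: str) -> bool:
        async with gate:
            return await verify(claim)

    # asyncio.gather returns results in the same order as its inputs.
    return await asyncio.gather(*(guarded(c) for c in claims))


def p95(latencies_ms: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Bounding concurrency this way keeps a burst of drafts from overwhelming the verifier backend, at the cost of queueing claims once the cap is reached.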

What was actually hard

The easy version of this is a single LLM saying 'yes, this looks fine'. The production version had to hold factuality under real throughput, explain why it rejected a piece, and never silently approve a claim it couldn’t cite. Those three constraints fight each other — most of the work was keeping all three honest at once.
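
One way to picture the second and third constraints together is a per-claim decision that always returns a reason and refuses to approve anything without a citation. The sketch below is illustrative only: `ClaimCheck`, its field names and the wording of the reasons are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ClaimCheck:
    """Result of checking one claim (field names are hypothetical)."""
    claim: str
    supported: bool
    citations: List[str] = field(default_factory=list)


def decide(check: ClaimCheck) -> Tuple[bool, str]:
    """Return (approved, reason). A claim with no citation is never approved."""
    if not check.citations:
        # Checked first, so an uncited claim cannot pass even if marked supported.
        return False, f"no source found for: {check.claim!r}"
    if not check.supported:
        return False, f"sources do not support: {check.claim!r}"
    return True, f"supported by {len(check.citations)} source(s)"
```

Checking for citations before anything else means a verifier that says "supported" without evidence still produces a rejection, and the reason string gives the editor something concrete to act on.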

Project Outcome

Once live, the content team scaled to 30K+ pieces a day and kept factuality above 92%. Reviews that used to take hours now happen in minutes, and “does this look fine?” turned into a number the editors can point at.
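
For context on that number: F1 is the harmonic mean of precision and recall. The sketch below shows how a claim-level score and a simple regression gate might be computed, assuming `True` means "claim judged factual" and that a small tolerance below the baseline is acceptable. Both assumptions, and the function names, are ours for illustration; the actual harness uses Ragas and Promptfoo.

```python
from typing import List


def f1_score(predicted: List[bool], expected: List[bool]) -> float:
    """F1 for the positive class over paired claim-level labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must be the same length")
    tp = sum(p and e for p, e in zip(predicted, expected))
    fp = sum(p and not e for p, e in zip(predicted, expected))
    fn = sum((not p) and e for p, e in zip(predicted, expected))
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def passes_gate(current: float, baseline: float, tolerance: float = 0.005) -> bool:
    """True if the current score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance
```

A gate like this could run on every prompt or graph change and fail the build when the score drops past the tolerance.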

92.4% factuality F1 score
30K+ pieces/day verified
78% auto-publish rate
<350 ms per-piece latency

LangGraph Platform, Reasoning-model verifier, GraphRAG / HippoRAG, Cohere Embed v4 + Rerank 3.5, pgvectorscale, LiteLLM gateway, Langfuse, Ragas, Llama Guard 4, Python, FastAPI, vLLM, Redis Streams

“From kickoff to production in 6 weeks. The hallucination detection system handles 30,000 pieces a day without breaking a sweat.”