$ cd ~/projects/hallucination-detection agent.shipped · in production

LLM Safety
at Scale.

30K+ pieces of LLM-generated content a day,
each one checked by a frontier extractor and a
reasoning-model verifier before it reaches a reader.
92.4% factuality F1, sub-350 ms p95 per claim,
Ragas + Promptfoo regressions on every PR.


Hallucination detection & content verification

Industry: Enterprise AI
Timeline: 6 weeks
Key result: 92.4% factuality F1
Tech stack: LangGraph Platform, reasoning-model verifier, GraphRAG / HippoRAG, Cohere Embed v4 + Rerank 3.5, pgvectorscale, LiteLLM gateway, Langfuse, Ragas, Python, FastAPI, vLLM, Redis Streams

The enterprise content team was drowning in AI-drafted copy with no way to catch factual drift until a reader complained. We built a pipeline that does that work before the copy ever ships.

GPT-4o pulls claims out of each draft, Claude 3.5 Sonnet verifies them against retrieved sources, and a consensus scorer decides what’s safe to auto-publish and what needs a human. The pipeline handles 30K+ pieces a day at 92.4% factuality F1, in under 350 ms per piece.
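As an illustration of the last step, here is a minimal Python sketch of a routing rule a consensus scorer could apply. The page does not describe the actual scoring rule, so everything here is an assumption: the `Verdict` labels, the `ClaimCheck` record, the `route` function and the 0.9 confidence threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    REFUTED = "refuted"
    UNVERIFIABLE = "unverifiable"


@dataclass(frozen=True)
class ClaimCheck:
    claim: str
    verdict: Verdict
    confidence: float  # verifier's confidence in [0, 1] (assumed field)


def route(checks: list[ClaimCheck], min_confidence: float = 0.9) -> str:
    """Return "auto_publish" only if every claim is supported with enough confidence."""
    if not checks:
        # Nothing was extracted, so nothing was verified: send to a human.
        return "human_review"
    for check in checks:
        if check.verdict is not Verdict.SUPPORTED or check.confidence < min_confidence:
            return "human_review"
    return "auto_publish"
```

The rule is deliberately conservative in this sketch: a single refuted, unverifiable or low-confidence claim is enough to hold the whole piece for an editor.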

AI Delivery Approach
  • Reasoning-model verifier on a LangGraph Platform graph: A persistent LangGraph workflow in which a frontier extractor produces atomic claims and a reasoning model verifies each one against GraphRAG- and HippoRAG-grounded sources. A LiteLLM AI gateway routes verification across o-series models, Claude Sonnet 4 with extended thinking, and DeepSeek-R1. Auto-publish fires only when the consensus scorer agrees.

  • Eval harness on commit: Ragas, Promptfoo, and Inspect AI run a labelled regression set, drawn from HaluEval and an internal claim corpus, on every prompt or graph change, so we catch a regression before it reaches staging.

  • Throughput without a latency penalty: Async batching, streaming retrieval, and careful cache warm-up sustain 30K+ pieces a day while holding p95 latency under 350 ms.

  • Feedback from the editors: Every editor correction flows through Langfuse traces back into the training set. Six months in, the model is catching classes of error it couldn’t on day one, and the AI gateway shows which model is paying its way.
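
The async batching mentioned above can be sketched with Python's standard `asyncio` library. This is an illustration under stated assumptions, not the production code: `verify_claim` is a hypothetical stand-in for a real verifier call, and the `max_in_flight` limit of 8 is an arbitrary example value.

```python
import asyncio


async def verify_claim(claim: str) -> bool:
    """Hypothetical stand-in for a call to the verifier model."""
    await asyncio.sleep(0.01)  # simulates network latency
    return bool(claim.strip())  # placeholder logic: blank claims fail


async def verify_batch(claims: list[str], max_in_flight: int = 8) -> list[bool]:
    """Verify claims concurrently, never exceeding max_in_flight open calls."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(claim: str) -> bool:
        async with semaphore:
            return await verify_claim(claim)

    # gather preserves input order, so results[i] belongs to claims[i].
    return await asyncio.gather(*(guarded(c) for c in claims))


results = asyncio.run(
    verify_batch(["The sky is blue.", "  ", "Water boils at 100 °C at sea level."])
)
```

Bounding the number of in-flight calls keeps a burst of drafts from overwhelming the model gateway, which is one common way to trade a little queueing for steadier tail latency.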

What was actually hard

The easy version of this is a single LLM saying 'yes, this looks fine'. The production version had to hold factuality under real throughput, explain why it rejected a piece, and never silently approve a claim it couldn’t cite. Those three constraints fight each other — most of the work was keeping all three honest at once.
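
Two of those constraints, explaining every rejection and never approving an uncited claim, can be enforced in the data model itself. The sketch below is one hypothetical way to do that in Python; `ClaimDecision` and `decide` are illustrative names, and the "any retrieved source counts as support" rule is a placeholder for a real verifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimDecision:
    claim: str
    approved: bool
    reason: str  # human-readable explanation shown to editors
    citations: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # An approval without a citation cannot be constructed at all.
        if self.approved and not self.citations:
            raise ValueError("an approved claim must carry at least one citation")
        if not self.reason.strip():
            raise ValueError("every decision must explain itself")


def decide(claim: str, sources: list[str]) -> ClaimDecision:
    """Placeholder verifier: approve only when at least one source was retrieved."""
    if not sources:
        return ClaimDecision(claim, False, "no supporting source was retrieved")
    return ClaimDecision(
        claim, True, f"supported by {len(sources)} source(s)", tuple(sources)
    )
```

Because the check lives in the constructor, a code path that forgets to attach sources fails loudly instead of silently approving the claim.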


Project Outcome

Once live, the content team scaled to 30K+ pieces a day and kept factuality above 92%. Reviews that used to take hours now happen in minutes, and “does this look fine?” turned into a number the editors can point at.

  • 92.4% factuality F1 score
  • 30K+ pieces/day verified
  • 78% auto-publish rate
  • <350 ms per-piece latency
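
For readers who want to see how headline numbers like these are typically computed, here is a small Python sketch of an F1 score and a nearest-rank p95. The page does not say how the labelled set or the latency samples were collected, so the function names and the nearest-rank choice are assumptions made for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion counts: the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def p95(latencies_ms: list[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

Note that a p95 target bounds 95% of requests, so by definition up to 5% of samples may exceed it.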
LangGraph Platform, Reasoning-model verifier, GraphRAG / HippoRAG, Cohere Embed v4 + Rerank 3.5, pgvectorscale, LiteLLM gateway, Langfuse, Ragas, Llama Guard 4, Python, FastAPI, vLLM, Redis Streams

“From kickoff to production in 6 weeks. The hallucination detection system handles 30,000 pieces a day without breaking a sweat.”

@ Tom H.

VP Product — Enterprise Content Platform

  • [GraphRAG] grounding
  • [Reasoning model] verifier
  • [Ragas] evals
  • [Langfuse] tracing
  • [Cohere Rerank] retrieval
  • [Llama Guard] safety