LLM Safety at Scale.
30K+ pieces of LLM-generated content a day, each one checked by a frontier extractor and a reasoning-model verifier before it reaches a reader.
92.4% factuality F1, sub-350 ms p95 per claim, Ragas + Promptfoo regressions on every PR.
Hallucination detection & content verification
The enterprise content team was drowning in AI-drafted copy, with no way to catch factual drift until a reader complained. We built a pipeline that catches that drift before the copy ships.
GPT-4o pulls claims out of each draft, Claude 3.5 Sonnet verifies them against retrieved sources, and a consensus scorer decides what’s safe to auto-publish and what needs a human. 30K+ pieces a day, 92.4% factuality F1, under 350 ms per piece.
AI Delivery Approach
- Reasoning-model verifier on a LangGraph Platform graph: A persistent LangGraph workflow in which a frontier extractor produces atomic claims and a reasoning model (routed by a LiteLLM AI gateway across o-series, Claude Sonnet 4 extended-thinking, and DeepSeek-R1) verifies each one against sources grounded with GraphRAG and HippoRAG. Auto-publish fires only when the consensus scorer agrees.
- Eval harness on commit: Ragas, Promptfoo, and Inspect AI run labelled regressions against HaluEval and an internal claim corpus on every prompt or graph change, so we catch a regression before it reaches staging.
- Throughput without latency: Async batching, streaming retrieval, and careful cache warm-up sustain 30K+ pieces a day while keeping p95 latency under 350 ms.
- Feedback from the editors: Every editor correction flows through Langfuse traces back into the training set. Six months in, the model is catching classes of error it couldn’t on day one, and the AI gateway shows which model is paying its way.
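The eval-on-commit step can be pictured as a gate on claim-level F1. What follows is a minimal sketch under stated assumptions, not the harness itself: it treats "unsupported claim" as the positive class, and the `LabelledClaim` type, the `f1_gate` helper, and the 0.01 tolerance are hypothetical names and values that do not come from Ragas, Promptfoo, or Inspect AI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledClaim:
    """One row of a labelled regression set (hypothetical shape)."""
    is_unsupported: bool  # ground-truth label: the claim has no source support
    flagged: bool         # verifier output: the claim was flagged as unsupported


def f1_score(examples: list[LabelledClaim]) -> float:
    """Claim-level F1 with 'unsupported' as the positive class (an assumption)."""
    tp = sum(e.is_unsupported and e.flagged for e in examples)
    fp = sum((not e.is_unsupported) and e.flagged for e in examples)
    fn = sum(e.is_unsupported and (not e.flagged) for e in examples)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_gate(examples: list[LabelledClaim], baseline_f1: float,
            tolerance: float = 0.01) -> bool:
    """Return True when the candidate holds F1 within `tolerance` of the baseline."""
    return f1_score(examples) >= baseline_f1 - tolerance
```

A CI job could call `f1_gate` with the F1 of the last released prompt or graph as `baseline_f1` and fail the PR when it returns `False`.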
What was actually hard
The easy version of this is a single LLM saying 'yes, this looks fine'. The production version had to hold factuality under real throughput, explain why it rejected a piece, and never silently approve a claim it couldn’t cite. Those three constraints fight each other — most of the work was keeping all three honest at once.
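One way to picture how the explainability and citation constraints meet in a single routing decision is sketched below. This is an illustration, not the production scorer: the `Verdict` and `Decision` types, the `route_piece` function, and the 0.9 agreement threshold are all hypothetical, and it assumes the verifier returns a support flag, an agreement score, and a list of cited sources for each claim.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Verdict:
    """Verifier output for one atomic claim (hypothetical shape)."""
    claim: str
    supported: bool
    agreement: float                 # share of verifier runs that agree, 0.0 to 1.0
    citations: tuple[str, ...] = ()  # identifiers of the sources that back the claim


@dataclass
class Decision:
    auto_publish: bool
    reasons: list[str] = field(default_factory=list)


def route_piece(verdicts: list[Verdict], min_agreement: float = 0.9) -> Decision:
    """Auto-publish only if every claim is cited, supported, and agreed on.

    Every failing claim adds a human-readable reason, so a rejection is
    never unexplained. A piece with no extracted claims passes trivially
    in this sketch.
    """
    reasons: list[str] = []
    for v in verdicts:
        if not v.citations:
            # An uncited claim is never approved, whatever the verifier said.
            reasons.append(f"no citation: {v.claim}")
        elif not v.supported:
            reasons.append(f"not supported by sources: {v.claim}")
        elif v.agreement < min_agreement:
            reasons.append(f"low agreement ({v.agreement:.2f}): {v.claim}")
    return Decision(auto_publish=not reasons, reasons=reasons)
```

Checking for citations first means a confident but uncited verdict still lands in the human review queue, and the `reasons` list gives editors the explanation for each rejection.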

Project Outcome
Once live, the content team scaled to 30K+ pieces a day and kept factuality above 92%. Reviews that used to take hours now happen in minutes, and “does this look fine?” turned into a number the editors can point at.
92.4% F1 score
30K+ pieces/day verified
78% auto-publish rate
<350 ms per-piece latency


“From kickoff to production in 6 weeks. The hallucination detection system handles 30,000 pieces a day without breaking a sweat.”
Tom H.
VP Product — Enterprise Content Platform



