
LLM Safety
Hallucination Detection & Content Safety
A two-model consensus loop that catches factual errors before readers do. 30K+ pieces a day, 92.4% F1.
LLMs break in weird ways once real users show up.
We build the Ragas / Promptfoo evals, the GraphRAG grounding,
the Llama Guard / ShieldGemma guardrails, and the
Langfuse / Arize Phoenix observability that catches it
— then the vLLM, SGLang and TensorRT-LLM serving
that keeps catching it at 3 a.m. on a Sunday.
Model design, training loops, eval harnesses (Ragas, Promptfoo, DeepEval, Inspect AI), inference on vLLM / SGLang / TensorRT-LLM with continuous batching and speculative decoding, MCP tool servers, drift monitors and Langfuse / Arize Phoenix tracing — the lot. If it starts with "can we use AI for this?" and ends with a pager rotation, we have probably done it.
We obsess about the unglamorous bits: labeling quality, cost per call after an AI gateway (Portkey, LiteLLM, Cloudflare AI Gateway, Bedrock Gateway) routes the request, what happens when the retriever returns nothing, what your ops team does when a model misfires at 11pm. Those are the things that decide whether an AI feature survives its first quarter.
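To make the "retriever returns nothing" case concrete, here is a minimal sketch of one way to handle it: skip generation and say so, rather than let the model improvise. Every name in it (`Answer`, `answer_with_fallback`, `toy_retrieve`, `toy_generate`) is hypothetical and stands in for whatever retriever and model client you actually run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Answer:
    text: str
    sources: List[str]
    held_back: bool


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    fallback_text: str = "I couldn't find a source for that, so I won't guess.",
) -> Answer:
    """Only call the model when the retriever found something to ground on."""
    passages = retrieve(question)
    if not passages:
        # Empty retrieval: return a fixed fallback instead of an ungrounded answer.
        return Answer(text=fallback_text, sources=[], held_back=True)
    return Answer(text=generate(question, passages), sources=passages, held_back=False)


# Stand-in retriever and generator, purely for illustration (assumptions, not a real API).
def toy_retrieve(question: str) -> List[str]:
    corpus = {"refund": ["Refunds are processed within 14 days."]}
    return [p for key, ps in corpus.items() if key in question.lower() for p in ps]


def toy_generate(question: str, passages: List[str]) -> str:
    return passages[0]
```

The useful property is that the empty case is an explicit branch with its own flag, so it can be counted on a dashboard instead of hiding inside model output.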
Frame the problem honestly — If an 80-line Python script would solve it, we'll tell you. If a small fine-tune beats a frontier model, we'll tell you that too. Then we scope what ML actually adds.
Build small, evaluate often — Reproducible pipelines, held-out sets, Ragas / Promptfoo / Inspect AI suites on day one, BIRD-SQL or HaluEval baselines where they fit. No magic runs.
Ship with a seatbelt on — Llama Guard 4, ShieldGemma, NeMo Guardrails or Bedrock Guardrails on the input and output, drift alerts on Langfuse / Arize Phoenix, rollback triggers, and a rehearsed failure mode before we let it near a user.
Keep it sharp after launch — Flags, A/Bs, human feedback loops, a labelled regression set on every prompt change, and a quarterly review. Models rot if you don't look at them.
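The "labelled regression set on every prompt change" step can be sketched as a small gate that scores a detector against labelled cases and blocks the change if F1 drops below a floor. This is an illustration under assumptions: `regression_gate`, the case format and the 0.9 floor are hypothetical, and in practice the scoring would sit inside whichever eval suite the project already uses.

```python
from typing import Callable, List, Tuple


def f1_score(labels: List[bool], preds: List[bool]) -> float:
    """F1 for the positive class (True = 'this text contains an error')."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def regression_gate(
    cases: List[Tuple[str, bool]],
    detector: Callable[[str], bool],
    min_f1: float,
) -> Tuple[bool, float]:
    """Return (passed, score) for a detector run over labelled (text, label) cases."""
    labels = [label for _, label in cases]
    preds = [detector(text) for text, _ in cases]
    score = f1_score(labels, preds)
    return score >= min_f1, score


# Hypothetical labelled set and detector, for illustration only.
CASES = [
    ("The Eiffel Tower is in Berlin.", True),
    ("Water boils at 50 C at sea level.", True),
    ("Paris is the capital of France.", False),
    ("A week has seven days.", False),
]


def toy_detector(text: str) -> bool:
    return "Berlin" in text  # catches one of the two seeded errors


passed, score = regression_gate(CASES, toy_detector, min_f1=0.9)
```

Wired into CI, a `passed` value of `False` would fail the build, which is what keeps a prompt tweak from quietly lowering recall.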

A week. We look at your data, pick the metric that actually matters, and write down what "done" means — eval set, latency budget, cost ceiling — before any training starts.
Held-out sets, error analysis, a shared eval dashboard on Ragas / Promptfoo / Inspect AI / Braintrust. If a Chronos / GPT-class baseline beats the bespoke model, we say so.
Inference on vLLM / SGLang / TensorRT-LLM behind an AI gateway, Llama Guard 4 / ShieldGemma guardrails, rollback plan, runbook. We rehearse the first bad response before a user ever sees one.
Langfuse / Arize Phoenix tracing, drift alerts, weekly review, and a priority queue that lines up the next experiment. Models stay sharp, not just launched.
Got a model that's stuck in notebooks? Send us the brief
If your question isn’t here, email us. We read everything that comes in.
Data prep, training, eval, deployment and the on-call pager rotation if you want it. LLM features, agentic systems on MCP, reasoning-model verifiers, classic ML — whichever fits the problem, not whichever is trendy this quarter.
You can't zero hallucinations out, but you can cage them. We use grounded retrieval (GraphRAG / HippoRAG), structured outputs, automated eval suites (Ragas, Promptfoo), guardrails (Llama Guard 4, ShieldGemma, NeMo Guardrails) and one piece of human review where it counts, so every answer is traceable to a source or held back.
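As a rough sketch of "traceable to a source or held back", a two-checker consensus gate might look like the following. It is an assumption about the general shape, not a description of the production loop: `Verdict`, `consensus_verdict` and both toy checkers are hypothetical, and real checkers would be model calls rather than string matching.

```python
import re
from enum import Enum
from typing import Callable, List


class Verdict(Enum):
    PUBLISH = "publish"
    HUMAN_REVIEW = "human_review"
    HOLD_BACK = "hold_back"


# A checker decides whether the sources support the claim.
Checker = Callable[[str, List[str]], bool]


def consensus_verdict(
    claim: str, sources: List[str], checker_a: Checker, checker_b: Checker
) -> Verdict:
    """Publish only when both checkers agree the claim is supported."""
    if not sources:
        return Verdict.HOLD_BACK  # nothing to trace the claim to
    a = checker_a(claim, sources)
    b = checker_b(claim, sources)
    if a and b:
        return Verdict.PUBLISH
    if a != b:
        return Verdict.HUMAN_REVIEW  # disagreement goes to a person
    return Verdict.HOLD_BACK


# Toy checkers, for illustration only.
def text_checker(claim: str, sources: List[str]) -> bool:
    return any(claim.lower() in s.lower() for s in sources)


def number_checker(claim: str, sources: List[str]) -> bool:
    joined = " ".join(sources)
    return all(n in joined for n in re.findall(r"\d+(?:\.\d+)?", claim))


SOURCES = ["Refunds are processed within 14 days."]
```

Routing disagreements to a reviewer is one design choice among several; a stricter setup could hold back anything short of full agreement.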
Yes, we've shipped on AWS, GCP and Azure, on top of Databricks, Snowflake, Redshift and most of the vector DB ecosystem (pgvector / pgvectorscale, Qdrant, Weaviate, Vespa, Turbopuffer, Pinecone, Milvus). We fit into what you run; we don’t force a migration to justify the invoice.
Most teams see a thin slice in production inside six weeks. Not a demo, not a Friday-afternoon Streamlit: a real endpoint serving real traffic behind a flag, with Langfuse traces and a Ragas eval set running on every deploy.
Langfuse, Arize Phoenix, LangSmith or Datadog LLM Observability for prompts and outputs (where it’s legal), latency and cost on a visible dashboard via the AI gateway, drift alerts that wake the right person, and a quality regression test that runs on every deploy. The monitoring is something your team can actually read, not a black box we keep.
Only if you want us to. Most clients sign a light retainer for retraining, eval updates and roadmap. A few take the handover and run it in-house. We write the docs with that exit in mind either way.