MCP in Production: Building Agentic AI Systems That Actually Ship

April 17, 2026 · ImmovableTech Team

MCP · Agentic AI · LangGraph · Multi-Agent Systems · Production AI

The Protocol That Changed How We Build AI

If you’ve been building AI systems for the past year, you’ve probably noticed a shift. The conversation moved from “which model should I use?” to “how do I connect my model to everything else?” That’s the gap MCP fills.

Model Context Protocol, originally Anthropic’s and now governed by the Linux Foundation’s Agentic AI Foundation, has become the default way AI agents talk to tools, databases, and data sources. It has 97M+ monthly SDK downloads, 10,000+ published servers, and adoption from OpenAI, Google, Microsoft and AWS. It’s boring plumbing in the best possible sense.

We’ve shipped MCP-based systems to production for three clients so far. Here’s what we learned that doesn’t show up in the docs.

What MCP Actually Does (and Doesn’t Do)

MCP is a JSON-RPC protocol that standardises how an AI model calls external tools. Before MCP, every integration was custom: you’d write a function, register it with your framework, handle auth, parse the response, and pray the model called it correctly. With MCP, tool providers publish servers, and any MCP-compatible client can discover and call those tools.
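To make the wire format concrete, here is a minimal sketch of the two JSON-RPC 2.0 messages a client sends to discover and call tools. The method names `tools/list` and `tools/call` come from the MCP specification; the tool name `query_sales` and its arguments are hypothetical, and a real client would normally use an MCP SDK and complete the initialisation handshake first rather than build messages by hand.

```python
import json
from itertools import count

# Monotonic request ids, so responses can be matched to requests.
_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope as a plain dict."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server which tools it exposes.
list_request = jsonrpc_request("tools/list")

# Invocation: call one tool by name with structured arguments.
# "query_sales" and its arguments are made up for illustration.
call_request = jsonrpc_request(
    "tools/call",
    {
        "name": "query_sales",
        "arguments": {"table_name": "orders", "columns": ["total"]},
    },
)

print(json.dumps(call_request, indent=2))
```

The point of the sketch is that the envelope is identical for every tool. Only `name` and `arguments` change, which is what lets any compatible client call any server.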

What MCP doesn’t do: it doesn’t orchestrate. It doesn’t decide which tools to call, in what order, or what to do with the results. That’s where frameworks like LangGraph come in.

Put differently: MCP defines how tools expose themselves. LangGraph decides when and in what order to use them.

Our Production Stack

After three deployments, we’ve settled on a stack that works:

Agent Orchestration:  LangGraph (deterministic state machines)
Tool Protocol:        MCP (JSON-RPC over Streamable HTTP)
Models:               GPT-4o (fast tasks), Claude 3.5 Sonnet (complex reasoning)
Observability:        LangSmith (traces, evals, cost tracking)
Serving:              FastAPI + Redis queues

We tried CrewAI early on. It’s great for demos — you can spin up a multi-agent system in 20 lines. But in production, we needed explicit control over agent handoffs, retry logic, and state persistence. LangGraph’s graph-based approach gave us that. When an agent fails mid-pipeline, we know exactly which node failed, what state it was in, and can retry from that point.
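The retry-from-the-failed-node behaviour can be illustrated without any framework. The sketch below is plain Python and is not LangGraph’s API: the `run_pipeline` helper, the `NodeFailure` exception, and the two example nodes are all hypothetical. It only shows the idea of keeping the last good state and resuming from the node that failed.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Node = Callable[[State], State]


class NodeFailure(Exception):
    """Carries the failing node's name and the last good state."""

    def __init__(self, node_name: str, state: State, cause: Exception):
        super().__init__(f"node {node_name!r} failed: {cause}")
        self.node_name = node_name
        self.state = state
        self.cause = cause


def run_pipeline(nodes: List[Tuple[str, Node]], state: State,
                 start_at: Optional[str] = None) -> State:
    """Run nodes in order, optionally resuming from a named node."""
    names = [name for name, _ in nodes]
    start = names.index(start_at) if start_at else 0
    for name, fn in nodes[start:]:
        try:
            # Pass a copy so a failing node cannot corrupt the checkpoint.
            state = fn(dict(state))
        except Exception as exc:
            raise NodeFailure(name, state, exc) from exc
    return state


attempts = {"score": 0}


def extract(state: State) -> State:
    state["claims"] = ["claim-1", "claim-2"]
    return state


def score(state: State) -> State:
    attempts["score"] += 1
    if attempts["score"] == 1:  # simulate a transient failure
        raise TimeoutError("scoring model timed out")
    state["scores"] = [0.9 for _ in state["claims"]]
    return state


PIPELINE = [("extract", extract), ("score", score)]

failed_node = None
try:
    result = run_pipeline(PIPELINE, {})
except NodeFailure as failure:
    failed_node = failure.node_name
    # Resume from the failed node with the state it received.
    result = run_pipeline(PIPELINE, failure.state, start_at=failed_node)

print(failed_node, result)
```

In this toy run, `extract` executes once and only `score` is retried, which is the property that matters when earlier nodes are slow or expensive.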

The Bits the Docs Skip

Tool Registration Is the Easy Part. Schema Design Is Not.

Registering an MCP tool takes five minutes. Designing a tool schema that models actually use correctly? That takes days. We learned this the hard way on our restaurant intelligence platform.

Our first version of the NL-to-SQL tool had a parameter called query that accepted a free-form SQL string. The model generated valid SQL about 60% of the time. When we restructured the tool to accept structured parameters — table_name, columns, filters, group_by — accuracy jumped to 94%. The model was fine at reasoning; it was bad at writing raw SQL from scratch.

The lesson: design tool schemas for the model, not for a human developer. Models work better with constrained, structured inputs than open-ended strings.
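As an illustration of the structured approach, here is a minimal sketch of a tool that accepts `table_name`, `columns`, `filters`, and `group_by` and assembles the SQL itself. The parameter names come from the text above; the `orders` table, its columns, the JSON Schema details, and the equality-only filters are assumptions made for the example, not the schema we shipped.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical allow-list of tables and columns the tool may touch.
ALLOWED = {"orders": {"restaurant_id", "order_date", "total"}}

# JSON Schema that would be published as the tool's input schema.
QUERY_TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "table_name": {"type": "string", "enum": sorted(ALLOWED)},
        "columns": {"type": "array", "items": {"type": "string"}, "minItems": 1},
        "filters": {
            "type": "object",
            "additionalProperties": {"type": ["string", "number"]},
        },
        "group_by": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["table_name", "columns"],
}


def build_query(table_name: str, columns: List[str],
                filters: Optional[Dict[str, object]] = None,
                group_by: Optional[List[str]] = None) -> Tuple[str, list]:
    """Turn structured arguments into parameterised SQL (equality filters only)."""
    filters = filters or {}
    group_by = group_by or []
    if table_name not in ALLOWED:
        raise ValueError(f"unknown table: {table_name}")
    if not columns:
        raise ValueError("at least one column is required")
    known = ALLOWED[table_name]
    for col in [*columns, *filters, *group_by]:
        if col not in known:
            raise ValueError(f"unknown column: {col}")
    sql = f"SELECT {', '.join(columns)} FROM {table_name}"
    if filters:
        # Values travel as bound parameters, never as inline SQL.
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in filters)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql, list(filters.values())


sql, params = build_query(
    "orders",
    ["restaurant_id"],
    filters={"order_date": "2026-04-01"},
    group_by=["restaurant_id"],
)
print(sql, params)
```

The model only has to pick names from a constrained vocabulary. Anything outside the allow-list fails loudly with a `ValueError` instead of producing plausible-looking but wrong SQL.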

Multi-Agent ≠ Better

Gartner predicts 40% of enterprise apps will embed AI agents by the end of 2026. But most of those should be single-agent systems. We’ve built multi-agent pipelines, and they’re harder to debug, slower to execute, and only justified when tasks genuinely require different capabilities.

Our hallucination detection pipeline uses five agents because its stages, such as claim extraction, source retrieval, and factuality scoring, are genuinely different tasks that benefit from specialisation. Our content generation system started with four agents, and we merged them down to two: the “editor” and “reviewer” agents were doing overlapping work and adding 3 seconds of latency for no accuracy gain.

The rule of thumb: start with one agent. Add another only when you can prove (with evals) that splitting the task improves output quality enough to justify the added complexity and latency.
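One way to make that rule mechanical is a small gate over eval results. This is a sketch under assumptions: the function name, the 0.05 minimum gain, and the 2-second latency ceiling are illustrative placeholders, not thresholds we recommend for any particular system.

```python
from statistics import mean
from typing import Sequence


def should_split(single_scores: Sequence[float],
                 multi_scores: Sequence[float],
                 added_latency_s: float,
                 min_gain: float = 0.05,
                 max_added_latency_s: float = 2.0) -> bool:
    """Approve a split only if evals show enough gain at acceptable latency.

    Scores are per-example eval scores on the same dataset, higher is better.
    The default thresholds are placeholders to tune per project.
    """
    gain = mean(multi_scores) - mean(single_scores)
    return gain >= min_gain and added_latency_s <= max_added_latency_s


# No accuracy gain and 3 extra seconds: keep the agents merged.
print(should_split([0.8, 0.8], [0.8, 0.8], added_latency_s=3.0))

# Clear gain for 1 extra second: the split is justified.
print(should_split([0.7, 0.7], [0.9, 0.9], added_latency_s=1.0))
```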

Observability Is Not Optional

Non-deterministic systems are terrifying to operate without observability. When a traditional API returns a wrong answer, you check the logs and find the bug. When an agent returns a wrong answer, the “bug” might be a model hallucination, a tool returning unexpected data, a prompt that doesn’t handle an edge case, or a combination of all three.

LangSmith has been essential for us. We trace every agent run end-to-end: which tools were called, what the model “thought” (its chain-of-thought), what each tool returned, and how long each step took. When something goes wrong, we can replay the exact sequence and identify whether the issue was in the model’s reasoning, the tool’s response, or our prompt.

The cost tracking alone justified the tool. One client’s agent was making 40 tool calls per request instead of the expected 4-6, burning through their API budget. We only caught it because LangSmith showed the call pattern.
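The check that surfaces this kind of runaway loop is simple once the trace data is in hand. The sketch below does not use the LangSmith API. It assumes tool calls have already been exported into a plain mapping of run id to tool names, and the run ids, tool names, and the `expected_max` of 6 are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple


def flag_runaway_runs(tool_calls_by_run: Dict[str, List[str]],
                      expected_max: int = 6) -> Dict[str, int]:
    """Return runs whose tool-call count exceeds the expected ceiling."""
    return {
        run_id: len(calls)
        for run_id, calls in tool_calls_by_run.items()
        if len(calls) > expected_max
    }


def most_repeated_tool(calls: List[str]) -> Tuple[str, int]:
    """Identify which tool dominates a run, a common sign of a retry loop."""
    return Counter(calls).most_common(1)[0]


# Hypothetical exported traces: run id -> ordered tool names.
runs = {
    "run-a": ["search", "fetch", "search", "summarise"],
    "run-b": ["search"] * 38 + ["fetch", "summarise"],
}

flagged = flag_runaway_runs(runs)
print(flagged)
print(most_repeated_tool(runs["run-b"]))
```

Running a check like this on a schedule, and alerting when anything is flagged, turns a surprise invoice into a same-day bug report.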

A2A: The Other Protocol You’ll Need

MCP handles agent-to-tool communication. Google’s Agent2Agent (A2A) protocol handles agent-to-agent communication. They’re complementary, and both are now hosted by the Linux Foundation.

We haven’t deployed A2A in production yet — the spec is newer and the tooling is less mature. But the pattern is clear: as systems grow from “one agent with tools” to “multiple agents collaborating,” you need a standard for how Agent A delegates a subtask to Agent B, how Agent B reports back, and how both handle failures.

For now, we handle inter-agent communication through LangGraph’s built-in node-to-node state passing. When A2A tooling matures, we’ll likely migrate the cross-service agent communication to it.

What We’d Build Differently

If we started our first MCP project today instead of six months ago:

Remote MCP servers from day one. We started with local stdio-based servers and had to migrate to remote HTTP servers for production. The ecosystem has converged on remote, OAuth-secured servers. Start there.
Structured tool outputs, not free text. Our early tools returned plain text descriptions. Models parse structured JSON far more reliably. Every tool should return typed, schema-validated responses.
Cost budgets per agent run. We now set a maximum token spend per request. If the agent exceeds it, the run terminates gracefully with a partial result rather than spinning indefinitely. This should have been in place from the start.
Evaluation datasets before architecture. We spent two weeks building an agent pipeline before we had a way to measure if it worked. Now we build the eval set first, always.
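The per-run cost budget from the list above can be sketched in a few lines. Everything here is an assumption for illustration: the `TokenBudget` class, the `run_agent` loop, and the idea that each step’s token cost is known up front (in practice you would usually charge the budget from the usage reported after each model call).

```python
from dataclasses import dataclass
from typing import Any, Dict, List


class BudgetExceeded(Exception):
    """Raised when a charge would push the run past its token budget."""


@dataclass
class TokenBudget:
    max_tokens: int
    spent: int = 0

    def charge(self, tokens: int) -> None:
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.spent + tokens} tokens would exceed {self.max_tokens}"
            )
        self.spent += tokens


def run_agent(steps: List[Dict[str, Any]], budget: TokenBudget) -> Dict[str, Any]:
    """Execute steps until done or until the budget would be exceeded.

    Returns a structured result either way, so callers can tell a
    partial answer from a complete one.
    """
    results: List[str] = []
    for step in steps:
        try:
            budget.charge(step["tokens"])
        except BudgetExceeded:
            return {"status": "partial", "results": results,
                    "tokens_spent": budget.spent}
        results.append(step["output"])
    return {"status": "complete", "results": results,
            "tokens_spent": budget.spent}


# Three hypothetical steps of 400 tokens each against a 1000-token budget.
steps = [{"tokens": 400, "output": f"step-{i}"} for i in range(1, 4)]
outcome = run_agent(steps, TokenBudget(max_tokens=1000))
print(outcome)
```

Returning a typed `status` field rather than raising to the caller also follows the structured-output point: downstream code can branch on `partial` without parsing free text.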

Where We’ve Actually Landed

MCP is real infrastructure, not a trend. The protocol is stable, the tooling is maturing fast, and the big labs are all-in. If you’re building AI systems that have to touch real databases, real APIs, real file systems, or real internal tools, MCP is the sensible default for connecting them.

But the protocol is just the plumbing. The hard work sits in schema design, orchestration, evaluation and observability. Those are engineering problems, not API problems — and they’re where production systems quietly succeed or quietly fall over.

We build MCP-based agentic systems as part of our AI & Machine Learning Engineering practice. Talk to us if you’re building something similar.