
The Agent Stack in 2026: From Chatbots to Autonomous Systems

April 9, 2026 · ImmovableTech Team

  • Agentic AI
  • MCP
  • A2A Protocol
  • Enterprise AI
  • Multi-Agent Orchestration

The Chatbot Era Is Over

Remember when “AI integration” meant bolting a ChatGPT wrapper onto your product? That was 2023. By 2024, the market figured out that chatbots without tools are just expensive autocomplete. The real value wasn’t in generating text — it was in AI that could do things: query databases, call APIs, modify files, trigger workflows.

That realisation kicked off the agentic revolution. And by early 2026, the dust has settled enough to see what the production agent stack actually looks like. It’s not one framework. It’s a layered architecture with protocols, orchestration engines, and operational tooling that looks surprisingly similar to how we built microservices a decade ago.

The Stack, Layer by Layer

Layer 1: Tool Protocol — MCP

The Model Context Protocol is the HTTP of agents. It defines how an AI model discovers and calls external tools via a standard JSON-RPC interface. Anthropic created it, the Linux Foundation governs it, and every major AI lab supports it. Its SDKs see ninety-seven million monthly downloads as of early 2026.

Before MCP, integrating a tool meant writing custom code for each model provider’s function-calling API. OpenAI had one format, Anthropic had another, Google had a third. MCP collapsed that into a single standard. You build one MCP server, and any MCP-compatible model can use it.
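
To make the "single standard" concrete, here is a minimal sketch of the JSON-RPC 2.0 messages a client sends to discover and call a tool. The `tools/list` and `tools/call` method names come from the MCP specification; the tool name `search_kb`, its `query` argument, and the `jsonrpc_request` helper are hypothetical. A real client would normally use an MCP SDK and complete the initialization handshake first.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # each request in a session needs its own id


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: ask the server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invocation: call one tool by name with structured arguments.
# "search_kb" and "query" are made-up names for illustration.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "search_kb", "arguments": {"query": "refund policy"}},
)

print(json.dumps(list_tools))
print(json.dumps(call_tool, indent=2))
```

The point of the sketch is that nothing in either message names a model provider: the same two requests work regardless of which model decided to make the call.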

The practical impact: we built one MCP server for a client’s internal knowledge base, and it worked immediately with GPT-4o, Claude, and Gemini without any model-specific code.

Layer 2: Agent Communication — A2A

Google’s Agent2Agent (A2A) protocol addresses a different problem: how do agents from different systems talk to each other? MCP is agent-to-tool. A2A is agent-to-agent.

The distinction matters at enterprise scale. Imagine a customer support agent that needs to check order status (handled by a logistics agent), verify payment (handled by a finance agent), and update the CRM (handled by a sales agent). Each agent might run in a different service, built by a different team, using a different model. A2A standardises how they negotiate tasks, share context, and report results.

We haven’t deployed A2A in production yet, since the spec is newer than MCP, but we’re watching it closely. The pattern is inevitable: as agent systems grow, you need a protocol for inter-agent delegation, not just inter-agent prompt chaining.

Layer 3: Orchestration — LangGraph, CrewAI, Microsoft Agent Framework

This is where the “thinking” happens. Orchestration frameworks define the control flow: which agent runs first, what happens when it fails, how state passes between agents, and when a human gets pulled into the loop.

We’ve worked with all three major frameworks, though not all of them in production. Here’s our honest assessment:

LangGraph — Our default for production. It models agents as nodes in a directed graph with explicit state transitions. You get deterministic control flow, built-in persistence, and the ability to replay failed runs from the exact point of failure. The downside: the learning curve is steep. You’re building state machines, not writing prompts.
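
The idea of "nodes in a directed graph with explicit state transitions" and replay from the point of failure can be sketched in plain Python. This is deliberately not LangGraph's API: the `Graph` class, the node names, and the simulated timeout are all hypothetical, and the checkpoint list stands in for real persistence.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, object]
# A node takes the state and returns (new_state, name_of_next_node or None).
Node = Callable[[State], Tuple[State, Optional[str]]]


class Graph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        # (node about to run, state before it ran)
        self.checkpoints: List[Tuple[str, State]] = []

    def add_node(self, name: str, fn: Node) -> None:
        self.nodes[name] = fn

    def run(self, start: str, state: State) -> State:
        current: Optional[str] = start
        while current is not None:
            # Save state before each node so a failed node can be retried.
            self.checkpoints.append((current, copy.deepcopy(state)))
            state, current = self.nodes[current](state)
        return state

    def resume(self) -> State:
        """Re-run from the last checkpoint, i.e. the node that failed."""
        node, state = self.checkpoints.pop()
        return self.run(node, state)


attempts = {"lookup": 0}


def classify(state: State) -> Tuple[State, Optional[str]]:
    return {**state, "intent": "order_status"}, "lookup"


def lookup(state: State) -> Tuple[State, Optional[str]]:
    attempts["lookup"] += 1
    if attempts["lookup"] == 1:
        raise TimeoutError("tool timed out")  # simulated first-call failure
    return {**state, "answer": "shipped"}, None


g = Graph()
g.add_node("classify", classify)
g.add_node("lookup", lookup)

try:
    final = g.run("classify", {"question": "Where is my order?"})
except TimeoutError:
    final = g.resume()  # classify is not re-run; only lookup is retried

print(final)
```

Because every transition is explicit and checkpointed, the retry skips work that already succeeded. That is the property that makes this style harder to learn but easier to operate.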

CrewAI — Great for prototyping and simpler workflows. You can define agents with plain-text roles and let the framework handle coordination. But the implicit orchestration becomes a liability in production. When something goes wrong, it’s hard to determine why one agent was invoked instead of another.

Microsoft Agent Framework — Newer, enterprise-focused, deeply integrated with Azure. We haven’t used it in production yet but it’s the likely choice for clients already on the Microsoft stack.

Layer 4: Models — Horses for Courses

The “one model to rule them all” era is over. Production agent systems typically use two or three models:

  • A fast model for simple routing and classification (GPT-4o-mini, Claude 3.5 Haiku)
  • A reasoning model for complex decisions and multi-step planning (GPT-4o, Claude 3.5 Sonnet)
  • A domain-specific model for specialised tasks (fine-tuned models, SLMs like Phi-4 for edge deployment)

The key metric isn’t accuracy per model — it’s cost per completed task. A system that routes 80% of requests to a fast model and 20% to a reasoning model costs a fraction of one that sends everything to the expensive model. This is where FinOps for AI comes in.
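
A quick worked example of "cost per completed task" for the 80/20 split above. The per-attempt prices and completion rates below are invented for illustration, not measured figures, and `cost_per_completed_task` is a hypothetical helper.

```python
from typing import List, Tuple

# (share of traffic, cost per attempt in USD, task completion rate)
Route = Tuple[float, float, float]


def cost_per_completed_task(routes: List[Route]) -> float:
    """Expected spend divided by expected completed tasks, per request."""
    total_share = sum(share for share, _, _ in routes)
    assert abs(total_share - 1.0) < 1e-9, "traffic shares must sum to 1"
    spend = sum(share * cost for share, cost, _ in routes)
    completed = sum(share * rate for share, _, rate in routes)
    return spend / completed


# Hypothetical numbers: (cost per attempt, completion rate).
FAST = (0.002, 0.95)
REASONING = (0.03, 0.95)

tiered = cost_per_completed_task([(0.8, *FAST), (0.2, *REASONING)])
single = cost_per_completed_task([(1.0, *REASONING)])

print(f"tiered: ${tiered:.4f} per completed task")
print(f"single: ${single:.4f} per completed task")
print(f"tiered costs {tiered / single:.0%} of the single-model setup")
```

With these made-up numbers the tiered setup comes out at roughly a quarter of the single-model cost. The completion rate belongs in the formula because a cheap model that fails often can cost more per finished task than an expensive one that succeeds.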

Layer 5: Observability — You Can’t Debug What You Can’t See

Agent systems are non-deterministic. The same input can produce different tool-calling sequences, different intermediate results, and different final outputs depending on model temperature, tool response time, and context window state.

Debugging this without observability is like debugging a distributed system without logs — technically possible, practically insane. LangSmith has become our standard. It traces every agent run end-to-end: model calls, tool invocations, intermediate reasoning, costs, and latencies.

The critical capability is eval-driven development. We maintain evaluation datasets for each agent system and run them on every deployment. If accuracy drops below a threshold, the deployment stops. This is the agent equivalent of unit tests, and it’s the single most important practice for shipping reliable agent systems.
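
The deployment gate described above can be sketched in a few lines. The exact-match scoring, the four-case dataset, the stub agent, and the 0.9 threshold are all assumptions for illustration; real evals usually need fuzzier scoring than string equality.

```python
from typing import Callable, Dict, List, Tuple


class EvalGateError(RuntimeError):
    """Raised to stop a deployment when accuracy is below the threshold."""


def accuracy(agent: Callable[[str], str], dataset: List[Tuple[str, str]]) -> float:
    """Fraction of eval cases where the agent's answer matches exactly."""
    passed = sum(1 for prompt, expected in dataset if agent(prompt) == expected)
    return passed / len(dataset)


def gate(agent: Callable[[str], str], dataset: List[Tuple[str, str]], threshold: float) -> float:
    """Return the score if it clears the threshold, otherwise raise."""
    score = accuracy(agent, dataset)
    if score < threshold:
        raise EvalGateError(f"accuracy {score:.2f} below threshold {threshold:.2f}")
    return score


# Hypothetical eval set and a stub agent that gets one case wrong.
DATASET = [
    ("refund window?", "30 days"),
    ("support hours?", "9-5 weekdays"),
    ("ships to EU?", "yes"),
    ("free returns?", "yes"),
]
CANNED: Dict[str, str] = {
    "refund window?": "30 days",
    "support hours?": "9-5 weekdays",
    "ships to EU?": "yes",
}


def stub_agent(prompt: str) -> str:
    return CANNED.get(prompt, "unknown")


try:
    gate(stub_agent, DATASET, threshold=0.9)
    deployed = True
except EvalGateError as err:
    deployed = False
    print(f"deployment blocked: {err}")
```

In a CI pipeline the `except` branch would exit non-zero so the release step never runs.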

The Economics: FinOps for Agents

Running agent fleets that make thousands of LLM calls daily gets expensive fast. The Plan-and-Execute pattern — where a capable model creates a strategy and a cheaper model executes it — cut our API costs by 60% on one project.
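
A minimal sketch of the Plan-and-Execute shape, with stub functions standing in for the two models. The `planner` and `executor` functions, the hard-coded plan, and the call counter are hypothetical; in a real system each would wrap a call to a different model tier.

```python
from typing import Callable, Dict, List

calls: Dict[str, int] = {"planner": 0, "executor": 0}


def planner(task: str) -> List[str]:
    """Stub for the capable, expensive model: returns an ordered plan."""
    calls["planner"] += 1
    return [
        f"look up order for: {task}",
        "check payment status",
        "draft customer reply",
    ]


def executor(step: str) -> str:
    """Stub for the cheaper model: carries out a single step."""
    calls["executor"] += 1
    return f"done: {step}"


def plan_and_execute(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], str],
) -> List[str]:
    steps = plan(task)                         # one expensive call
    return [execute(step) for step in steps]   # one cheap call per step


results = plan_and_execute("late delivery complaint", planner, executor)
print(calls)
print(results)
```

The saving comes from the call mix: the expensive model is invoked once per task, while the per-step work, which dominates the call count, goes to the cheaper tier.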

Other patterns that help:

  • Semantic caching: If the same question, or one close enough in meaning, has been asked before, return the cached answer instead of re-running the agent. We use Redis with embedding-based similarity matching.
  • Token budgets per run: Each agent invocation has a maximum token spend. If the agent is spiralling (calling 40 tools instead of 4), the run terminates with a partial result.
  • Model tiering: Route requests based on complexity. A “what’s the weather?” question doesn’t need GPT-4o.
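
Two of the patterns above, semantic caching and per-run token budgets, can be sketched with the standard library alone. This is an in-memory stand-in, not the Redis setup mentioned above: the two-dimensional "embeddings", the 0.95 similarity threshold, the 1,000-token budget, and all class names are assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached answer when a new query embedding is close enough."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: List[Tuple[Sequence[float], str]] = []

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self.entries.append((embedding, answer))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        best_score, best_answer = 0.0, None
        for cached, answer in self.entries:
            score = cosine(cached, embedding)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


class BudgetExceeded(RuntimeError):
    pass


class TokenBudget:
    """Caps total token spend for a single agent run."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.spent + tokens} > {self.max_tokens}")
        self.spent += tokens


# Toy embeddings: real ones come from an embedding model.
cache = SemanticCache()
cache.put([1.0, 0.0], "Use the reset link in Settings.")
hit = cache.get([0.99, 0.05])   # near-duplicate question
miss = cache.get([0.0, 1.0])    # unrelated question

# A run that would need 1,200 tokens against a 1,000-token budget.
budget = TokenBudget(max_tokens=1000)
partial: List[str] = []
for step, tokens in [("search", 400), ("fetch", 400), ("summarise", 400)]:
    try:
        budget.charge(tokens)
    except BudgetExceeded:
        break  # stop the run and return what we have so far
    partial.append(step)

print(hit, miss, partial)
```

The similarity threshold is the main tuning knob for the cache: too low and users get answers to questions they did not ask, too high and the cache rarely hits.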

Gartner’s 40% Prediction

Gartner predicts 40% of enterprise applications will include task-specific AI agents by end of 2026, up from less than 5% in 2025. That’s aggressive, but the trajectory is real. The companies we work with are moving past “should we build agents?” and into “how do we build agents that don’t break in production?”

The gap is not in model capability. Models are good enough. The gap is in engineering: orchestration that handles failures gracefully, observability that makes non-deterministic systems debuggable, cost controls that keep budgets predictable, and evaluation frameworks that catch regressions before users do.

That’s the agent stack in 2026. It’s not one framework or one model. It’s a discipline — with protocols (MCP, A2A), orchestration (LangGraph), operational tooling (LangSmith), and engineering practices (evals, FinOps, tiered routing) that together make autonomous AI systems reliable enough to trust in production.

