Planner, Worker, Synthesizer: The 3-Agent Pattern That Actually Works
It's just a chatbot.
That's what most people think when they see a conversational AI assistant. You type something, it responds. How hard can it be?
I've spent the better part of a year finding out exactly how hard it can be. The answer is: unreasonably hard. Not because the AI isn't smart enough - it is. But because getting it to reliably do what you ask, every single time, without hallucinating results it never actually produced, is an architectural problem that no amount of prompt engineering can solve.
This is the story of how I broke my AI assistant into three separate agents - and why that decision fixed the single biggest reliability problem I had.
Meet Mimir
Before I get into the architecture, let me give you the full picture. I built an AI personal assistant called Mimir. The name comes from Norse mythology - Mimir is considered the wisest being alive, the keeper of knowledge. If you've played God of War, you know him as the fast-talking severed head strapped to Kratos's belt. That's the personality I was going for - deeply knowledgeable, a little opinionated, and always ready with an answer.
Mimir is a multi-agent system with a Python/FastAPI backend and a React/TypeScript frontend. It runs on AWS Bedrock using Claude as the underlying model. It communicates over WebSocket, streams responses in real-time, and currently sits at around 46,000 lines of code.
Here's what it can do: manage my calendar, read and draft emails, create and track todos, take and search notes, read and write files, execute code in a sandboxed REPL, search the web, manage projects with full discovery workflows, run multi-agent research with source verification, and even teach me guitar through AI personas modeled after legends like Hetfield and Petrucci. Over 80 tools across 12 categories.
It also has a six-layer memory system - session RAM, session memory, semantic memory, persistent memory, conversation store, and project RAM - that gives it the ability to remember things across conversations, track context within a session, resolve references like "those three emails" to actual email IDs, and maintain working context that expires when it's no longer relevant.
But the memory and context management system is a story for another day. What I really want to talk about is the pillar that holds all of this together - the three-agent pipeline that routes every single user request through Mimir.
Where It Started
Like most people building AI applications, I started simple. One LLM call. User sends a message, the model gets a system prompt with tool definitions, it decides what to do, calls some tools, and responds. Straightforward. And it worked. For a while.
When the application was small - a handful of tools, short conversations, minimal context - the single-agent approach was perfectly fine. The model would read my message, pick the right tool, call it, get the result, and give me a coherent answer. I was happy. I thought I had it figured out. Then I kept building.
The Problem Nobody Warns You About
As Mimir grew, so did the context window. More tools meant longer tool definitions. More features meant a longer system prompt. Conversation history accumulated. Memory retrievals got injected. Project state got loaded. The context that the model had to process before even getting to my actual question kept growing and growing. And then the hallucinations started.
Not the kind where the model makes up facts - I could handle that. These were tool call hallucinations. The model would decide it needed to call a tool, and instead of actually calling it, it would fabricate the result. It would generate a perfectly formatted tool response that looked real but never happened. It would tell me it created a todo, checked my calendar, or sent an email - and none of it actually occurred.
The first few times, I thought it was a fluke. Then it became a pattern.
Here's what made it maddening: when I called it out - "you didn't actually call that tool" - it would apologize, acknowledge the mistake, and then do the exact same thing again. It would generate another fake tool call result. I'd call it out again. Same thing. Over and over.
I tried everything. I strengthened the system prompt. I added explicit instructions: "You MUST call tools. Do NOT fabricate results." I added validation checks. I reformatted the tool definitions. I tried different temperatures, different models, different prompt structures. Nothing worked. And the reason nothing worked is because the problem wasn't the instructions - it was the noise.
Think about it. By the time the model gets to my actual request, it has already processed thousands of tokens of system instructions, tool definitions, conversation history, memory context, and session state. My actual message - "create a reminder for tomorrow at 9am" - is buried at the bottom of all of that. The model isn't ignoring my instructions. It's drowning in context, and the signal-to-noise ratio has collapsed.
No amount of prompt engineering fixes a structural problem.
The Insight
The fix wasn't to make the model smarter or the prompts better. It was to give the model less to think about. If the problem is that the tool-calling agent is drowning in context that has nothing to do with tool calling, then the solution is to create an agent whose only job is to call tools - and give it nothing else. No conversation history. No memory retrievals. No session state. No user preferences. Just: here's what you need to do, here are the tools you can use, go. That's how the three-agent pipeline was born.
The Three Agents
Every single request in Mimir flows through three stages: Plan, Execute, Synthesize.
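To make the division of labor concrete, here's a toy sketch of the dataflow. Every name here is illustrative, not Mimir's actual code - the point is what crosses each stage boundary:

```python
# Toy dataflow sketch - all names are hypothetical stand-ins.
# What matters is the hand-off: full context stops at the Planner,
# and only a clean instruction plus a tool list reach the Worker.

def plan_stage(user_msg: str, full_context: dict) -> dict:
    # In the real system this is an LLM call with everything loaded.
    return {"instruction": f"Create reminder: {user_msg}",
            "approved_tools": ["reminder_create"]}

def execute_stage(instruction: str, approved_tools: list) -> list:
    # The Worker receives exactly two things - note: no context argument.
    return [{"tool": t, "output": "ok"} for t in approved_tools]

def synthesize_stage(plan: dict, results: list) -> str:
    # Pre-validate: did the expected tools actually run?
    called = {r["tool"] for r in results}
    assert set(plan["approved_tools"]) <= called, "expected tool never ran"
    return "Done - reminder created."

context = {"history": [], "memories": [], "session": {}}  # Planner-only
plan = plan_stage("tomorrow at 9am", context)
results = execute_stage(plan["instruction"], plan["approved_tools"])
reply = synthesize_stage(plan, results)
```

Notice that `execute_stage` has no way to even see the context dictionary - the separation is enforced by the function signatures, not by prompt instructions.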
The Planner
The Planner is the brain. It receives the full context - conversation history, memory retrievals, session state, user preferences, everything. Its job is to understand what the user wants and create a precise execution plan.
But here's the key: the Planner doesn't execute anything. It analyzes intent, identifies which tools are needed, figures out the right sequence, and packages all of that into clear instructions for the next agent. It runs at a low temperature for precision and has access to extended thinking - up to 32,000 tokens of internal reasoning before it even starts its response.
The Planner also has its own exclusive tools that no other agent can access: it can search my memories, search conversation history, and retrieve playbooks - behavioral rules for specific scenarios. These tools help it understand context and make better routing decisions, but they're read-only. The Planner observes and plans. It doesn't act.
It gets up to five rounds of tool calls to gather the context it needs before routing the request to an executor. Sometimes it needs to search my memories to understand a reference. Sometimes it needs to check conversation history to resolve ambiguity. But once it has enough context, it hands off a clean, specific instruction set.
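That hand-off can be as simple as a small structured object. Here's a hypothetical sketch - these field names are mine, not Mimir's:

```python
from dataclasses import dataclass, field

@dataclass
class ExecutionPlan:
    """Hypothetical shape of the Planner's hand-off: a resolved, specific
    instruction plus the approved tool list - and deliberately nothing else."""
    instruction: str
    approved_tools: list = field(default_factory=list)

# The Planner has already used its read-only tools to resolve "tomorrow
# at 9am" into a concrete date, so the Worker never needs the conversation.
plan = ExecutionPlan(
    instruction="Create a reminder titled 'Standup' at 2025-06-12T09:00",
    approved_tools=["reminder_create"],
)

# Nothing in the hand-off carries history, memory, or session state.
assert set(vars(plan)) == {"instruction", "approved_tools"}
```

The discipline lives in the schema: if the plan object has no field for conversation history, there's no way for that noise to leak downstream.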
The Worker
The Worker is the hands. It receives the Planner's instructions and the approved tool list - and nothing else. No conversation history. No memory context. No system state. Just: "Call the calendar tool to create an event titled X on date Y" or "Read the file at this path and analyze its contents."
This is the agent that was hallucinating before. And the reason it stopped hallucinating is because I removed everything that was causing the hallucinations. The Worker's context is clean. It sees its task, it sees its tools, and it executes. There's no noise to get lost in, no conversation history to confuse it, no accumulated context to degrade its performance.
The Worker operates in a loop - call a tool, evaluate the result, decide if more tool calls are needed, repeat. When it's done, it passes the raw results forward.
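That loop can be sketched in a few lines. The LLM client and tool registry here are stand-ins, not Mimir's real interfaces:

```python
def run_worker(instruction, approved_tools, tool_registry, llm, max_rounds=5):
    """Worker loop sketch (illustrative names). The message list starts
    empty of history: the instruction is the entire context."""
    messages = [{"role": "user", "content": instruction}]
    results = []
    for _ in range(max_rounds):
        reply = llm(messages, tools=approved_tools)  # stand-in LLM client
        if reply["type"] != "tool_call":
            break                                    # model says it's done
        output = tool_registry[reply["name"]](**reply["args"])  # really runs
        results.append({"tool": reply["name"], "output": output})
        messages.append({"role": "tool", "name": reply["name"],
                         "content": str(output)})
    return results

# Stub LLM: one tool call, then a final text turn.
turns = iter([
    {"type": "tool_call", "name": "calendar_create_event",
     "args": {"title": "Team sync"}},
    {"type": "text", "content": "done"},
])
registry = {"calendar_create_event": lambda title: f"created: {title}"}
results = run_worker("Create event 'Team sync'",
                     ["calendar_create_event"], registry,
                     llm=lambda messages, tools: next(turns))
```

Every token in `messages` traces back to either the instruction or a tool result - there's simply nothing else in scope for the model to get lost in.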
The Synthesizer
The Synthesizer is the voice. It takes the Worker's raw tool results and crafts the final response that the user sees. It runs at a higher temperature for more natural, conversational output. It handles formatting, tone, and presentation.
But here's the part I'm most proud of: the Synthesizer pre-validates. Before it even calls the LLM to generate a response, it checks whether the expected tool calls actually happened. If the Planner said "call the calendar tool" and the Worker's results don't contain a calendar tool response, the Synthesizer catches that before it can become a hallucinated answer.
This is the safety net. Even if something goes wrong in the Worker stage, the Synthesizer won't confidently present results that don't exist.
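A pre-validation check like this is cheap to sketch. Again, the names are hypothetical - I don't know Mimir's exact shapes:

```python
def validate_results(expected_tools, worker_results):
    """Pre-validation sketch: before generating any prose, check that every
    tool the Planner asked for actually produced a result."""
    called = {r["tool"] for r in worker_results}
    return [t for t in expected_tools if t not in called]  # empty = all good

# The Planner expected a calendar call; the Worker never made one.
missing = validate_results(
    expected_tools=["calendar_create_event"],
    worker_results=[{"tool": "todo_create", "output": "ok"}],
)
```

If `missing` is non-empty, the pipeline can retry or report the failure honestly - anything but letting the Synthesizer narrate work that never happened.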
Why This Works
The magic isn't in any individual agent. It's in what each agent doesn't see. The Planner sees everything but can't act. It has full context to make good decisions, but it can't execute tools that change state. This means it can't hallucinate tool results because it's not calling execution tools in the first place.
The Worker can act but sees almost nothing. It has the tools and the instructions, but no context to get confused by. The signal-to-noise ratio is essentially 1:1. Every token in its context is directly relevant to the task at hand.
The Synthesizer sees results but validates before trusting. It doesn't blindly format whatever it receives. It checks that the work actually happened.
Each agent is deliberately limited. And those limitations are the architecture.
Compare this to the single-agent approach: one model sees everything, decides everything, executes everything, and presents everything. It's like asking one person to simultaneously be the project manager, the engineer, and the communications lead - while also remembering every conversation they've ever had. Of course it breaks down.
The Results
After implementing this pipeline, tool call hallucinations dropped to effectively zero. Not reduced - eliminated. The Worker calls tools or it doesn't. There's no middle ground where it pretends to call a tool, because there's no conversational context pressuring it to appear helpful.
The system also became significantly easier to debug. When something goes wrong, I know exactly where to look. Bad plan? Planner issue. Tool call failed? Worker issue. Response doesn't match the results? Synthesizer issue. With a single agent, every failure was a mystery — was it the prompt? The context? The tool definition? The conversation history? Now each failure has an address.
And perhaps most importantly, each agent can be independently tuned. The Planner uses a precise, low-temperature configuration with extended thinking. The Worker uses whatever model and temperature are appropriate for the task. The Synthesizer uses a more creative configuration for natural responses. In a single-agent system, you pick one temperature and one model and hope it works for everything. It doesn't.
What This Means for You
If you're building AI applications - whether it's a customer-facing chatbot, an internal tool, or a product feature - and you're running into reliability issues with tool calling, I'd encourage you to look at your context before you look at your prompts.
Ask yourself: how much of what the model is processing is actually relevant to the task it's performing right now? If the answer is "maybe 20%," you have a noise problem. And noise problems are architectural problems.
You don't necessarily need three agents. But you probably need separation between planning and execution. The model that decides what to do should not share a context with the model that executes it - because the information needed to make good decisions is exactly the information that degrades execution.
Start simple. One agent is fine when your context is small. But the moment you notice the model "forgetting" to call tools, or fabricating results, or ignoring instructions that are clearly in the prompt - don't reach for a better prompt. Reach for a cleaner architecture.
The model is rarely the bottleneck. The architecture is.
This is the second article in a series about the architectural decisions behind building reliable AI applications. Read the first: When Your AI Agent Needs a Costume Change vs. a Whole New Actor. Originally published on LinkedIn.
Also worth reading: The Guardrails Problem — how to make the Worker agent safe — and Context Windows Are a Lie.