
The Guardrails Problem: When Your AI Needs to Be Told No

March 9, 2026 · AI Architecture · Reliability

Here's something nobody talks about when they're selling you on AI: the model will do whatever you ask. That sounds like a feature. It's actually a problem.

An AI assistant without guardrails is like giving an intern full admin access on their first day. They're smart, they're eager, and they will absolutely destroy something important if you don't set boundaries.

I learned this the hard way. Multiple times.

The Email Incident

Early in Mimir's development, I asked it to "clean up my inbox." I meant: flag important emails, archive the newsletters, maybe draft some quick replies for me to review. What it interpreted was: process every email in my inbox and take action. It started archiving emails I hadn't read. It drafted replies and — because I hadn't implemented a confirmation step — it sent them.

Nothing catastrophic happened, thankfully. But it could have. And the model wasn't wrong, technically. I said "clean up my inbox" and it cleaned up my inbox. The problem was that I hadn't defined what "clean up" meant, and I hadn't built any mechanism to prevent irreversible actions without confirmation.

The model will always try to be helpful. Your job is to define the boundaries of what "helpful" means.

Three Layers of Guardrails

After several incidents like this, I built a three-layer guardrail system. Each layer catches a different class of problem.

Layer 1: Permission Classification

Every tool in Mimir is classified as either "automatic" or "ask first." Automatic tools are read operations — searching, listing, retrieving. They can't change anything, so they're safe to execute without confirmation. Ask-first tools are write operations — sending emails, creating events, deleting files. These require explicit user approval before execution.

This sounds obvious, but most AI applications don't do it. They either require confirmation for everything (annoying and slow) or confirm nothing (dangerous). The classification approach gives you speed for safe operations and safety for dangerous ones.
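The classification can live right next to the tool definitions, so no tool exists without a declared permission level. Here's a minimal sketch of that idea; the names (`Tool`, `register`, `dispatch`) and the example tools are illustrative assumptions, not Mimir's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    fn: Callable[..., str]
    permission: str  # "automatic" (read-only) or "ask_first" (writes)

REGISTRY: dict[str, Tool] = {}

def register(name: str, permission: str):
    """Decorator: a tool can't be registered without a permission level."""
    def wrap(fn):
        REGISTRY[name] = Tool(name, fn, permission)
        return fn
    return wrap

@register("search_email", permission="automatic")
def search_email(query: str) -> str:
    return f"results for {query!r}"

@register("send_email", permission="ask_first")
def send_email(to: str, body: str) -> str:
    return f"sent to {to}"

def dispatch(name: str, confirmed: bool = False, **kwargs) -> str:
    """Automatic tools run immediately; ask-first tools need approval."""
    tool = REGISTRY[name]
    if tool.permission == "ask_first" and not confirmed:
        return f"CONFIRM REQUIRED: {name}({kwargs})"
    return tool.fn(**kwargs)
```

With this shape, `dispatch("search_email", query="invoices")` executes immediately, while `dispatch("send_email", ...)` returns a confirmation prompt until the user approves.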

Layer 2: Output Validation

Before any tool result reaches the user, it passes through validation. Did the tool actually get called? Does the response match the expected schema? If the Planner said "create a calendar event" and the Worker's results don't contain a calendar API response, the Synthesizer catches the discrepancy.

This is the layer that catches hallucinated tool calls — the model claiming it did something it didn't actually do. Without this validation, the user would see a confident "Done! I've created your event" message for an event that doesn't exist.
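The check itself can be simple: compare the steps the Planner claimed against the results the Worker actually produced, and flag anything missing or malformed. A sketch, with hypothetical step names and required-field schemas standing in for whatever Mimir actually validates:

```python
# Required response fields per tool; hypothetical schemas for illustration.
REQUIRED_FIELDS = {
    "create_event": {"event_id", "start_time"},
    "send_email": {"message_id"},
}

def validate_results(plan: list[str], results: dict[str, dict]) -> list[str]:
    """Return a list of discrepancies; an empty list means the plan
    was actually fulfilled, not just claimed."""
    problems = []
    for step in plan:
        if step not in results:
            # The hallucinated-tool-call case: claimed but never executed.
            problems.append(f"{step}: claimed but never executed")
            continue
        missing = REQUIRED_FIELDS.get(step, set()) - results[step].keys()
        if missing:
            problems.append(f"{step}: response missing {sorted(missing)}")
    return problems
```

If the discrepancy list is non-empty, the Synthesizer can refuse to say "Done!" and instead report what actually happened.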

Layer 3: Behavioral Playbooks

These are scenario-specific rules that the Planner retrieves when it recognizes certain patterns. For example: if the user asks to delete something, always confirm and show what will be deleted. If the user asks to send a message to multiple people, show a preview first. If the user seems frustrated, don't make jokes.

Playbooks are stored as retrievable documents, not hardcoded in the system prompt. This means they only get loaded when relevant — they don't add noise to every request. And they can be updated without changing any code.
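Because playbooks are documents rather than prompt text, retrieval can be as simple or as sophisticated as you like. This sketch uses keyword triggers purely as a stand-in for real retrieval; the rules are paraphrased from the examples above and the structure is an assumption:

```python
import re

# Trigger pattern -> rule text. In practice these would be documents
# retrieved from a store, updatable without any code change.
PLAYBOOKS = {
    r"\b(delete|remove|erase)\b":
        "Always confirm deletions and show exactly what will be deleted.",
    r"\b(everyone|all recipients|the whole team)\b":
        "Show a preview before sending to multiple recipients.",
}

def retrieve_playbooks(request: str) -> list[str]:
    """Return only the rules whose trigger matches this request,
    so irrelevant playbooks never add noise to the prompt."""
    return [rule for pattern, rule in PLAYBOOKS.items()
            if re.search(pattern, request, re.IGNORECASE)]
```

A request like "delete my old drafts" pulls in the deletion rule; "what's on my calendar" pulls in nothing, keeping the prompt clean.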

The Balance

The hardest part of guardrails isn't building them. It's calibrating them. Too strict and the assistant becomes useless — every action requires three confirmations and the user gives up. Too loose and you're one ambiguous request away from an incident.

The principle I follow: be strict about irreversibility. If an action can be undone (creating a draft, adding a todo), lean toward automatic. If an action can't be undone (sending an email, deleting a file), always confirm. The cost of a false positive (unnecessary confirmation) is a minor annoyance. The cost of a false negative (unconfirmed destructive action) is real damage.
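That principle compresses into a few lines of policy. The action names here are hypothetical; the important part is the default, which errs toward confirmation for anything unrecognized:

```python
# Actions whose effects can be undone; safe to run without confirmation.
REVERSIBLE = {"create_draft", "add_todo", "archive_email"}

def needs_confirmation(action: str) -> bool:
    """Be strict about irreversibility: unknown actions default to
    confirmation, because a false positive is a minor annoyance and
    a false negative is real damage."""
    return action not in REVERSIBLE
```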

Guardrails aren't restrictions. They're trust infrastructure. Users trust systems that behave predictably, even if that means occasionally asking "are you sure?"

For Your AI Product

If you're building anything where AI takes actions on behalf of users, guardrails aren't optional. Here's the minimum:

  1. Classify every action by risk level. Read vs. write. Reversible vs. irreversible. Internal vs. external. Each classification determines the confirmation requirement.
  2. Validate outputs, not just inputs. Don't just check what the user asked for — check what the model actually did. These are often different.
  3. Build an undo mechanism. For actions that can be reversed, make reversal easy. This lets you be more permissive with confirmations because mistakes are recoverable.
  4. Log everything. Every tool call, every confirmation, every override. When something goes wrong — and it will — you need to know exactly what happened.
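Point 4 needs almost no machinery to start: an append-only structured log of every tool call, confirmation, and override is enough to reconstruct an incident. A minimal sketch, with illustrative field names and an in-memory list standing in for a durable file or table:

```python
import json
import time

AUDIT_LOG: list[str] = []  # in production: a durable file or table

def log_event(kind: str, tool: str, **details) -> None:
    """Append one structured, timestamped record per event."""
    AUDIT_LOG.append(json.dumps({
        "ts": time.time(), "kind": kind, "tool": tool, **details,
    }))

# The full lifecycle of one risky action leaves a reconstructable trail:
log_event("confirmation_requested", "send_email", to="a@example.com")
log_event("confirmation_granted", "send_email")
log_event("tool_call", "send_email", status="ok")
```

When something goes wrong, grepping this log answers "what exactly happened, and did the user approve it?" without guesswork.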

The goal isn't to prevent the AI from doing things. It's to make sure it does the right things, and asks before doing the risky ones.

Part of a series on building reliable AI applications. Previous: AI Memory Systems.

Also worth reading: The 3-Agent Pattern That Actually Works — the architecture that implements these guardrails in practice.