Guardrails.md - Safety Protocol for Autonomous Agents

The problem: agents fail stochastically

Unlike compilers that fail deterministically on syntax errors, AI agents fail stochastically. The same prompt might succeed nine times and catastrophically fail on the tenth.

"The Gutter"

In iterative loops (Ralph Loop, Antigravity swarms, Claude Code sessions), agents feed their own outputs back into their context window. Over time, this accumulates debris: failed attempts, error logs, hallucinated reasoning.

Once the context fills with pollution, the agent prioritizes recent failures over original instructions. It enters a recursive loop, repeating the same mistakes. This state is "The Gutter."

Example: The infinite library loop

Iteration 1: Uses crypto library → Error: not found
Iteration 2: Tries to install crypto → Package doesn't exist
Iteration 3: Tries crypto-js → Different API
Iteration 4-20: Cycles between variants
              Context now 90% error logs
              Original objective forgotten

Core issue: Agents have no persistent memory. Each reset wipes lessons learned. Without externalized state, they repeat failures indefinitely.

The protocol: persistent state

Architecture principles

Persistent storage: Lives in project repository
Force-read on init: Agent must read at start of every iteration
Append-only learning: New failures append new guardrails
Human-readable: Developers can audit and edit

When it's created

The file is generated dynamically when the agent detects a failure pattern, or when a developer adds an entry by hand:

Repeated identical errors (3+ times)
Circular tool call loops
Context pollution threshold exceeded
Manual "escape hatch" by developer
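The first two triggers can be checked mechanically. The sketch below is one possible approach, not part of any tool named in this guide: `findRepeatedError`, `findToolCallCycle`, and the normalization rule are hypothetical, and only the "3+" threshold comes from the list above.

```javascript
// Hypothetical helpers; only the 3+ threshold is taken from the guide.
const REPEAT_THRESHOLD = 3;

// Mask digits and collapse whitespace so the "same" error with a different
// line number or timestamp still counts as identical.
function normalizeError(message) {
  return message.toLowerCase().replace(/\d+/g, '#').replace(/\s+/g, ' ').trim();
}

// Returns the first normalized error seen `threshold` times, or null.
function findRepeatedError(errors, threshold = REPEAT_THRESHOLD) {
  const counts = new Map();
  for (const message of errors) {
    const key = normalizeError(message);
    const count = (counts.get(key) || 0) + 1;
    counts.set(key, count);
    if (count >= threshold) return key;
  }
  return null;
}

// Returns the repeating pattern if the most recent tool calls consist of one
// short sequence repeated `repeats` times in a row, otherwise null.
function findToolCallCycle(calls, maxCycleLength = 4, repeats = 3) {
  for (let len = 1; len <= maxCycleLength; len++) {
    const needed = len * repeats;
    if (calls.length < needed) continue;
    const tail = calls.slice(-needed);
    const pattern = tail.slice(0, len);
    if (tail.every((call, i) => call === pattern[i % len])) return pattern;
  }
  return null;
}
```

Either function returning a non-null value would be a reasonable point to append a new Sign or force a context rotation.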

Example GUARDRAILS.md

# GUARDRAILS.md
# Persistent safety constraints

## Meta
Created: 2026-01-27
Total Signs: 3

---

## SIGN #1
**Trigger:** Using `crypto` library
**Instruction:** Use `bcrypt` for password hashing
**Reason:** crypto package doesn't exist
**Provenance:** Iteration 4 failure

---

## SIGN #2
**Trigger:** Modifying database schema
**Instruction:** ALWAYS create migration file
**Reason:** Direct changes caused data loss
**Provenance:** Manual intervention, 2026-01-27

---

## SIGN #3
**Trigger:** External API calls
**Instruction:** Wrap ALL calls in try-catch with backoff
**Reason:** Stripe timeout caused 2-hour outage
**Provenance:** Iteration 12 failure
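Because each Sign in the example file follows the same layout, it can be read back into structured data. The following is a minimal sketch that assumes exactly the heading and field format shown above; `parseSigns` is a hypothetical helper, not an API of any tool mentioned here.

```javascript
// Hypothetical parser for the layout used in the example file above.
function parseSigns(markdown) {
  const signs = [];
  let current = null;
  for (const line of markdown.split(/\r?\n/)) {
    // A "## SIGN #n" heading starts a new Sign.
    const heading = line.match(/^## SIGN #(\d+)\s*$/);
    if (heading) {
      current = { id: Number(heading[1]) };
      signs.push(current);
      continue;
    }
    // "**Field:** value" lines fill in the Sign that is currently open.
    const field = line.match(/^\*\*(Trigger|Instruction|Reason|Provenance):\*\*\s*(.*)$/);
    if (field && current) {
      current[field[1].toLowerCase()] = field[2].trim();
    }
  }
  return signs;
}
```

Lines outside a Sign, such as the Meta section, are ignored.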

Signs architecture

A "Sign" is a discrete unit of learned safety constraint. Each must contain:

Component	Purpose
Trigger	Context that precedes the error
Instruction	Deterministic command to prevent it
Reason	Why this guardrail exists
Provenance	When/how it was added
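The four-component rule can be enforced before a Sign is written to the file. This is an illustrative sketch only; `validateSign` and the lowercase field names are assumptions, not something the guide defines.

```javascript
// Hypothetical check that a Sign carries all four required components.
const REQUIRED_FIELDS = ['trigger', 'instruction', 'reason', 'provenance'];

// Returns a list of problems; an empty list means the Sign is complete.
function validateSign(sign) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    const value = sign[field];
    if (typeof value !== 'string' || value.trim() === '') {
      problems.push(`missing ${field}`);
    }
  }
  return problems;
}
```

A check like this catches incomplete Signs, but it cannot judge whether an instruction is specific enough; that still needs human review.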

Good vs bad signs

❌ Bad (too vague)

Instruction: "Be careful with auth"

Problem: No actionable directive

✓ Good (specific & actionable)

Trigger: Implementing authentication

Instruction: Use bcrypt with 12 salt rounds. Never store plaintext passwords. Always require HTTPS.

Reason: Plaintext passwords exposed in 2025 audit

Provenance: Manual addition, 2026-01-15

Universal safety patterns

Pattern 1: Artifact verification

Before destructive operations, create a plan for human review.

**Trigger:** Deletes, drops, or prod modifications
**Instruction:**
1. Generate plan.md describing all changes
2. Wait for human approval
3. Log all actions to audit.log
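One way to sketch this gate, assuming the caller supplies its own `execute` and `audit` callbacks and an `approved` flag set by a human. The pattern list and all function names are hypothetical, and writing the plan to plan.md and the log to audit.log is left to the caller.

```javascript
// Hypothetical patterns for "destructive"; extend these for your own stack.
const DESTRUCTIVE_PATTERNS = [
  /\bdrop\s+(table|database)\b/i,
  /\bdelete\s+from\b/i,
  /\btruncate\b/i,
  /\brm\s+-rf\b/i,
];

function isDestructive(command) {
  return DESTRUCTIVE_PATTERNS.some((pattern) => pattern.test(command));
}

// Builds the text the caller would save as plan.md for review.
function buildPlan(commands) {
  const lines = ['# Plan', '', 'Awaiting human approval before execution.', ''];
  commands.forEach((command, index) => {
    const flag = isDestructive(command) ? ' (destructive)' : '';
    lines.push(`${index + 1}. ${command}${flag}`);
  });
  return lines.join('\n');
}

// Runs nothing until a human has approved a batch containing destructive steps.
function runWithApproval(commands, { approved, execute, audit }) {
  if (commands.some(isDestructive) && !approved) {
    audit(`BLOCKED pending approval: ${commands.length} command(s)`);
    return { executed: false, plan: buildPlan(commands) };
  }
  for (const command of commands) {
    execute(command);
    audit(`EXECUTED ${command}`);
  }
  return { executed: true, plan: null };
}
```

The whole batch is held back, not just the destructive step, so that a half-applied plan never reaches production.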

Pattern 2: Context rotation

Periodically reset to prevent pollution.

**Trigger:** Context >80% capacity OR 10+ consecutive errors
**Instruction:**
1. Save state to context-snapshot.md
2. Summarize key learnings
3. Reset context window
4. Re-inject: GUARDRAILS.md + summary + objective
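The trigger and the re-injection step could look roughly like this. The token counts, the `ROTATION_LIMITS` object, and both function names are assumptions for illustration; how the summary is produced and how the context is actually reset depends on your agent framework.

```javascript
// Thresholds from the pattern above; the names are hypothetical.
const ROTATION_LIMITS = { maxContextRatio: 0.8, maxConsecutiveErrors: 10 };

// True when the context is over 80% full or 10+ errors occurred in a row.
function shouldRotate({ usedTokens, maxTokens, consecutiveErrors }, limits = ROTATION_LIMITS) {
  return (
    usedTokens / maxTokens > limits.maxContextRatio ||
    consecutiveErrors >= limits.maxConsecutiveErrors
  );
}

// Builds the fresh context: guardrails first, then the summary, then the objective.
function rebuildContext({ guardrails, summary, objective }) {
  return [
    '# Guardrails',
    guardrails,
    '# Summary of previous attempts',
    summary,
    '# Objective',
    objective,
  ].join('\n\n');
}
```

Placing the guardrails first is a design choice in this sketch, so the constraints are read before anything carried over from the polluted context.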

Pattern 3: Privilege boundaries

Define what agents CAN and CANNOT touch.

**Allowed:**
- Read: /src/**, /tests/**
- Write: /src/** (with review)

**Forbidden:**
- /node_modules/**
- /.git/**
- Database migrations (require approval)
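A boundary like this is only useful if it is checked before each file operation. The sketch below is one hypothetical way to do that with plain prefix matching; `POLICY`, `normalizePath`, and `isAllowed` are illustrative names, paths are treated as relative to the project root, and the "with review" and "require approval" conditions are not modelled.

```javascript
// Hypothetical policy mirroring the lists above (paths relative to project root).
const POLICY = {
  read: ['src/', 'tests/'],
  write: ['src/'],
  forbidden: ['node_modules/', '.git/'],
};

// Resolves "." and ".." so that "src/../.git/config" cannot slip through.
// Returns null when the path escapes the project root.
function normalizePath(path) {
  const parts = [];
  for (const segment of path.replace(/\\/g, '/').split('/')) {
    if (segment === '' || segment === '.') continue;
    if (segment === '..') {
      if (parts.length === 0) return null;
      parts.pop();
    } else {
      parts.push(segment);
    }
  }
  return parts.join('/');
}

// Forbidden prefixes always win; anything not explicitly allowed is denied.
function isAllowed(action, path, policy = POLICY) {
  if (action !== 'read' && action !== 'write') return false;
  const normalized = normalizePath(path);
  if (normalized === null) return false;
  const inside = (prefixes) => prefixes.some((prefix) => normalized.startsWith(prefix));
  if (inside(policy.forbidden)) return false;
  return inside(policy[action]);
}
```

Denying by default means a path the policy never mentions is treated as off limits, which is the safer failure mode for an autonomous agent.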

Pattern 4: Rate limiting

Prevent runaway loops.

**Limits:**
- Max 10 tool calls per iteration
- Max 50 tool calls per session
- If reached: Force context rotation
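These limits amount to two counters. A minimal sketch, assuming the agent loop calls `record()` before every tool call and `startIteration()` at the top of each iteration; the `ToolCallBudget` class is hypothetical.

```javascript
// Hypothetical budget tracker using the limits listed above.
class ToolCallBudget {
  constructor({ perIteration = 10, perSession = 50 } = {}) {
    this.perIteration = perIteration;
    this.perSession = perSession;
    this.iterationCalls = 0;
    this.sessionCalls = 0;
  }

  // Resets only the per-iteration counter; the session total keeps growing.
  startIteration() {
    this.iterationCalls = 0;
  }

  // Returns 'ok' if the call may proceed, or 'rotate' once a limit is reached.
  record() {
    if (this.iterationCalls >= this.perIteration || this.sessionCalls >= this.perSession) {
      return 'rotate';
    }
    this.iterationCalls += 1;
    this.sessionCalls += 1;
    return 'ok';
  }
}
```

A `'rotate'` result is the signal to force the context rotation described in Pattern 2 instead of making the call.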

Implementation guide

For Claude Code

Create GUARDRAILS.md in your project root
Claude Code automatically loads CLAUDE.md at the start of a session, but not arbitrary files, so GUARDRAILS.md must be referenced from there

Reference it explicitly in your CLAUDE.md:

## Safety
You MUST read and follow all constraints in GUARDRAILS.md.
These are lessons learned from past failures.
Never violate a SIGN without explicit human approval.

For Google Antigravity

Create GUARDRAILS.md in project root

In AGENTS.md or GEMINI.md, add:

## Critical Instructions
You MUST read GUARDRAILS.md at start of every task.
Treat all SIGNS as immutable constraints.
If you violate a SIGN, stop and report.

Configure agent approval for violations

For Ralph Loop

Native support via .ralph/GUARDRAILS.md:

{
  "guardrails": {
    "enabled": true,
    "path": ".ralph/GUARDRAILS.md",
    "auto_append": true,
    "trigger_threshold": 3
  }
}

For custom agents

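import { readFile } from 'node:fs/promises';

// `agent`, `basePrompt`, `objective`, `requireHumanApproval`, and
// `appendNewSign` are placeholders for your own agent framework.
async function runAgentIteration() {
  // Force-read the guardrails at the start of every iteration.
  const guardrails = await readFile('GUARDRAILS.md', 'utf8');
  const context = {
    systemPrompt: basePrompt,
    guardrails,
    objective,
  };

  const result = await agent.execute(context);

  // Stop for a human decision when the result breaks a Sign.
  if (result.violatesGuardrail) {
    await requireHumanApproval(result);
  }

  // Record a newly detected failure pattern as a Sign.
  if (result.failurePattern?.detected) {
    await appendNewSign(result.failure);
  }
}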
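The `appendNewSign` call in the loop above is not defined in this guide. As a rough sketch of what its core could do, the hypothetical `appendSign` function below takes the current file contents and returns the updated text, assuming the layout of the example GUARDRAILS.md. Reading and writing the file is left to the caller.

```javascript
// Hypothetical helper: returns new file contents with one more Sign appended.
function appendSign(markdown, sign) {
  // Find existing Sign numbers so the new one gets the next id.
  const ids = [...markdown.matchAll(/^## SIGN #(\d+)[ \t]*$/gm)].map((m) => Number(m[1]));
  const nextId = ids.length === 0 ? 1 : Math.max(...ids) + 1;

  const block = [
    '---',
    '',
    `## SIGN #${nextId}`,
    `**Trigger:** ${sign.trigger}`,
    `**Instruction:** ${sign.instruction}`,
    `**Reason:** ${sign.reason}`,
    `**Provenance:** ${sign.provenance}`,
  ].join('\n');

  // Keep the Meta counter in step; existing Signs are never rewritten.
  const updated = markdown.replace(
    /^Total Signs: \d+[ \t]*$/m,
    `Total Signs: ${ids.length + 1}`
  );
  return `${updated.trimEnd()}\n\n${block}\n`;
}
```

Apart from the counter in the Meta section, the function only adds text at the end, which keeps the file append-only in the sense described under "Architecture principles".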

Pro tip: Start with 3-5 manually written Signs based on your coding standards. Let agents append new ones as they encounter edge cases.

Case studies: disasters prevented

The database migration disaster

What happened: An agent optimized queries by modifying the production schema directly, without migrations.

Impact: 4-hour downtime, data inconsistencies

Guardrail that prevented recurrence:

**Trigger:** Prisma schema modification
**Instruction:**
1. Create migration with `prisma migrate dev`
2. Test in staging
3. Require approval for production
4. Never use `prisma db push` in prod

The infinite API loop

What happened: An agent debugging a Stripe integration entered a loop of test calls, sending 2000+ requests in 30 minutes.

Impact: $200 API costs, account suspended

Guardrail:

**Trigger:** External API calls
**Instruction:**
- Max 10 calls per iteration
- After 3 errors, stop and request human help
- Always use test mode unless approved

The credential leak

What happened: An agent improved logging but wrote API keys to the logs in plain text.

Impact: Keys exposed in git, emergency rotation

Guardrail:

**Trigger:** Adding logging statements
**Instruction:**
- Never log: API keys, passwords, tokens, sessions
- Use redactSensitive() helper
- Audit all logs before committing
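The guide names a `redactSensitive()` helper without showing it. The version below is a guess at a minimal implementation for plain JSON-like data. The key pattern is an assumption, and it masks values by key name only, so a secret embedded inside a free-text string would not be caught.

```javascript
// Assumed list of key names that should never reach a log.
const SENSITIVE_KEY = /(api[-_]?key|password|passwd|secret|token|session|authorization)/i;

// Returns a deep copy with sensitive values masked; the input is not modified.
// Intended for plain objects and arrays without circular references.
function redactSensitive(value) {
  if (Array.isArray(value)) return value.map(redactSensitive);
  if (value !== null && typeof value === 'object') {
    const copy = {};
    for (const [key, inner] of Object.entries(value)) {
      copy[key] = SENSITIVE_KEY.test(key) ? '[REDACTED]' : redactSensitive(inner);
    }
    return copy;
  }
  return value;
}
```

Logging code would then pass its payload through the helper, for example `console.log(JSON.stringify(redactSensitive(payload)))`.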

Related protocols

System	Primary Use	Mechanism
GUARDRAILS.md	Autonomous coding agents	In-context learning via persistent file
NeMo Guardrails	Enterprise chatbots	Conversation flow control
Guardrails AI	Structured output validation	Pydantic schema enforcement
AGENTS.md	Agent behavior guidelines	System prompt injection

Resources

Claude Code — Anthropic's agentic coding tool
Antigravity.md — Google Antigravity IDE guide
AGENTS.md Spec — Universal agent instructions
Ralph Loop — Implementation reference
OWASP LLM Top 10

Frequently Asked Questions

What is GUARDRAILS.md?

GUARDRAILS.md is a file-based safety protocol for autonomous AI coding agents. It's a persistent document that captures lessons from failures and acts as the agent's memory across context resets, preventing the same mistakes from recurring.

Why do AI agents need guardrails?

Unlike traditional software that fails deterministically, AI agents fail stochastically — the same task might succeed nine times and fail catastrophically on the tenth. Without persistent memory, agents repeat mistakes across sessions. GUARDRAILS.md provides that memory.

What is "The Gutter"?

"The Gutter" is a failure mode where an agent's context window fills with error logs and failed attempts. The agent prioritizes recent failures over original instructions, entering a recursive loop of repeated mistakes. GUARDRAILS.md counters this by keeping learned constraints outside the context window and re-injecting them after every reset.

How do I use GUARDRAILS.md with Claude Code?

Create a GUARDRAILS.md file in your project root, then reference it from CLAUDE.md, which Claude Code loads automatically at the start of each session, so the agent treats it as mandatory reading.

What is a "Sign" in GUARDRAILS.md?

A Sign is a discrete unit of learned safety constraint. Each Sign has four components: Trigger (what context precedes the error), Instruction (how to prevent it), Reason (why this matters), and Provenance (when/how it was added).

Should I write Signs manually or let the agent create them?

Start with 3-5 manually written Signs based on your team's coding standards and known failure patterns. Then let the agent append new Signs as it encounters edge cases. This creates a living document that evolves with your project.

How is GUARDRAILS.md different from AGENTS.md?

AGENTS.md defines general behavior and preferences. GUARDRAILS.md captures specific failure patterns and safety constraints learned from actual mistakes. Think of AGENTS.md as "how to behave" and GUARDRAILS.md as "what not to do."

Can GUARDRAILS.md work with any AI coding tool?

Yes. While it originated in the Ralph Loop methodology, the concept works with Claude Code, Google Antigravity, Cursor, or any agentic system where you can inject persistent context. The implementation details vary by platform.

What are the most critical safety patterns?

The four universal patterns are: (1) Artifact Verification — human approval before destructive operations, (2) Context Rotation — periodic resets to prevent pollution, (3) Privilege Boundaries — explicit access controls, and (4) Rate Limiting — preventing runaway loops.

How do I know if my guardrails are working?

Monitor three metrics: (1) Repeated error rate (should decrease over time), (2) Context rotation frequency (should stabilize), and (3) Human intervention rate (should decrease for known patterns but remain high for novel situations).
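The first metric can be computed from a session's error log. This is a sketch under the assumption that errors are collected as an array of message strings; `repeatedErrorRate` is a hypothetical name and the definition (share of errors already seen earlier in the same session) is one reasonable reading of "repeated error rate", not a definition from this guide.

```javascript
// Share of errors in a session that had already occurred earlier in that session.
function repeatedErrorRate(errors) {
  if (errors.length === 0) return 0;
  const seen = new Set();
  let repeats = 0;
  for (const message of errors) {
    if (seen.has(message)) {
      repeats += 1;
    } else {
      seen.add(message);
    }
  }
  return repeats / errors.length;
}
```

Tracking this value per session and comparing it over time shows whether new Signs are actually reducing repeated failures.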