Agents2024-08-2211 min

Running Claude Code in Production: Architecture, Guardrails & Cost Control

Claude Code is the first coding agent that holds up across multi-day work in real repositories. The same properties that make it powerful — autonomy, tool access, long horizons — make it dangerous if you ship it like a chat box. This is how we run Claude Code fleets in production against monorepos with millions of lines of code, without merges going sideways.

[ TL;DR ]

[ 01 ]

The 4 properties every Claude Code deployment needs

Sandboxed execution, deterministic tool boundaries, full observability, and hard cost ceilings. Skip any of them and you will eventually wake up to a $40k weekend run, a force-pushed main branch, or a leaked secret.

Sandboxed execution — every agent runs in an ephemeral container with no host network and a read-only base image
Tool whitelist — only explicit tools (file edit, run tests, open PR) are exposed; shell escape is blocked
Observability — structured logs for every tool call, prompt, response, token spend and exit reason
Cost ceilings — per-task token and wall-clock budgets, enforced by the orchestrator, not by the model

[ 02 ]

Repository safety: the PR-only contract

Claude Code never pushes to a long-lived branch. Every run creates a fresh feature branch, opens a pull request, and stops. Humans review and merge. CI runs the same checks it would for a human contributor, including security scanning and license audit.

We additionally require a generated CHANGELOG entry and a short ‘reasoning trace’ summary on every PR so reviewers can audit the agent’s decisions, not just its diff.

[ 03 ]

Tool design: small, sharp, idempotent

The biggest leverage point on agent reliability is tool design. Each tool should do one thing, validate its inputs, and be safe to retry. Vague tools (‘run command’) invite hallucinated arguments and shell injection; sharp tools (‘run test file at path X with timeout Y’) collapse the failure surface.

We document tool contracts as JSON Schema and validate every call. Calls that fail validation are returned to the model with a structured error so it can self-correct without escalating to a human.

[ 04 ]

Evals: behavioral, not just functional

Functional tests prove the diff works. Behavioral evals prove the agent works. We maintain a corpus of ~200 realistic tasks per repository — bug fixes, refactors, doc updates, dependency upgrades — and replay them on every prompt or model change. We track success rate, average tokens, average wall clock, and human-review-required rate.

A new prompt template that bumps success by 4% but doubles token cost rarely ships. The eval suite makes that visible in minutes, not after a $20k surprise.

[ 05 ]

Cost control without crippling the agent

We attach a hard token budget and a soft ‘review checkpoint’ budget per task. At 60% of budget the agent must summarize progress and commit work-in-progress. At 100% the orchestrator kills the run and hands the partial PR to a human. This pattern catches runaway loops early and turns failed runs into useful artifacts instead of pure spend.

Cache aggressively. Prompt-cache the system message and repo map; that alone is typically a 35–50% cost reduction on long Claude Code sessions.

[ Key takeaways ]

01Sandbox, whitelist tools, observe everything, cap cost — non-negotiable
02Claude Code only ships through PRs reviewed by humans and CI
03Sharp, idempotent tools with JSON Schema validation beat generic shells
04Behavioral evals on a frozen task corpus prevent silent regressions

[ FAQ ]

Frequently asked questions

Can Claude Code replace junior engineers?

It absorbs the long tail of well-specified tasks — bug fixes, dependency upgrades, test coverage, refactors. It does not replace the judgment of designing a system, prioritizing a roadmap, or owning an incident.

How do you stop the agent from leaking secrets?

Secrets are injected into the sandbox at runtime, never present in the repo. The agent has no access to .env files at rest, outbound traffic is restricted to an allow-list of registries and APIs, and PRs are scanned for credentials before merge.

What does a typical Claude Code run cost?

On a mid-size codebase, a focused task (single bug fix or small feature) runs $1–$6 in API spend with caching. Multi-file refactors are typically $10–$40. Hard budgets prevent outliers.

[ Start your build ]

Deploy Claude Code in your stack — safely

We design the sandbox, the eval harness and the PR workflow that make autonomous coding agents trustworthy in real codebases.

Talk to an engineer