Agents | 2024-06-18 | 10 min

Designing Custom AI Agents That Actually Ship

Demo agents look magical and production agents look boring — and that is exactly the point. The agents we ship to customers do not improvise; they execute a constrained plan over typed tools with durable state, and they hand off to humans when the plan fails. This is how we design them.

[ 01 ]

The planner-executor loop

We split every agent into a planner and an executor. The planner reads the goal and current state, then emits a structured plan — a list of typed steps with arguments. The executor runs each step, captures the result, and feeds it back to the planner for the next decision.

This separation makes the agent debuggable. You can inspect plans, replay them, mock executors, and run A/B tests on planning prompts without touching the tools.
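
A minimal sketch of that loop, assuming a hard-coded stand-in for the planner (a real one would call a model and parse structured output). The tool names `lookup_order` and `draft_reply`, and the `Step` and `State` types, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    tool: str             # name of a registered tool
    args: dict[str, Any]  # arguments passed to that tool


@dataclass
class State:
    goal: str
    history: list[tuple[Step, Any]] = field(default_factory=list)


# Hypothetical tools; real ones would wrap APIs with validated arguments.
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"},
    "draft_reply": lambda status: f"Your order has {status}.",
}


def plan(state: State) -> list[Step]:
    """Stand-in planner: decides the next steps from the goal and history."""
    if not state.history:
        return [Step("lookup_order", {"order_id": "A-1"})]
    last_step, last_result = state.history[-1]
    if last_step.tool == "lookup_order":
        return [Step("draft_reply", {"status": last_result["status"]})]
    return []  # an empty plan means the task is finished


def execute(step: Step) -> Any:
    """Executor: runs exactly one typed step and returns its result."""
    if step.tool not in TOOLS:
        raise KeyError(f"unknown tool: {step.tool}")
    return TOOLS[step.tool](**step.args)


def run(goal: str, max_rounds: int = 10) -> State:
    state = State(goal)
    for _ in range(max_rounds):  # hard cap so a confused planner cannot loop forever
        steps = plan(state)
        if not steps:
            break
        for step in steps:
            state.history.append((step, execute(step)))
    return state
```

Because `plan` and `execute` only meet through `Step` values, either side can be swapped for a mock in tests, and `state.history` is the replayable record of what the agent did.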

[ 02 ]

Durable state with Temporal or LangGraph

Long-running agents on top of a chat-style API are fragile: a single timeout loses the entire conversation. We run agents as durable workflows on Temporal or LangGraph so state survives process restarts, failed steps are retried automatically, and the entire history is queryable. A retried step can run more than once, so side-effecting steps should be idempotent.

Durability is the single biggest difference between an impressive demo and a system you can put on a customer-facing path.
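
To make the idea concrete, here is a small standard-library sketch of the underlying mechanism: record each step's result in a persistent log and, on restart, replay recorded results instead of re-running the step. This is not the Temporal or LangGraph API; `open_log`, `run_durably`, and the table layout are assumptions made for illustration.

```python
import json
import sqlite3
from typing import Any, Callable


def open_log(path: str) -> sqlite3.Connection:
    """Open (or create) the step log. Use a file path for real persistence."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " run_id TEXT, idx INTEGER, result TEXT,"
        " PRIMARY KEY (run_id, idx))"
    )
    return conn


def run_durably(
    conn: sqlite3.Connection,
    run_id: str,
    steps: list[Callable[[list[Any]], Any]],
) -> list[Any]:
    """Run steps in order, skipping any whose result is already recorded."""
    results: list[Any] = []
    for idx, step in enumerate(steps):
        row = conn.execute(
            "SELECT result FROM steps WHERE run_id = ? AND idx = ?",
            (run_id, idx),
        ).fetchone()
        if row is not None:
            results.append(json.loads(row[0]))  # replay from the log
            continue
        result = step(results)  # if this raises, nothing is recorded
        conn.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (run_id, idx, json.dumps(result)),
        )
        conn.commit()
        results.append(result)
    return results
```

Calling `run_durably` again with the same `run_id` after a crash resumes at the first unrecorded step. A step that fails after its side effect but before its result is committed will run again, which is why side-effecting steps should be idempotent.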

[ 03 ]

Typed tools, contract-tested

Every tool the agent can call is defined by a JSON Schema and unit-tested independently. The agent never composes shell commands or builds SQL strings — those are explicit tools with validated arguments. This collapses the attack surface and makes regressions diagnosable.

We keep tool surfaces small. A 6-tool agent that succeeds 95% of the time beats a 40-tool agent that succeeds 60% of the time, every day of the week.
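
As an illustration, a tool definition and argument check might look like the sketch below. The `issue_refund` tool and its fields are hypothetical, and `validate_args` covers only the handful of JSON Schema keywords used here; in practice a full validator library such as `jsonschema` is the safer choice.

```python
from typing import Any

# Hypothetical tool definition; the name and fields are illustrative.
REFUND_TOOL = {
    "name": "issue_refund",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "amount_cents": {"type": "integer", "minimum": 1},
        },
        "required": ["order_id", "amount_cents"],
        "additionalProperties": False,
    },
}

_PY_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Return a list of violations; an empty list means the arguments are valid.

    Handles only: properties, required, additionalProperties, type, minimum.
    """
    errors: list[str] = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected argument: {name}")
            continue
        spec = props[name]
        expected = _PY_TYPES[spec["type"]]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if not isinstance(value, expected) or is_stray_bool:
            errors.append(f"{name}: expected {spec['type']}")
        elif "minimum" in spec and value < spec["minimum"]:
            errors.append(f"{name}: must be >= {spec['minimum']}")
    return errors
```

The executor would call `validate_args` before dispatching to the tool and refuse to run on any violation. The same schema and the same checks can drive the tool's unit tests, so a contract change shows up as a test failure rather than a production surprise.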

[ 04 ]

Memory: three layers, no magic

Working memory is the active conversation and current plan. Episodic memory is the durable log of past tasks, queried by similarity when relevant. Semantic memory is the agent’s long-term knowledge — usually a RAG index over the team’s documents and prior outputs.

Avoid the temptation to dump everything into one vector store. The three layers serve different needs and degrade differently under load.
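
A rough sketch of the three layers as separate stores with separate interfaces. All class and method names are hypothetical, and word overlap stands in for embedding similarity purely to keep the example self-contained.

```python
from dataclasses import dataclass, field


def _overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class WorkingMemory:
    """Active conversation and current plan; discarded when the task ends."""
    messages: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)


@dataclass
class EpisodicMemory:
    """Append-only log of past tasks, recalled by similarity to the new task."""
    episodes: list[dict] = field(default_factory=list)

    def record(self, task: str, outcome: str) -> None:
        self.episodes.append({"task": task, "outcome": outcome})

    def recall(self, task: str, k: int = 1) -> list[dict]:
        ranked = sorted(
            self.episodes,
            key=lambda e: _overlap(task, e["task"]),
            reverse=True,
        )
        return ranked[:k]


@dataclass
class SemanticMemory:
    """Long-term knowledge: document id -> text, searched by query."""
    documents: dict[str, str] = field(default_factory=dict)

    def search(self, query: str, k: int = 1) -> list[str]:
        ranked = sorted(
            self.documents,
            key=lambda doc_id: _overlap(query, self.documents[doc_id]),
            reverse=True,
        )
        return ranked[:k]
```

Keeping the stores separate means each can have its own retention policy, index, and failure mode: working memory can be dropped freely, the episodic log only grows, and the semantic index can be rebuilt from source documents.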

[ 05 ]

Human handoff is a first-class feature

Every production agent ships with explicit confidence thresholds and escalation paths. When the planner is uncertain, the agent pauses the workflow, posts the current state to a human reviewer, and resumes from the reviewer’s decision.

Agents that never hand off are agents that eventually do something embarrassing in front of a customer.
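
The pause-and-resume flow can be sketched as a small state machine. The threshold value, the `Proposal` and `Workflow` types, and the status names are assumptions for illustration; how the confidence score is produced and calibrated is left out.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # illustrative value; tune per task and risk level


@dataclass
class Proposal:
    action: str
    confidence: float  # planner's score in [0, 1]; source of the score is assumed


@dataclass
class Workflow:
    status: str = "running"  # running | awaiting_review | done | rejected
    pending: Optional[Proposal] = None
    executed: Optional[str] = None


def advance(wf: Workflow, proposal: Proposal) -> Workflow:
    """Execute confident proposals; park uncertain ones for a human."""
    if proposal.confidence < CONFIDENCE_THRESHOLD:
        wf.status, wf.pending = "awaiting_review", proposal
        return wf
    wf.status, wf.executed = "done", proposal.action
    return wf


def resume(wf: Workflow, approved: bool, override: Optional[str] = None) -> Workflow:
    """Continue from the reviewer's decision, optionally with a corrected action."""
    if wf.status != "awaiting_review" or wf.pending is None:
        raise ValueError("workflow is not awaiting review")
    if approved:
        wf.status, wf.executed = "done", override or wf.pending.action
    else:
        wf.status = "rejected"
    wf.pending = None
    return wf
```

For example, `advance(Workflow(), Proposal("refund 500", 0.55))` parks the workflow in `awaiting_review`, and `resume(wf, approved=True, override="refund 250")` completes it with the reviewer's corrected action. In a durable workflow the parked state is persisted, so the review can take minutes or days.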

[ Key takeaways ]

01 Split planner and executor: it is the foundation of a debuggable agent
02 Run agents as durable workflows so state survives restarts and retries
03 Define every tool with JSON Schema; keep the toolset small and sharp
04 Three memory layers (working, episodic, semantic) outperform one giant index

[ FAQ ]

Frequently asked questions

Which agent framework should I use?

LangGraph for Python-heavy teams that want durable, branchable workflows. Temporal for orgs already running on Temporal. The framework matters less than the discipline of typed tools and durable state.

How do you evaluate an agent?

A frozen task corpus with golden outputs, an LLM-as-judge for nuanced cases, and a human-review-required rate as a guardrail. Every prompt or model change replays the corpus and posts a diff before merge.
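
A stripped-down sketch of the replay-and-diff step. The corpus entries, the `ESCALATE` sentinel for tasks that should go to a human, and the function names are all hypothetical; exact-match scoring stands in for the LLM-as-judge, which is omitted here.

```python
from typing import Callable

# Hypothetical frozen corpus; a real one would be versioned alongside the prompts.
CORPUS = [
    {"id": "t1", "input": "2+2", "golden": "4"},
    {"id": "t2", "input": "capital of France", "golden": "Paris"},
    {"id": "t3", "input": "ambiguous request", "golden": "ESCALATE"},
]


def replay(agent: Callable[[str], str]) -> dict[str, bool]:
    """Run every task and record whether the output matches the golden answer."""
    return {t["id"]: agent(t["input"]).strip() == t["golden"] for t in CORPUS}


def diff(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Tasks that flipped between the baseline run and the candidate run."""
    return {
        "regressed": sorted(t for t in baseline if baseline[t] and not candidate[t]),
        "fixed": sorted(t for t in baseline if not baseline[t] and candidate[t]),
    }


def review_rate(agent: Callable[[str], str]) -> float:
    """Fraction of tasks the agent hands to a human (assumed sentinel: ESCALATE)."""
    outputs = [agent(t["input"]).strip() for t in CORPUS]
    return sum(o == "ESCALATE" for o in outputs) / len(outputs)
```

A merge check would run `replay` against both the current and the proposed prompt, post the `diff` output on the change, and fail if `regressed` is non-empty or `review_rate` moves past an agreed bound.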

When should I NOT build an agent?

When the task is a single LLM call. Wrapping a function call in an agent adds latency, cost and failure modes for no gain. Agents earn their complexity on multi-step tasks with state.
