You're Automating Too Early

Every team building agents hits the same question eventually. The workflow you want an agent to run: should it be code, or should it be a markdown file you hand to Claude Code? Most teams reach for code by default. That default is wrong for a lot of the work.

Pick code and you’re patching a compiled agent whenever the world shifts. Pick a runbook and you edit a markdown file in five minutes.

This post defines the two patterns, lays out when to use each, and shows how to have Claude Code write a runbook for you so you don’t start from scratch.

What’s an Agent SDK flow

An Agent SDK flow is a compiled agent. You write code that wraps an LLM with a specific job. A narrow tool set. A structured loop with a maximum number of turns. A fixed system prompt describing the role and the rules.

The Claude Agent SDK, formerly the Claude Code SDK, gives you the Python or TypeScript primitives to build one. LangChain and LangGraph give you a different flavor. OpenAI’s Agents SDK gives you a third. They look roughly the same at the call site. You initialize a client. You hand it a tool set. You run a loop. You get a result.

Think of a job that reads inbound customer emails, classifies each one into a category, and writes a row to a database. The code is a few hundred lines of Python wrapped around fifteen lines of prompt. It runs on cron or on a queue. It does the same thing every time. That’s a flow.
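A minimal sketch of that shape, under loud assumptions: `classify_email` is a hypothetical stand-in for the model call (a real flow would send `SYSTEM_PROMPT` and the email body through whichever SDK you use), the category names are invented for illustration, and SQLite stands in for the database.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# Hypothetical category set; a real flow would define its own.
CATEGORIES = ("billing", "bug_report", "feature_request", "other")

# The fixed prompt a real flow would send to the model. The stub below ignores it.
SYSTEM_PROMPT = (
    "You classify inbound customer emails. "
    f"Reply with exactly one of: {', '.join(CATEGORIES)}."
)


def classify_email(body: str) -> str:
    """Placeholder for the LLM call: keyword rules so the sketch runs offline."""
    text = body.lower()
    if "invoice" in text or "refund" in text:
        return "billing"
    if "error" in text or "crash" in text:
        return "bug_report"
    return "other"


def run_flow(
    emails: Iterable[Tuple[str, str]],
    db: sqlite3.Connection,
    classify: Callable[[str], str] = classify_email,
) -> int:
    """Classify each (email_id, body) pair and write one row per email."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS email_labels "
        "(email_id TEXT PRIMARY KEY, category TEXT NOT NULL)"
    )
    written = 0
    for email_id, body in emails:
        category = classify(body)
        db.execute(
            "INSERT OR REPLACE INTO email_labels VALUES (?, ?)",
            (email_id, category),
        )
        written += 1
    db.commit()
    return written
```

The same steps run in the same order on every call. That predictability is the point of a flow, and it is also what makes it rigid.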

Flows are what you reach for when you know exactly what the work is and you want it to run without a human in the loop.

What’s a runbook

A runbook is a markdown document you hand to an AI coding tool, most commonly Claude Code, as a procedure. The agent reads the markdown and executes the steps against a wide-open tool surface. The file system, web search, APIs, shell commands, whatever Claude Code has access to in that session.

Structure is usually phases with prose explanation, decisions the agent might need to make, and places where a human should weigh in. Skills and slash commands in Claude Code are a formalized version of the same pattern. A folder of markdown the CLI can invoke with one line.
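To make that structure concrete, here is a small sketch that models a runbook as phases with explanation, decisions, and human checkpoints, then renders it to markdown. The `Phase` fields and the heading layout are assumptions for illustration, not a format Claude Code requires.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    title: str
    explanation: str                                        # prose the agent reads
    decisions: list[str] = field(default_factory=list)      # calls the agent may need to make
    checkpoints: list[str] = field(default_factory=list)    # places a human should weigh in


def render_runbook(name: str, goal: str, phases: list[Phase]) -> str:
    """Render the phases as a plain markdown document."""
    lines = [f"# {name}", "", goal, ""]
    for number, phase in enumerate(phases, start=1):
        lines += [f"## Phase {number}: {phase.title}", "", phase.explanation, ""]
        if phase.decisions:
            lines.append("Decisions:")
            lines += [f"- {item}" for item in phase.decisions]
            lines.append("")
        if phase.checkpoints:
            lines.append("Ask the human:")
            lines += [f"- {item}" for item in phase.checkpoints]
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

In practice the markdown file itself is the artifact you edit. The data model here only shows which parts a typical runbook carries.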

Think of a weekly opportunity scan. Claude Code reads the runbook, pulls in fresh context, queries external systems for relevant opportunities, scores what it finds, and writes up a report. The runbook tells Claude what to do and how to decide, but not exactly which queries to run or which paths to follow. Those get determined fresh every run.

Runbooks are what you reach for when the work is stable enough to write down but dynamic enough that no two runs look the same.

The wrong default

The common instinct is to treat the runbook as a draft. Get it working, then port it to code. Flows feel like production. Runbooks feel like scratch paper.

That instinct is wrong for most of the interesting work.

Agent SDK flows are brittle on the kind of dynamic, complex work that runbooks handle cleanly. For a whole class of tasks teams keep trying to force into flows, the runbook is the end state. It never graduates.

Runbook-first means the runbook stays the runbook. Some workflows never want to be code.

Where Agent SDK flows break

A flow encodes a plan. Narrow tool set. Structured loop. Fixed prompt. When the plan holds, flows run fast and cheap. When the plan doesn’t hold, they break.

Real work has plans that don’t hold.

The tool set you picked last month doesn’t cover the fix you need this month. The loop structure assumes three pivots and the job actually needs seven. The prompt knows about the error modes you’ve seen. It doesn’t know about the one that showed up Tuesday. Patching one of these is easy. Patching faster than the world changes is not.

The runbook version of the same workflow doesn’t have this problem. The procedure lives in markdown. Claude Code reads the procedure and executes it against a wide-open tool surface. When the work breaks in a new way, you edit the markdown and the fix ships that afternoon. No rebuild. No redeploy. No refactor.

When an Agent SDK flow keeps going wrong, the problem is usually the category, not the code. Flows assume you know what the work looks like. Runbooks assume you’re still figuring it out.

Anthropic’s own engineering team wrote the clean version of this in their multi-agent research post: “you can’t hardcode a fixed path for exploring complex topics, as the process is inherently dynamic and path-dependent.” More work falls in that category than people think. Their earlier “Building effective agents” post makes the stronger version of the point, recommending “finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all.” The framework is a tool for a specific shape of problem, not a default.

When the runbook is the product

Runbook-shaped workflows share a handful of features.

Web search or external enrichment is part of the work. You pull in a page, a feed, a post, a ticket mid-run and react to what’s there. The runbook reads the live world on every run.

Mid-run pivots based on what you just found. Step five depends on what step four uncovered. You can’t pre-plan the branches because you don’t know yet which branch you’ll need.

The surrounding vocabulary drifts. Data sources evolve. Terminology moves. APIs quietly change their response format. A flow that froze its assumptions months ago is working off a stale map today. A runbook re-derives context every run.

A judgment step lives in the loop. Maybe not every run. Maybe one in ten. When the procedure hits an unexpected case, a human sees it and decides. A runbook pauses naturally. A flow either automates past the judgment or crashes on it.

The output is qualitative. A findings doc. A scoping brief. An enriched list with notes. Not a row in a database.

Anything with most of those features is a runbook forever. It doesn’t want to graduate.
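The five features above can be turned into a rough checklist. This is a sketch, not a calibrated rule: the field names are made up for illustration, and the threshold of three out of five is one reading of “most of those features.”

```python
from dataclasses import dataclass, fields


@dataclass
class WorkflowTraits:
    uses_live_external_data: bool   # web search or enrichment mid-run
    pivots_mid_run: bool            # later steps depend on earlier findings
    vocabulary_drifts: bool         # sources, terms, or API formats keep changing
    needs_human_judgment: bool      # a person decides on unexpected cases
    output_is_qualitative: bool     # a document rather than a database row


def runbook_score(traits: WorkflowTraits) -> int:
    """Count how many runbook-shaped features the workflow has."""
    return sum(bool(getattr(traits, f.name)) for f in fields(traits))


def is_runbook_forever(traits: WorkflowTraits, threshold: int = 3) -> bool:
    """Assumed cutoff: a majority of the five features."""
    return runbook_score(traits) >= threshold
```

The weekly opportunity scan would tick all five boxes. The email classifier would tick none.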

A runbook that keeps getting edited is doing its job. That’s the signal it’s in the right format.

The category is bigger than most builders realize. Production investigations. Opportunity discovery where the vocabulary shifts. Sales discovery. Scoping and assessment work. Prospect research that has to pull live signals. Most work a human would call judgment work with some mechanical steps around it.

Anthropic themselves shipped Agent Skills as a first-class primitive: markdown folders with a little YAML on top. Simon Willison called it “closer to the spirit of LLMs” than the protocol-and-framework world people had been building against. The industry is drifting this way because the work is drifting this way.

When an Agent SDK flow still wins

Agent SDK flows are the right call when the workflow meets the opposite criteria.

Batch or scheduled inputs. Emails arriving on a queue. Events on a webhook. Nightly data pulls.

Fixed input schema. You know what shape the data has and you control it.

One-shot work per call. Classify this email. Extract these fields. Transform this record. Small max_turns, deterministic tool set.

High throughput. You’re reusing a persistent client hundreds of times a day and latency matters.

Fail-safe defaults. A bad output is a “couldn’t determine” flag that keeps the line moving, not a crash.

The output is structured. A row. A JSON blob. A number. Not a document.
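The fail-safe default is the criterion that is easiest to get wrong in code, so here is a sketch of it. It assumes the model is asked to reply with a JSON object holding a `category` key. The allowed labels and the `undetermined` flag are illustrative names, not part of any SDK.

```python
import json

# Hypothetical label set for the example.
ALLOWED = {"billing", "bug_report", "feature_request", "other"}
FALLBACK = {"category": "undetermined", "needs_review": True}


def parse_label(raw_model_output: str) -> dict:
    """Validate the model's reply; return a flag instead of raising on bad output."""
    try:
        data = json.loads(raw_model_output)
        category = data["category"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Not JSON, not an object, or missing the key: flag it and move on.
        return dict(FALLBACK)
    if not isinstance(category, str) or category not in ALLOWED:
        # Well-formed JSON but an unexpected label: same fallback.
        return dict(FALLBACK)
    return {"category": category, "needs_review": False}
```

Every input produces a structured row, so the queue keeps draining. The flagged rows are where a human looks later.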

The obvious fits are the jobs your cron already runs. Ingest pipelines. Email classifiers. Security scanners. Content transforms with a fixed schema. Scheduled artifact generators. Anything where the workflow earned the right to be automated by being boring a hundred times in a row.

The signal that a runbook is ready to graduate is that you haven’t touched it in six weeks because there’s nothing to tune. Every run produces the same thing. Every output passes the same check. At that point porting to Python is about freeing you from babysitting, not about making the workflow smarter.
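That graduation signal is simple enough to write down. A sketch, assuming you track when the runbook was last edited and whether each recent run passed its check. The six-week quiet period comes from the text; the minimum run count is an assumption added so that an idle runbook does not qualify by default.

```python
from datetime import date, timedelta


def ready_to_graduate(
    last_edited: date,
    recent_checks: list[bool],
    today: date,
    quiet_period: timedelta = timedelta(weeks=6),
    min_runs: int = 5,  # assumed floor; pick what fits your cadence
) -> bool:
    """True when the runbook has gone untouched and every recent run passed."""
    untouched = today - last_edited >= quiet_period
    enough_runs = len(recent_checks) >= min_runs
    return untouched and enough_runs and all(recent_checks)
```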

Compile the boring ones. Leave the interesting ones in markdown.

How to have Claude Code build your runbook

You don’t write runbooks from scratch. You get Claude Code to write them with you.

The cycle is three steps.

Step 1. Do the work once with Claude Code in the loop. Open a session. Tell Claude the goal. Go through the task together, with Claude executing steps and you correcting course. Keep it conversational. Don’t optimize for a pretty final output. Optimize for Claude seeing how you actually think about the problem, where you push back on its first instinct, what signals you use to decide what’s next.

Step 2. Ask Claude to write the runbook from the session. At the end, ask Claude to produce a runbook another instance of itself could execute without you. Something like: “Write this as a runbook. Break it into phases. Put the judgment calls as explicit questions the next run should ask. Don’t gloss over the places where I corrected you. Those are the parts that matter.” The first draft is usually close. One or two rounds of tightening and it’s ready to run.

Step 3. Run the runbook in a fresh session. Watch what happens. Edit. Open a new Claude Code window. Feed it the runbook. Ask it to execute. Every time the fresh run surprises you, decide whether the runbook needs more detail or whether that’s a judgment call you want to keep for yourself. After three or four clean runs, the runbook is doing most of the work and you’re tuning the edges.

The whole cycle is a day or two for most workflows. An afternoon if you’re focused. What you end up with is a document you can keep editing forever as the work evolves, instead of a frozen Python file that goes stale the week after you deploy it.

Two things worth adding once the runbook is running.

Keep a reference file next to it with example outputs from good runs. When a run produces something that looks wrong, you have a comparison. A plain markdown file of inputs and expected outputs works. If you’re already instrumented with Langfuse or similar, a dataset works better.
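One cheap way to use that reference file is a structural comparison. The sketch below assumes reports are markdown with `#` headings and only checks that a new report has the same sections as a known good one. It is a crude first filter, not an evaluation of quality, and it would miscount `#` lines inside code fences.

```python
def section_headings(markdown: str) -> list[str]:
    """Collect heading text from lines that start with '#'."""
    return [
        line.lstrip("#").strip()
        for line in markdown.splitlines()
        if line.startswith("#")
    ]


def missing_sections(reference: str, candidate: str) -> list[str]:
    """Headings present in the good reference run but absent from the new run."""
    have = set(section_headings(candidate))
    return [heading for heading in section_headings(reference) if heading not in have]
```

An empty result means the new report has every section the reference has. Anything else tells you where to look first.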

Give the runbook a native invocation. In Claude Code that means wrapping it as a skill or a slash command so you can trigger it with one line instead of pasting the procedure every time. The ceremony drops to near zero and the runbook gets used more because it’s easier to run. That’s a virtuous loop. More runs means more chances to spot where the runbook needs sharpening.
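A sketch of the skill wrapping, assuming the project-level layout `.claude/skills/<name>/SKILL.md` with `name` and `description` in the YAML front matter. Check the current Claude Code documentation before relying on the exact path or fields. The function name `wrap_as_skill` is made up for this example.

```python
import json
from pathlib import Path


def wrap_as_skill(runbook_text: str, name: str, description: str, root: Path) -> Path:
    """Write the runbook as SKILL.md under an assumed project skills folder."""
    skill_dir = root / ".claude" / "skills" / name
    skill_dir.mkdir(parents=True, exist_ok=True)
    # json.dumps yields a double-quoted string, which is also valid YAML,
    # so colons or quotes in the description do not break the front matter.
    front_matter = (
        "---\n"
        f"name: {name}\n"
        f"description: {json.dumps(description)}\n"
        "---\n\n"
    )
    skill_file = skill_dir / "SKILL.md"
    skill_file.write_text(front_matter + runbook_text, encoding="utf-8")
    return skill_file
```

The runbook body stays untouched. Only the small header on top is new, so you keep editing the same markdown you were already running.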

Start in markdown

Start every new agentic workflow as a runbook. Graduate only the ones that go boring. Leave everything else in markdown, and keep editing.

If you’ve got an Agent SDK flow that keeps breaking in new ways, that isn’t a code problem. That’s the category telling you something. Roll it back to a runbook and let Claude run the current version of the work instead of a compiled version of last month’s assumptions.