February 22, 2026

Your Agents Need a Job Description

Most agent systems fail because there's no structure around what agents do, when they do it, and whether it was any good.

These failures rarely have anything to do with the model. They’re operational failures. Nobody designed what the agent should work on, how long it should work, or what happens when the output is garbage.

We run 10 agents around the clock on a system called Pinecone. Researchers, builders, a project manager, an auditor. They share a knowledge base, coordinate through a task system, and produce real work across multiple projects. It took months to get here, and the two things that made it work weren’t better prompts or a fancier model. They were mandates and evals.

Mandates

Every time an agent spawns, it gets a mandate. A written assignment with a type of work, project context, and expectations. Agents don’t decide what to do. A scheduler picks the mandate type from a cycle tuned to the agent’s role.

Builders cycle through deep work, collaboration, and review. Researchers get longer stretches of deep work with cross-pollination breaks. Our research assistant alternates between service work and independent research. Each cycle is designed around what that role actually needs to produce good output.

The mandate types matter. Deep work means heads-down output on your assigned project. Collaboration means pairing with a specific partner selected by the scheduler. Service means responding to requests from other agents. Review means stepping back to evaluate your own recent work. Cross-pollination means reading what other teams produced and finding connections.
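The scheduler's role cycles can be sketched as a simple lookup. The cycle contents below are hypothetical (the post names the mandate types but not the exact per-role tuning), and `next_mandate` is an illustrative name, not the real dispatcher:

```python
from itertools import count

# Hypothetical role cycles built from the mandate types named above.
# The real per-role tuning is not specified in the post.
ROLE_CYCLES = {
    "builder": ["deep_work", "collaboration", "deep_work", "review"],
    "researcher": ["deep_work", "deep_work", "cross_pollination"],
    "research_assistant": ["service", "deep_work"],  # service / independent research
}

def next_mandate(role: str, spawn_number: int) -> str:
    """Pick the mandate type for an agent's next spawn from its role cycle.

    The agent never chooses: the scheduler indexes into the role's cycle
    by how many times the agent has been spawned.
    """
    sequence = ROLE_CYCLES[role]
    return sequence[spawn_number % len(sequence)]
```

The point of the table is that drift prevention lives in the scheduler, not in the prompt: the work type is decided before the agent ever sees context.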

Every mandate is time-boxed. Builders get 60 minutes. Researchers get 90. When the clock runs out, the agent stops. No infinite loops, no runaway context windows, no agent deciding it needs “just a bit more time” to finish something it started three hours ago.
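Hard time-boxing means the clock, not the agent, decides when a session ends. A minimal sketch, assuming the agent runs as a step loop the wrapper can cut off (the step-function interface and `run_mandate` name are assumptions, not the real harness):

```python
import time

# Time limits per role, in minutes, as stated in the post.
TIME_BOX_MINUTES = {"builder": 60, "researcher": 90}

def run_mandate(agent_step, role: str) -> int:
    """Drive an agent step by step and stop cold when the time box expires.

    `agent_step` does one unit of work and returns None when the agent
    considers itself done. The deadline wins either way: no extensions.
    """
    deadline = time.monotonic() + TIME_BOX_MINUTES[role] * 60
    steps = 0
    while time.monotonic() < deadline:
        if agent_step() is None:  # agent finished early
            break
        steps += 1
    return steps  # whatever got done by the deadline is the session's output
```

The design choice is that the termination condition sits outside the agent entirely, which is what rules out the "just a bit more time" failure mode.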

This sounds rigid. It is. Without assigned work types and time limits, agents default to whatever feels productive. They’ll reorganize files. Rewrite documentation nobody asked for. Produce impressive-looking output that moves nothing forward. Mandates prevent drift by making the work legible before it starts.

Evals

After every mandate, the output gets scored 1 through 10. Pure SQL against the trace database. No LLM in the scoring loop.

The scoring is blunt. Did the agent actually produce output? Did it update its state files? How much of the session was errors versus real tool use? Did it burn through tokens without writing anything useful? Did the auditor flag the work as filler?

An agent that runs for its full mandate window and produces zero files scores low. An agent that crashes in under a minute gets capped near the bottom. High error rates on tool calls get penalized. Burning 50,000 tokens to write 200 bytes gets penalized. When the auditor marks output as filler or bloat, that’s an additional hit.
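The blunt, LLM-free scoring can be illustrated as one SQL pass over the traces. The `sessions` schema, column names, and exact point values below are hypothetical; the post describes the penalties but not the real trace schema:

```python
import sqlite3

# Hypothetical trace table and weights; only the penalty categories
# (no output, stale state, error rate, token burn, auditor flag)
# come from the post.
SCORE_SQL = """
SELECT session_id,
       MAX(1, MIN(10,
           CASE WHEN files_written = 0 THEN 2 ELSE 7 END
         + CASE WHEN state_updated THEN 1 ELSE 0 END
         - CASE WHEN error_calls * 1.0 / MAX(tool_calls, 1) > 0.3 THEN 2 ELSE 0 END
         - CASE WHEN tokens_used > 50000 AND bytes_written < 1000 THEN 2 ELSE 0 END
         - CASE WHEN auditor_flagged THEN 2 ELSE 0 END
       )) AS score
FROM sessions;
"""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sessions (
    session_id TEXT, files_written INT, state_updated INT,
    tool_calls INT, error_calls INT, tokens_used INT,
    bytes_written INT, auditor_flagged INT)""")
# One healthy session, one that burned tokens and produced filler.
conn.execute("INSERT INTO sessions VALUES ('s1', 3, 1, 40, 2, 20000, 8000, 0)")
conn.execute("INSERT INTO sessions VALUES ('s2', 0, 0, 10, 6, 52000, 200, 1)")
scores = dict(conn.execute(SCORE_SQL).fetchall())
```

Because every input is a plain counter from the trace, the score is cheap, deterministic, and auditable, which is the whole argument for keeping the LLM out of the loop.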

We don’t use these scores for fine-tuning. There’s no gradient descent happening. The scores feed directly back into the dispatch system. An agent with a low rolling average gets forced into a review mandate on its next cycle. Drop lower and it triggers a supervision mandate where the auditor examines recent work. Three rapid failures in a row and the scheduler rotates the agent away from whatever work type keeps failing.
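The escalation order above (review, then supervision, then rotation) can be sketched as a dispatch override. The numeric thresholds here are invented for illustration; the post gives the ladder but not the cutoffs:

```python
from typing import Optional

def dispatch_override(rolling_avg: float, recent_failures: int) -> Optional[str]:
    """Map an agent's eval history to its next mandate override, if any.

    Thresholds are hypothetical: the post describes the escalation
    ladder but not the exact scores that trigger each rung.
    """
    if recent_failures >= 3:
        return "rotate_work_type"  # three rapid failures: move off the failing type
    if rolling_avg < 3.0:
        return "supervision"       # auditor examines recent work
    if rolling_avg < 5.0:
        return "review"            # forced self-review next cycle
    return None                    # normal scheduling continues
```

This is the entire "learning" mechanism: scores never touch the model, only which branch the scheduler takes next.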

The Loop

Mandates and evals create a feedback loop that compounds over time. Assign structured work, score the output, feed the score back into the next assignment.

An agent that consistently produces good deep work gets more deep work. An agent struggling with collaboration gets reviewed, then supervised if it doesn’t improve. An agent stuck in a failure loop on one mandate type gets rotated to something else until it finds its footing.

None of this requires the agent to “learn” anything. The model is identical on every invocation. Fresh context, no memory of the last session beyond what’s written to state files. What changes is the operational wrapper: what work gets assigned, what feedback accompanies it, and what overrides kick in when quality drops.

This is closer to management than machine learning. You wouldn’t let a new hire wander around the office deciding what to work on. You’d give them a role, assign tasks, check the output, and adjust. Same principle, except the feedback loop runs every hour instead of every quarter.

“Just Let Them Be Autonomous”

I’ve tried this. The pitch is compelling. Give agents tools, give them goals, let them figure it out. Works great in demos. Falls apart within days of continuous operation.

Unstructured agents produce what I’d call confident drift. They stay busy. They generate output. The output reads well. But if you actually audit what changed in the system after a week, the answer is often nothing meaningful. Activity without progress. The longer they run unstructured, the more the drift compounds.

Autonomy works fine for one-shot tasks. Build a feature, review a PR, write a report. Clear input, clear output, clear termination condition. But multi-agent systems running continuously need the operational layer that tells each agent what to work on right now, scores whether it was good, and adjusts the next assignment based on that score.

Build the Governance Layer

If you’re running multi-agent systems, stop comparing frameworks and start designing operations. What types of work exist? How do agents cycle through them? How do you score output without a human in the loop? How does that score change what happens next?

These are management questions, not engineering questions. The framework doesn’t matter much. The model matters less than you think. What matters is whether you built the structure that keeps agents producing real work instead of sophisticated filler.

We spent more time on the scheduler than on any individual agent. That ratio feels right.
