1.6 Billion Tokens on OpenClaw
Nine agents ran around the clock this week. Most produced garbage on Monday. By Friday, after four debug cycles, the architecture looked completely different.
Getting an agent to complete a single task takes an afternoon. Getting nine of them to run reliably for 16 hours straight takes weeks of iteration. We’re in those weeks right now.
The Debug Cycle
The agents run 24/7 on cron. They run whether they're working or not. We pull the results, review everything with Opus 4.6, find the failure patterns, fix prompt structures, context management, and tool use, then push the updates. Four or five of those review cycles this week, each one producing real architectural improvements. The agents never stop between cycles.
Most of the failures that matter don’t show up in short runs. They surface at hour eight, hour twelve, once context windows fill and the work gets complex enough to stress what you built. Short demos hide these problems. Long runs expose them. You need both the length and the volume. Length to find failures, volume to confirm your fixes actually hold.
Agents are non-deterministic. Give one the same input twice and you get different execution paths, different output quality. You can’t write a test suite and call it done. You have to watch them work, over hours, across different inputs, and build an understanding of where they’re solid and where they fall apart. That understanding only comes from runtime.
The Stack
We wrote our own agent runtime. We replaced OpenClaw with Pinecone: 800 lines of Python that do exactly what we need. We were using maybe 5% of that framework and spending more time debugging its abstractions than our own agents. Simpler stack, cleaner debugging, and no attack surface from public skill registries nobody's auditing.
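The runtime itself isn't shown here, but the core of a minimal agent loop in that spirit might look like the sketch below. Everything in it is hypothetical, not the actual Pinecone code: the `Agent`/`ToolCall` names, the message format, and the idea that the model returns either a tool request or a final string.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch of a minimal agent loop -- not the actual runtime.
# A "model" here is any callable mapping a message history to either a
# ToolCall (do more work) or a plain string (final answer).

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Agent:
    model: Callable[[list], object]           # returns ToolCall or str
    tools: dict[str, Callable] = field(default_factory=dict)
    max_steps: int = 10                       # hard stop for runaway loops

    def run(self, task: str) -> str:
        messages = [{"role": "user", "content": task}]
        for _ in range(self.max_steps):
            action = self.model(messages)
            if isinstance(action, str):       # model produced a final answer
                return action
            result = self.tools[action.name](**action.args)
            messages.append({"role": "tool", "name": action.name,
                             "content": str(result)})
        raise RuntimeError("agent exceeded max_steps")
```

The appeal of owning this loop is that the hour-eight failure modes described above (full context windows, bad tool dispatch) live in a few dozen lines you wrote, not in a framework's abstractions.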
Market Radar runs on it, scraping 22 subreddits, 24 YouTube channels, and 16 RSS feeds around the clock. Nine agents across two product tracks, all on local hardware. Two units drawing 120 watts, running MiniMax M2.5, an open model that beats Claude Sonnet on coding benchmarks and handles tool calling better than Opus. A year ago, open models weren’t close. Now they’re ahead on the capabilities that matter most for agents.
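For the RSS leg of that pipeline, the parsing step can be sketched with nothing but the standard library. This is an illustration, not the actual Market Radar code: the function name and returned fields are assumptions, and fetching is deliberately left out so parsing stays testable offline.

```python
import xml.etree.ElementTree as ET

# Illustrative RSS parsing for a feed scraper. Separating parsing from
# fetching lets each feed's items be deduplicated and queued for an
# agent downstream without touching the network in tests.

def parse_rss(xml_text: str) -> list[dict]:
    """Extract title/link pairs from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title", ""),
         "link": item.findtext("link", "")}
        for item in root.iter("item")
    ]
```

In a real pipeline something like `urllib.request.urlopen` would supply `xml_text` on a timer, one schedule per feed.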
1.6 billion tokens this week. About $400 a week at API rates. Our hardware’s already paid for.
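The back-of-envelope behind that $400 figure, with the blended per-million-token rate inferred from the numbers above rather than quoted from any price sheet:

```python
tokens_per_week = 1.6e9            # tokens pushed through this week
api_cost_per_week = 400.0          # dollars, at API rates (approximate)

# Implied blended rate in dollars per million tokens
rate_per_million = api_cost_per_week / (tokens_per_week / 1e6)
print(rate_per_million)            # 0.25
```

About $0.25 per million tokens blended, which is why the comparison only makes sense against cheap open-model API pricing, not frontier-model rates.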
Why Local Changes How You Build
Pay-per-use APIs create a scarcity mindset. Every token costs money, so you ration experiments, question whether a run is worth it, kill agents early when they look like they’re going sideways. You optimize for spend instead of learning.
We front-loaded our costs with hardware. Two units, paid once. Now the only way to make that investment worthwhile is to push as many tokens through the system as possible. That flips the entire mindset. Instead of “is this experiment worth the tokens,” it’s “why aren’t we running more experiments right now.”
We run speculative workflows overnight. We try multi-agent architectures that burn hundreds of thousands of tokens before we know if they’re going anywhere. We let agents attempt things we’re not sure are possible, because the hardware sitting idle is the waste, not the tokens running through it. Electricity is negligible next to what we paid for the machines. The cost is sunk. Usage is how you extract value from it.
Over weeks of this, you develop a feel for what agents can handle that you can’t get any other way. You learn which workflows genuinely automate and which ones you assumed would but don’t. Some tasks that agents struggle with in two-hour tests they handle fine over eight hours once they have enough context. Other tasks that look easy in demos fall apart completely at scale. All of this comes from runtime, and runtime only comes cheap when you own the hardware.
None of our agents are productive yet. They run 24/7 anyway. Every cycle they’re measurably better, and they’re generating data and hitting failure modes whether we’re watching or not. We’re at about 60% capacity on this cluster. There’s room for 50 agents.
We’ll write about what happens when they start producing.