February 22, 2026

Your Agent Forgot Everything by Hour Eight

Long-running agents degrade as context fills up. Shorter contexts, structured memory, and tiered consolidation keep them sharp.

We run 10 agents around the clock, burning through 2 billion tokens a week. Most of the failures we’ve debugged didn’t happen in the first hour. They happened at hour eight, when the context window was packed with tool outputs, stale reasoning traces, and instructions the model stopped paying attention to three hours ago. Context management is the difference between agents that degrade and agents that run indefinitely.

Context Is Not Memory

Context windows are working memory. Every token you add dilutes attention on every other token. At 10K tokens, your agent follows instructions precisely. At 100K tokens, it starts missing things that were clear at 10K. At 180K, it’s ignoring half of what you gave it.

This is why short demos look great and production agents fall apart. The demo ran for 5 minutes with a clean context. Production ran for 8 hours and the context is a landfill of API responses, prior reasoning, and tool call artifacts. The model hasn’t changed. The signal-to-noise ratio in its context has.

Every provider, every model, same problem. Context is finite working memory, not storage.

The Obvious Fix That Doesn’t Work

Naive truncation drops early instructions and system context. Your agent loses its mandate. Summarization loses specifics that matter later: exact thresholds, edge cases, numeric values that drove a decision two hours ago. Sliding windows are better but still guess about what to keep.

All three approaches try to make long contexts work. The better move is to not have long contexts in the first place.
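The truncation failure is easy to demonstrate. A minimal sketch (hypothetical messages, token counts approximated by word counts) shows why dropping from the front loses the agent's mandate first:

```python
# Hypothetical sketch: naive front-truncation drops the system prompt,
# the one message the agent can least afford to lose.
# Token counts are approximated by word counts for illustration.

def truncate_front(messages, max_tokens):
    """Naive truncation: drop oldest messages until the rest fit."""
    kept = list(messages)
    while kept and sum(len(text.split()) for _, text in kept) > max_tokens:
        kept.pop(0)  # the system prompt is the first casualty
    return kept

history = [
    ("system", "You are a billing agent. Never issue refunds over $50."),
    ("user", "Refund order 1182."),
    ("assistant", "Refund of $12 issued for order 1182."),
    ("user", "Now refund order 1183, it was $900."),
]

trimmed = truncate_front(history, max_tokens=20)
# The $50 refund cap is gone; the agent has lost its mandate.
assert all(role != "system" for role, _ in trimmed)
```

The oldest message is usually the most important one, which is exactly backwards from what truncation assumes.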

What Actually Works

Keep Contexts Short on Purpose

An 8-hour session should be forty 12-minute tasks, not one marathon conversation. Each task gets a fresh context with only the information it needs.

Our agents get mandates: discrete tasks with specific objectives and bounded scope. Execute the mandate, return results, terminate. The next mandate starts clean. No accumulated garbage from prior work.

This works with any API. Structure your agent as a task runner, not a conversation partner. Short-lived contexts don’t degrade because they don’t live long enough to degrade.
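The task-runner shape is simple enough to sketch. This is a hedged illustration, not our production code; `call_llm` stands in for any chat-completion API, and the `Mandate` fields are illustrative:

```python
# Hypothetical sketch of the mandate pattern: each task runs in a fresh
# context instead of one long conversation.

from dataclasses import dataclass

@dataclass
class Mandate:
    objective: str       # discrete task with a specific objective
    scope: str           # bounded scope: what the agent may touch
    context: list[str]   # only the facts this task needs

def call_llm(messages):
    # Stand-in for a real API call (any provider's chat endpoint).
    return f"done: {messages[-1]['content']}"

def run_mandate(mandate: Mandate) -> str:
    # Fresh context every time: scope + injected facts + objective.
    messages = [
        {"role": "system", "content": f"Scope: {mandate.scope}"},
        *[{"role": "user", "content": fact} for fact in mandate.context],
        {"role": "user", "content": mandate.objective},
    ]
    result = call_llm(messages)
    # Context is discarded here; nothing carries over to the next mandate.
    return result

# An "8-hour session" is just a queue of short mandates.
queue = [
    Mandate("Summarize overnight alerts", "read-only", ["alert feed: ..."]),
    Mandate("Triage new tickets", "ticketing API", ["ticket list: ..."]),
]
results = [run_mandate(m) for m in queue]
```

The design choice that matters is that `messages` is a local variable: it cannot accumulate across mandates because it goes out of scope when the function returns.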

External Memory with remember() and recall()

Persistent knowledge doesn’t belong in context. Move it to retrieval.

We built a remember/recall system where agents decide what’s worth keeping. When an agent learns something useful mid-task (“this API returns paginated results capped at 100”), it calls remember(). The next task on a related topic gets that knowledge injected via recall(), matched by embedding similarity and keyword overlap.

Implementation: SQLite, 384-dim embeddings from all-MiniLM-L6-v2, frequency scoring so heavily-used memories surface first, weekly pruning of the bottom 30% by recall frequency. Private scopes per agent, shared scopes across the fleet.

This is about a week old for us. The pattern is proven, every RAG system works this way, but our agent-level implementation is fresh and we’re still tuning the pruning threshold. The infrastructure is minimal: SQLite and a small embedding model. No GPU cluster required.
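A stripped-down sketch of the store, to show the shape of the API. The production version described above scores with 384-dim embeddings and frequency-weighted pruning; this dependency-free version matches on keyword overlap only, and the schema and memory contents are illustrative:

```python
# Minimal remember()/recall() sketch. Embedding similarity would replace
# the keyword-overlap scoring; hit counts feed later pruning.

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE memories (
    id INTEGER PRIMARY KEY, scope TEXT, text TEXT, hits INTEGER DEFAULT 0)""")

def remember(scope: str, text: str) -> None:
    db.execute("INSERT INTO memories (scope, text) VALUES (?, ?)", (scope, text))

def recall(scope: str, query: str, k: int = 3) -> list[str]:
    q = set(query.lower().split())
    rows = db.execute(
        "SELECT id, text FROM memories WHERE scope = ?", (scope,)).fetchall()
    # Score by keyword overlap; an embedding dot product would go here.
    scored = sorted(rows, key=lambda r: len(q & set(r[1].lower().split())),
                    reverse=True)[:k]
    # Bump hit counts so heavily-used memories survive pruning.
    for mem_id, _ in scored:
        db.execute("UPDATE memories SET hits = hits + 1 WHERE id = ?", (mem_id,))
    return [text for _, text in scored]

remember("shared", "billing API returns paginated results capped at 100")
remember("shared", "staging cluster resets every Sunday")
hits = recall("shared", "how is the billing API paginated", k=1)
```

`recall()` runs at dispatch time, so the next task starts with relevant knowledge already in context instead of rediscovering it.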

Tiered Memory Consolidation

Raw data is valuable on day one, unmanageable on day thirty. Compress it in stages.

Our Market Radar system processes feeds from 22 subreddits, 24 YouTube channels, and 16 RSS feeds. That volume of raw intelligence would blow out any context window within hours. So we consolidate on a schedule:

  • Daily summaries: MiniMax M2.5 compresses each day’s raw feeds on ingest, so agents never see the firehose. Complete daily summaries are kept short-term.
  • Weekly summaries: daily summaries roll up into weekly digests.
  • Monthly summaries: weekly digests roll up into monthly summaries, major themes only.

Runs on cron. Each tier compresses further, so agents get dense signal no matter how far back they need to look. When an agent needs market context, it gets a layered injection: monthly themes for long-range patterns, weekly summaries for recent trends, daily summaries for this week. Progressively compressed history, all fitting in a manageable context.

This is fully integrated into agent dispatch and runs on schedule.

Generalizable version: a cron job and any LLM. Haiku-class models handle summarization fine.
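The rollup itself is a few lines once the summarization call exists. A hedged sketch, with `summarize` standing in for the LLM call and the feed contents invented for illustration:

```python
# Tiered consolidation sketch: each tier summarizes the one below it,
# compressing harder as it goes. Run the whole thing from cron.

from collections import defaultdict

def summarize(texts: list[str], budget: int) -> str:
    # Stand-in for an LLM summarization call with a size budget;
    # the real version compresses, this one just joins and truncates.
    return " | ".join(texts)[:budget]

# Daily summaries keyed by (month, week, day), produced on ingest.
daily = {
    ("2026-02", "W3", "Mon"): "launch chatter up on r/LocalLLaMA",
    ("2026-02", "W3", "Tue"): "two new RSS posts on context windows",
    ("2026-02", "W4", "Mon"): "YouTube review of agent frameworks",
}

# Weekly tier: roll daily summaries into per-week digests.
weekly_parts = defaultdict(list)
for (month, week, _day), text in sorted(daily.items()):
    weekly_parts[(month, week)].append(text)
weekly = {k: summarize(v, budget=200) for k, v in weekly_parts.items()}

# Monthly tier: roll weekly digests into themes, compressed harder.
monthly_parts = defaultdict(list)
for (month, _week), text in sorted(weekly.items()):
    monthly_parts[month].append(text)
monthly = {k: summarize(v, budget=100) for k, v in monthly_parts.items()}
```

At injection time, an agent gets `monthly` for long-range patterns, `weekly` for recent trends, and this week's entries from `daily`, which is the layered context described above.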

Scope Tool Outputs

Most context bloat comes from dumping entire tool responses. A 5K token API response when you need 200 tokens of it. A full database query result when you need three fields from one row.

Parse the response, extract what matters, inject only that. This is the cheapest improvement on the list. No infrastructure, no new systems. Just stop feeding your agent raw firehose output.
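Concretely, the fix is one small parsing function between the tool and the context. A sketch with an invented response shape and field names:

```python
# Sketch: extract the fields the task needs instead of injecting the raw
# tool response. The response structure here is illustrative.

import json

raw_response = json.dumps({
    "meta": {"request_id": "abc123", "elapsed_ms": 84, "region": "us-east-1"},
    "data": [{"order_id": 1183, "status": "refund_requested", "amount": 900,
              "customer": {"id": 55, "history": ["..."]}}],
    "pagination": {"page": 1, "per_page": 100, "total": 1},
})

def scope_output(raw: str) -> str:
    """Parse the tool response and keep only what the task needs."""
    order = json.loads(raw)["data"][0]
    return f"order {order['order_id']}: {order['status']}, ${order['amount']}"

context_line = scope_output(raw_response)
# One short line reaches the context; the metadata and pagination noise don't.
```

Every tool in the agent's toolbox gets a scoping function like this, and the savings compound across a whole session.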

Ten agents, 2 billion tokens a week. The agents at hour eight work exactly like the agents at hour one, because hour eight is actually task number forty running in a fresh context with relevant memory injected.
