February 27, 2026

Your Agent's Personality Is a Search Problem

Stop hand-tuning agent personalities. Define the trait space, run variants, breed the winners. Intuition doesn't scale.

We spent months hand-tuning agent personalities. Adjust the confidence level. Rewrite the voice description. Tweak how much skepticism to inject. Test it, read the outputs, tweak again. The classic loop: change one thing, evaluate, repeat.

It’s the same trap as hand-tuning hyperparameters. You find something that works okay, you convince yourself it’s good enough, and you stop searching. You’ve found a local optimum and you don’t even know it.

Personality Is a Parameter Space

Once you see it, you can’t unsee it. An agent’s personality isn’t a single thing you write in a system prompt. It’s a point in a high-dimensional space:

  • Voice — sardonic, clinical, warm, provocative
  • Epistemology — how it evaluates evidence, what it treats as proof
  • Confidence calibration — how bold its claims are, how much it hedges
  • Exploration strategy — does it go deep on one thread or wide across many
  • Skepticism level — how aggressively it pushes back on inputs
  • Output structure — dense analysis, bullet points, narrative storytelling

Each of these is an independent axis. You can have a warm voice with aggressive skepticism. You can have clinical epistemology with narrative output structure. The combinations are enormous, and your intuition about which combinations work is almost certainly wrong.

We know this because we tested it.
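To make the size of that space concrete, here is a minimal sketch that counts the combinations for six axes like the ones listed above. The value pools are illustrative assumptions (only the voice values come from the list), so treat the total as an order-of-magnitude illustration rather than a real figure.

```python
from itertools import product
from math import prod

# Hypothetical value pools. Only the "voice" values come from the post;
# the rest are placeholders chosen for illustration.
TRAIT_SPACE = {
    "voice": ["sardonic", "clinical", "warm", "provocative"],
    "epistemology": ["empirical", "first_principles", "consensus"],
    "confidence": ["bold", "balanced", "hedged"],
    "exploration": ["deep", "wide"],
    "skepticism": ["low", "medium", "high"],
    "output_structure": ["dense_analysis", "bullet_points", "narrative"],
}


def count_combinations(space):
    """Number of distinct personalities: the product of the pool sizes."""
    return prod(len(values) for values in space.values())


def all_personalities(space):
    """Yield every combination as a dict mapping trait name to value."""
    keys = list(space)
    for combo in product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


print(count_combinations(TRAIT_SPACE))  # 4 * 3 * 3 * 2 * 3 * 3 = 648
```

Even with these small pools there are 648 candidates, far more than anyone evaluates by hand, and every extra axis multiplies the total again.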

Evolution, Not Iteration

We built a system that treats personality design the way you’d treat any optimization problem in a large search space: with evolution.

The setup: define trait dimensions as genes. Each gene has a pool of possible values — different voices, different reasoning frameworks, different confidence profiles. Create a population of variants, each with a random combination of traits. Run them all in shadow mode on the same inputs, so you can compare outputs directly without affecting production.
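A rough sketch of that setup, assuming purely categorical genes. The names `GENE_POOLS`, `random_variant`, and `initial_population` are hypothetical, as are the pool contents; the point is only that a variant is one value drawn per gene.

```python
import random

# Hypothetical gene pools: each gene (trait dimension) maps to its allowed values.
GENE_POOLS = {
    "voice": ["sardonic", "clinical", "warm", "provocative"],
    "reasoning": ["evidence_first", "first_principles", "devils_advocate"],
    "confidence": ["bold", "balanced", "hedged"],
}


def random_variant(pools, rng):
    """Draw one value per gene to form a single personality variant."""
    return {gene: rng.choice(values) for gene, values in pools.items()}


def initial_population(pools, size, seed=0):
    """Create `size` random variants; the seed makes the run reproducible."""
    rng = random.Random(seed)
    return [random_variant(pools, rng) for _ in range(size)]


population = initial_population(GENE_POOLS, size=8)
for variant in population:
    print(variant)
```

Seeding the generator is a design choice worth keeping: it lets you recreate a generation exactly when you want to re-score it later.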

Score the outputs. Breed the winners — take the traits from high-performing variants and combine them. Mutate a few genes randomly to explore new territory. Run the next generation. Repeat.
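One way the breed-and-mutate step might look, as a sketch rather than the exact procedure described here. It assumes uniform crossover, a per-gene mutation rate, and elitism (winners carried forward unchanged); all three are illustrative choices, and the function names are hypothetical.

```python
import random


def crossover(parent_a, parent_b, rng):
    """Uniform crossover: each gene is copied from one parent chosen at random."""
    return {gene: rng.choice([parent_a[gene], parent_b[gene]]) for gene in parent_a}


def mutate(variant, pools, rate, rng):
    """With probability `rate` per gene, resample that gene from its pool."""
    child = dict(variant)
    for gene, values in pools.items():
        if rng.random() < rate:
            child[gene] = rng.choice(values)
    return child


def next_generation(population, scores, pools, n_survivors=2, mutation_rate=0.1, seed=0):
    """Keep the top scorers, then refill the population with mutated offspring."""
    rng = random.Random(seed)
    ranked = sorted(zip(scores, range(len(population))), reverse=True)
    survivors = [dict(population[i]) for _, i in ranked[:n_survivors]]
    children = list(survivors)  # elitism (an assumption): winners survive as-is
    while len(children) < len(population):
        parent_a, parent_b = rng.sample(survivors, 2)  # needs n_survivors >= 2
        children.append(mutate(crossover(parent_a, parent_b, rng), pools, mutation_rate, rng))
    return children


# Hypothetical pools and scores, for illustration only.
POOLS = {
    "voice": ["sardonic", "clinical", "warm"],
    "skepticism": ["low", "medium", "high"],
}
population = [
    {"voice": "warm", "skepticism": "high"},
    {"voice": "clinical", "skepticism": "low"},
    {"voice": "sardonic", "skepticism": "medium"},
    {"voice": "warm", "skepticism": "low"},
]
scores = [0.9, 0.4, 0.7, 0.2]
new_population = next_generation(population, scores, POOLS)
```

Keeping the winners unchanged means a generation can never score worse than the last one on the same inputs, at the cost of slightly less exploration.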

This is a genetic algorithm. Nothing exotic. The same approach people use for architecture search, hyperparameter optimization, game AI. We’re just applying it to the part of the agent that everyone else hand-crafts and calls done.

What Surprised Us

The winning personalities weren’t ones we would have designed. Full stop.

Combinations we’d never have tried — pairing a voice style we considered too aggressive with a reasoning framework we considered too conservative — produced outputs that were sharper, more engaging, and more useful than our hand-crafted baseline.

This makes sense in retrospect. Human intuition about personality is shaped by human experience. What makes a person effective at analysis isn’t the same as what makes an LLM effective at analysis. The model’s relationship to confidence, skepticism, and voice is fundamentally different from ours. We were projecting human personality dynamics onto a system that doesn’t share them.

The search found combinations that work for the model, not combinations that would work for a human. That’s the entire point of searching instead of designing.

Shadow Mode Is the Key

You can’t do this in production. Running 10 personality variants on real tasks with real users would be chaos. Shadow mode is what makes it practical.

Every variant processes the same inputs as the production agent. The outputs get stored and scored but never shipped. You’re running a parallel universe where different personalities compete on the same problems, and you can compare them directly.
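A minimal sketch of the shadow-mode idea, assuming each agent can be modelled as a function from input text to output text. `ShadowRunner` and its fields are hypothetical names, and a real deployment would more likely run the variants asynchronously so they add no latency to the production path.

```python
from dataclasses import dataclass, field
from typing import Callable

Agent = Callable[[str], str]  # assumption: an agent maps input text to output text


@dataclass
class ShadowRunner:
    production: Agent
    variants: dict[str, Agent]
    log: list[dict] = field(default_factory=list)

    def handle(self, user_input: str) -> str:
        """Ship the production output; store variant outputs for later scoring."""
        shipped = self.production(user_input)
        for name, agent in self.variants.items():
            try:
                output = agent(user_input)
            except Exception as exc:  # a broken variant must never affect the user
                output = f"<error: {exc}>"
            self.log.append({"variant": name, "input": user_input, "output": output})
        return shipped


# Stub agents standing in for real model calls.
def production_agent(text: str) -> str:
    return "prod:" + text


def variant_a(text: str) -> str:
    return "a:" + text


def broken_variant(text: str) -> str:
    raise RuntimeError("variant crashed")


runner = ShadowRunner(production_agent, {"a": variant_a, "broken": broken_variant})
reply = runner.handle("hello")
```

Only `reply` reaches the user. Everything in `runner.log` exists purely so the variants can be compared on identical inputs.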

The scoring can be automated (output quality metrics, engagement proxies) or manual (read a batch, pick the ones you’d actually want to ship). We use a manual selection round — we call it judgement day — where we read the top variants’ outputs side by side and pick winners. Then breed them, mutate, and run another generation.

It’s slower than automated scoring but catches things metrics miss. A variant might score well on surface metrics but feel wrong in ways that are hard to quantify. Human judgement in the loop keeps the search grounded.
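If you want both signals, one hedged way to combine them is a weighted blend of an automated score and the share of human picks a variant received. The function name, the 0-to-1 score range, and the 0.7 weight below are all assumptions for illustration, not a description of how the selection round actually works.

```python
from collections import Counter


def blended_fitness(auto_scores, human_picks, human_weight=0.7):
    """Blend automated scores (assumed in [0, 1]) with normalized human pick counts."""
    picks = Counter(human_picks)
    max_picks = max(picks.values(), default=0)
    fitness = {}
    for name, auto in auto_scores.items():
        # Normalize picks so the most-picked variant gets a human score of 1.0.
        human = picks[name] / max_picks if max_picks else 0.0
        fitness[name] = (1 - human_weight) * auto + human_weight * human
    return fitness


# Hypothetical data: "a" looks better on metrics, but reviewers preferred "b".
auto_scores = {"a": 0.9, "b": 0.5}
human_picks = ["b", "b", "a"]
fitness = blended_fitness(auto_scores, human_picks)
winner = max(fitness, key=fitness.get)
```

With the human weight above 0.5, a variant that reviewers consistently prefer can outrank one that merely scores well on surface metrics.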

The Trait Space Matters More Than the Algorithm

The genetic algorithm is simple. The hard part is defining the right trait dimensions.

If your trait space is too narrow (just voice and tone, say), you’re only searching a small corner of the possible space. You’ll find the best voice/tone combo but miss that the reasoning framework was the lever that actually mattered.

If your trait space is too broad (every conceivable dimension of behavior), the search space explodes: each added dimension multiplies the number of combinations, and you’ll need thousands of generations to converge on anything useful.

We landed on about a dozen dimensions that cover the axes where personality actually affects output quality. Voice, epistemology, confidence, exploration, skepticism, output structure, plus a few continuous parameters (how aggressive on predictions, how much to weight certain topics, how contrarian to be). Enough to explore meaningfully, constrained enough to converge.
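Continuous parameters need a different mutation than categorical genes: instead of resampling from a pool, you nudge the value. A sketch assuming each continuous gene lives in a bounded range and is perturbed with Gaussian noise; the gene names, bounds, and `sigma` are hypothetical.

```python
import random

# Hypothetical continuous genes with (low, high) bounds.
CONTINUOUS_BOUNDS = {
    "prediction_aggressiveness": (0.0, 1.0),
    "contrarianism": (0.0, 1.0),
}


def mutate_continuous(variant, bounds, sigma, rng):
    """Add Gaussian noise to each continuous gene, clamped to its bounds."""
    child = dict(variant)
    for gene, (low, high) in bounds.items():
        nudged = child[gene] + rng.gauss(0.0, sigma)
        child[gene] = min(high, max(low, nudged))
    return child


rng = random.Random(0)
parent = {"voice": "warm", "prediction_aggressiveness": 0.95, "contrarianism": 0.1}
child = mutate_continuous(parent, CONTINUOUS_BOUNDS, sigma=0.2, rng=rng)
```

Categorical genes such as `voice` pass through untouched, so this can be applied alongside whatever mutation you use for the discrete traits.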

Getting these dimensions right required understanding what actually varies between good and bad outputs. That’s domain knowledge you have to bring — the algorithm can’t discover the dimensions for you, only search within them.

Stop Hand-Tuning

If you’re running agents in production and you designed their personalities by hand, you’re sitting on a local optimum. You might be close to the best possible personality for your use case. You’re probably not.

The move is straightforward:

  1. Define your trait dimensions. What are the independent axes of personality that matter for your agents’ jobs?
  2. Create a population. 5-10 variants with different trait combinations.
  3. Run them in shadow mode on real inputs.
  4. Score and select. Automated metrics plus human judgement.
  5. Breed and mutate. Combine winning traits, introduce randomness.
  6. Repeat.

You don’t need a sophisticated framework. A spreadsheet of trait combinations and a script to inject them into system prompts will get you started. The insight isn’t in the tooling — it’s in accepting that personality design is a search problem, not a design problem, and acting accordingly.
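As a starting point, the spreadsheet-plus-script approach might look like the sketch below. It assumes the spreadsheet is exported as CSV with a `variant` column, and that each trait value maps to a sentence of prompt text; the snippet wording, column names, and function names are all hypothetical.

```python
import csv
import io

# Hypothetical mapping from trait values to system-prompt sentences.
TRAIT_SNIPPETS = {
    "voice": {
        "warm": "Write in a warm, encouraging voice.",
        "clinical": "Write in a detached, clinical voice.",
    },
    "skepticism": {
        "high": "Challenge weak claims in the input before building on them.",
        "low": "Take the input at face value unless it is clearly wrong.",
    },
}

# Stand-in for a spreadsheet exported as CSV.
SPREADSHEET = """variant,voice,skepticism
v1,warm,high
v2,clinical,low
"""


def load_variants(csv_text):
    """Parse CSV rows into {variant_name: {trait: value}}."""
    variants = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop("variant")
        variants[name] = dict(row)
    return variants


def build_system_prompt(base_prompt, traits):
    """Append one instruction line per trait to the base system prompt."""
    lines = [base_prompt.strip(), "", "Personality:"]
    for gene, value in traits.items():
        lines.append("- " + TRAIT_SNIPPETS[gene][value])
    return "\n".join(lines)


variants = load_variants(SPREADSHEET)
prompts = {
    name: build_system_prompt("You are a research analyst.", traits)
    for name, traits in variants.items()
}
print(prompts["v1"])
```

Keeping the trait-to-text mapping in one table means a new variant is a new spreadsheet row, not a new hand-written prompt.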

Your intuition about what makes a good agent is probably wrong in ways you can’t see until you test the alternatives. Run the search.
