Stop planning, start probing and evolving
How architectural probes and evolution promote human-agent alignment and ease context management.
Over the past two years, I have tried to understand where LLMs and agents actually help in software development.
I have used them since the early ChatGPT snippet days and kept going through modern agent harnesses, but I struggled to find a workflow where agents saved more time than I spent reviewing their failures.
Eventually, I found three forces that keep fighting each other:
Context management. The more context you load, the easier it is for agents to get confused. In my previous Agentic Engineering post, I argued that managing context has become the main job of humans when working with coding agents.
Human Review. Human effort has not disappeared. It moved from writing code to reviewing code written by agents. But agents are merchants of complexity, as Mario Zechner put it, so reviewing their output is rarely quick. And adversarial reviews help, but the final review still takes human time.
Waterfall risk. Agent output depends heavily on the prompt. That creates pressure to predict every detail upfront, before you actually understand the problem. The problem is that you usually discover your prediction was wrong halfway through implementation. If you are not working iteratively, you might discover it only after shipping.
Plans and Specs
Planning and specification-driven development are usually the first tools people reach for with agents: write a detailed plan, let the agent follow it, and make the result easier to review because it should already match the plan.
That works when the plan is grounded. It works much less when it is mostly prediction.
Plans are useful. Requirements matter. But writing one plan for an entire piece of software, end to end, is like paving the road before you know where you are going.
At this point you might be thinking: “That’s exactly why you have specs, to know where you are trying to go.”
I have a hard time deciding where to go for lunch, so good luck knowing exactly what software you want before exploring it.
User interactions emerge when you play with the software and break your assumptions. Features emerge when users complain that it does everything except what they actually need.
That famous “User Experience” vs “Design” meme was never really about design vs user experience for me. It was about guessing wrong where users wanted to go.
Likewise, when you ask an agent to plan a piece of software, you often get the best software development fan fiction you have ever seen. I call it Architectural Fan Fiction.
When I asked my agent to make a plan for a command line phonebook application, it wrote this:
### Phase 1: Project Setup
3. Create requirements.txt with dependencies
### Phase 2: Core Data Model
3. Create PhoneBook class with in-memory storage
4. Implement add/delete/edit/list operations with full error handling
### Phase 3: Persistence Layer
2. Implement FileStorage class with atomic writes
### Phase 4: CLI Interface
1. Set up argparse with all subcommands
2. Add input validation for all arguments
3. Implement colorized output
4. Add interactive mode with prompts
5. Support batch operations (import/export CSV)

The plan looks reasonable, but it is already making several speculative decisions:
It chooses dependencies before the rest of the software has surfaced what it needs.
It invents a PhoneBook class, a FileStorage class, and argparse commands, but it does not show why they exist or how they collaborate.
It adds colorized output, interactive mode, and batch import/export, none of which I asked for.
The problem is not that the plan is bad. The problem is that it looks precise while still being mostly assumptions.
Architectural Probes and Evolutions
Instead of planning in detail when you do not yet know what the software should look like, why not probe what it might become?
Not by building an MVP. An MVP is already real software. A probe is one step before that.
A probe is intentionally fake: stubs, placeholders, no real storage, no real business logic, no polished implementation pretending to be production code. It may pass tests because it returns 1 and the test says assert X() == 1, not because it knows how to do the math.
The goal is not to prove that the feature works. The goal is to reveal the shape of the system before the real implementation starts.
A probe previews which components need to collaborate, what their responsibilities are, how they interact, and how they need to evolve before the system can satisfy the requirements.
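To make "intentionally fake" concrete, here is a minimal sketch of what a probe for the phonebook example might look like. It is my own illustration, not the agent's output: `Contact`, `cli_list`, `list_contacts`, and the canned data are all assumed names and values. Every body is a stub, and the test passes only because the answer is hard-coded.

```python
"""Hypothetical probe for the phonebook example: every body is a stub."""
from dataclasses import dataclass


@dataclass
class Contact:
    name: str
    phone: str


class FileStorage:
    """Persistence boundary. Fake: nothing is read from or written to disk."""

    def load(self) -> list[Contact]:
        return []  # placeholder: a real version would read a file

    def save(self, contacts: list[Contact]) -> None:
        pass  # placeholder: no real storage yet


class PhoneBook:
    """Owns the phonebook logic and is the only class that talks to storage."""

    def __init__(self, storage: FileStorage) -> None:
        self.storage = storage

    def add(self, name: str, phone: str) -> Contact:
        # Fake: returns the contact as given, no duplicate check, nothing saved.
        return Contact(name, phone)

    def list_contacts(self) -> list[Contact]:
        # Fake: canned data so the CLI has something to print.
        return [Contact("Ada", "+44 20 0000 0000")]


def cli_list(book: PhoneBook) -> str:
    """CLI layer: only formats what PhoneBook returns."""
    return "\n".join(f"{c.name}: {c.phone}" for c in book.list_contacts())


def test_list_shows_contacts() -> None:
    # Passes only because PhoneBook.list_contacts is hard-coded, which is the point.
    assert cli_list(PhoneBook(FileStorage())) == "Ada: +44 20 0000 0000"
```

Nothing here works in the production sense, yet the collaboration is already reviewable: the CLI talks to `PhoneBook`, and `PhoneBook` holds the storage.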
An evolution is not just an implementation task or a TODO item.
It is a constrained architectural transformation attached to a collaboration point in the probe.
Instead of “implement feature X somewhere”, an evolution says: grow this part of the system in this direction, under these constraints, while preserving the current collaboration model.
This is what I got from my agent when I asked it to build an architectural probe for the Phonebook application using the probe-idea skill, without giving it any knowledge of the previous plan:
Compared to the plan, the probe makes two things obvious:
The plan mentioned PhoneBook, FileStorage and argparse. The probe shows how they interact: PhoneBook owns the logic, FileStorage handles persistence, and the CLI invokes PhoneBook methods.
The evolutions also make behavior explicit: add requires unique phone numbers, delete is idempotent. That is clearer than “implement add/delete/edit/list operations with full error handling”.
The full list of evolutions becomes the implementation plan.
To make that list visible, I built a small tool called probedev, which extracts evolutions from the codebase:
% probedev list --short
models.py
next EVO-020: Return list of contacts for display.
EVO-030: Implement add with duplicate checking.
EVO-040: Implement deletion with idempotent behavior.
EVO-050: Implement update logic.
phonebook.py
EVO-070: Add country code to name lookup table and extraction.
EVO-080: Implement grouped display using COUNTRY_CODES.
...

Now the plan is not floating in a document anymore. It is anchored to the probe.
If something looks wrong, I can ask the agent to change the probe and update the evolutions until the structure feels right.
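The idea of extracting evolutions from the codebase can be pictured with a short sketch. The marker syntax probedev actually uses is not shown in this post, so the `# EVO-NNN: summary` comment format, the `Evolution` dataclass, and `find_evolutions` below are assumptions made for illustration, not probedev's implementation.

```python
import re
from dataclasses import dataclass

# Assumed marker shape, e.g. "# EVO-030: Implement add with duplicate checking."
MARKER = re.compile(r"#\s*(EVO-\d+):\s*(.+)")


@dataclass
class Evolution:
    evo_id: str
    summary: str
    line: int  # 1-based line number of the marker in the scanned source


def find_evolutions(source: str) -> list[Evolution]:
    """Return every marker found in source, in file order."""
    found = []
    for number, text in enumerate(source.splitlines(), start=1):
        match = MARKER.search(text)
        if match:
            found.append(Evolution(match.group(1), match.group(2).strip(), number))
    return found


# A tiny stand-in for a probe file, with two markers next to the stubs they govern.
SAMPLE = '''class PhoneBook:
    def add(self, name, phone):
        # EVO-030: Implement add with duplicate checking.
        return None

    def delete(self, phone):
        # EVO-040: Implement deletion with idempotent behavior.
        return None
'''

for evo in find_evolutions(SAMPLE):
    print(f"{evo.evo_id} (line {evo.line}): {evo.summary}")
```

Because the list is derived from the code, changing the probe changes the plan. There is no second document to keep in sync.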
Once the structure feels right, I can delegate implementation one evolution at a time. In expanded mode, probedev points the agent to the exact file and line where each evolution belongs:
% probedev list
...
phonebook.py
EVO-070 Add country code to name lookup table and extraction.
./phonebook.py:47
Why: Need to map country code prefixes (e.g., +1, +44) to country names
for grouping contacts by country in the list command.
Done: COUNTRY_CODES dictionary exists with all common prefix mappings,
extract_country_code() parses numbers correctly.
Non-Goals: Do not add phone number validation or formatting in this step.
EVO-080 Implement grouped display using COUNTRY_CODES.
./phonebook.py:60
Why: Core feature - list command must show contacts organized by country.
Done: Contacts are fetched from PhoneBook, grouped by country code,
and displayed with country name headers.
Non-Goals: Do not add sorting, filtering, or pagination in this step.

Why it matters
Effective context management
Evolution markers live close to the code they are supposed to change.
The agent does not have to load the whole codebase, reconstruct the architecture, read a separate plan, and guess where the next step applies. The work is already anchored.
Instead of saying “implement contact listing” in a separate plan, the probe says: this method will evolve, this is why it exists, this is what done means, and this is out of scope.
Better task boundaries
A good evolution marker is a small contract.
It says what needs to change, why it needs to change, when the change is complete, and what the agent should not touch.
That gives the agent a smaller and safer working area. A simple feature is less likely to become a broad rewrite, and fake parts are less likely to be “improved” too early.
Non-goals are not documentation noise. They are guardrails.
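One way to picture the contract is as a small record with the same fields the expanded probedev listing shows: Why, Done, and Non-Goals. The `EvolutionContract` class and its `as_task_brief` method are hypothetical names for this sketch, and the values are copied from the EVO-080 entry above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionContract:
    evo_id: str
    summary: str    # what needs to change
    location: str   # where the marker lives
    why: str        # why the change is needed
    done: str       # when the change counts as complete
    non_goals: str  # what the agent should not touch

    def as_task_brief(self) -> str:
        """Render the contract as a short, self-contained task description."""
        return "\n".join([
            f"{self.evo_id} {self.summary}",
            f"Where: {self.location}",
            f"Why: {self.why}",
            f"Done when: {self.done}",
            f"Out of scope: {self.non_goals}",
        ])


evo_080 = EvolutionContract(
    evo_id="EVO-080",
    summary="Implement grouped display using COUNTRY_CODES.",
    location="./phonebook.py:60",
    why="Core feature - list command must show contacts organized by country.",
    done=(
        "Contacts are fetched from PhoneBook, grouped by country code, "
        "and displayed with country name headers."
    ),
    non_goals="Do not add sorting, filtering, or pagination in this step.",
)

print(evo_080.as_task_brief())
```

The brief is five lines long, and that is roughly all the agent needs to load for this step.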
Promotes encapsulation
When evolutions are attached to specific entities, methods, or collaboration points, the agent is pushed to implement the feature where it belongs.
If adding a contact is an evolution of PhoneBook.add, the agent has a clear entry point. It does not need to invent a new service, bypass the storage layer, or spread behavior across unrelated files.
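As a rough sketch of what that can look like once the evolutions for add and delete are done, the behavior lands inside `PhoneBook` and still goes through the storage collaborator. `InMemoryStorage` and `DuplicatePhoneError` are hypothetical names I use here so the example runs without touching disk; the real probe's storage class and error handling may differ.

```python
class DuplicatePhoneError(ValueError):
    """Raised when a phone number is already in the phonebook (hypothetical name)."""


class InMemoryStorage:
    """Stand-in for the storage layer so the sketch runs without touching disk."""

    def __init__(self) -> None:
        self._contacts: dict[str, str] = {}  # phone -> name

    def load(self) -> dict[str, str]:
        return dict(self._contacts)

    def save(self, contacts: dict[str, str]) -> None:
        self._contacts = dict(contacts)


class PhoneBook:
    """The entry point the evolutions are attached to."""

    def __init__(self, storage: InMemoryStorage) -> None:
        self.storage = storage

    def add(self, name: str, phone: str) -> None:
        # Evolved in place: add requires unique phone numbers.
        contacts = self.storage.load()
        if phone in contacts:
            raise DuplicatePhoneError(f"{phone} is already in the phonebook")
        contacts[phone] = name
        self.storage.save(contacts)

    def delete(self, phone: str) -> None:
        # Evolved in place: deleting a missing number is a no-op (idempotent).
        contacts = self.storage.load()
        contacts.pop(phone, None)
        self.storage.save(contacts)
```

The public surface is the same as in the probe. Only the bodies grew, so callers such as the CLI do not need to change.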
Architecture survives implementation
Agents can satisfy the immediate request while slowly destroying the intended structure of the system.
The probe makes that structure visible before implementation starts. Evolution markers then keep each step inside it.
The agent is not just asked to “make the test pass”. It is asked to preserve the collaboration model exposed by the probe.
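If you want a mechanical check on top of review, one possible approach (my own assumption, not something probedev provides) is a small test that fails when a module imports across a boundary the probe did not allow. The module name `storage`, the sample CLI source, and both helper functions below are hypothetical.

```python
import ast

# Hypothetical rule: the CLI may use PhoneBook but must not import storage directly.
FORBIDDEN_FOR_CLI = {"storage"}


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported anywhere in source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules


def cli_respects_boundaries(source: str) -> bool:
    """True when the CLI source imports none of the forbidden modules."""
    return not (imported_modules(source) & FORBIDDEN_FOR_CLI)


# Stand-in for the CLI file; in practice the source would be read from disk.
CLI_SOURCE = '''
import argparse
from models import PhoneBook
'''

print(cli_respects_boundaries(CLI_SOURCE))
```

A check like this only catches import-level violations. It complements the evolution markers and does not replace reading the diff.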
Easier human review
It is easier to review code than to review a speculative plan.
A plan asks you to imagine whether a future implementation will make sense. A probe lets you look at the shape of the system directly: classes, boundaries, dependencies, names, and direction of growth.
You are no longer asking “does this plan sound reasonable?”. You are asking “is this the kind of software we want to build?”.
Less planning theater
Detailed plans often look more precise than they are.
They create the impression that uncertainty has been removed, when it has only moved into a document. The hard questions are still there; you just find them later, during implementation.
A probe does the opposite. It exposes uncertainty early. If the architecture feels awkward when everything is fake, it will probably feel worse when the real logic arrives.
Probedev
To try Probe Driven Development, start from the probedev repository:
the Probe Driven Development definition, which explains what an architectural probe is and defines the glossary
the probe-idea skill, which teaches an agent how to create an architectural probe from an idea
the probedev CLI, which extracts and manages evolutions as the probe grows into real software




