Agentic Engineering: A Better Paradigm for AI-Assisted Development

How unsupervised AI coding can slowly degrade a codebase and how Agentic Engineering turns AI into a force multiplier for software development.

Mar 06, 2026

For the past two years, I have been integrating coding agents into my development workflow: first GitHub Copilot, then OpenAI Codex and Claude CLIs. The more I worked with them, the more I started noticing patterns in how both the agents and the projects themselves evolved over time.

Eventually I reached a balance that actually increases my productivity instead of slowing me down. Interestingly, this improvement didn’t come from agents completing individual tasks faster, but from being able to parallelize work by offloading tasks to them.

Over time I came to see that this was actually a good outcome.

I have been following Simon Willison’s blog for years, partly because of my long involvement with the Python web framework community as a maintainer of TurboGears2. Over time it became clear that much of his recent writing had shifted toward a new question: how do we actually make coding agents effective?

In October 2025, Simon published an article about Vibe Engineering. While I personally prefer the term agentic engineering, the core idea resonated strongly with my own experience.

There seemed to be a general consensus that the use of coding agents exists on a spectrum, and that the most effective engineers tend to operate on the opposite side of that spectrum from what people call “vibe coding”.

But I couldn’t clearly articulate why.

So, being an engineer, I did what engineers tend to do: I started collecting data.

For each pull request generated with the help of agents I tracked a few simple metrics:

How many lines of code the PR contained
How long it took to complete the PR using an agent
How satisfied I was with the final result (Unsatisfied, Indifferent, Satisfied)
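
A minimal sketch of what such a log could look like, assuming one record per pull request. The `PullRequestRecord` and `Satisfaction` names, the `codebase` field, the hour unit, and the sample rows are all hypothetical, introduced only to illustrate the idea of grouping ratings by codebase:

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from enum import Enum


class Satisfaction(Enum):
    UNSATISFIED = "Unsatisfied"
    INDIFFERENT = "Indifferent"
    SATISFIED = "Satisfied"


@dataclass(frozen=True)
class PullRequestRecord:
    codebase: str              # hypothetical label for the project the PR targets
    lines_of_code: int         # how many lines of code the PR contained
    hours_to_complete: float   # time spent with the agent (the unit is an assumption)
    satisfaction: Satisfaction


def satisfaction_by_codebase(records):
    """Count satisfaction ratings per codebase."""
    summary = defaultdict(Counter)
    for record in records:
        summary[record.codebase][record.satisfaction] += 1
    return summary


# Made-up sample rows, only to show the shape of the log.
records = [
    PullRequestRecord("project-a", 120, 1.5, Satisfaction.SATISFIED),
    PullRequestRecord("project-a", 80, 1.0, Satisfaction.SATISFIED),
    PullRequestRecord("project-b", 450, 3.0, Satisfaction.UNSATISFIED),
    PullRequestRecord("project-b", 300, 2.5, Satisfaction.INDIFFERENT),
]

summary = satisfaction_by_codebase(records)
for codebase, counts in summary.items():
    print(codebase, {rating.value: n for rating, n in counts.items()})
```

Grouping by codebase rather than by feature is the design choice that matters here: it is what lets a per-project pattern show up at all.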

As the data accumulated, a pattern started to emerge:

The quality of AI-assisted pull requests was influenced far more by the existing codebase than by the complexity of the feature itself.

Some codebases consistently produced a lot of slop.

Some codebases consistently led to results I was genuinely happy with.

And the more I looked at those patterns, the clearer it became that there was a structural reason behind them.

The Self-Reinforcing Loop of Vibe Coding

The problem is not that agents can’t write code. It’s that, when left unsupervised, they tend to inflate the codebase: more lines, more tiny abstractions, more branches, more glue. Over time, that verbosity turns into entanglement.

I saw this most clearly in “throwaway” projects, where I didn’t care about long-term maintainability and let the agent roam free. A predictable pattern emerged:

The agent ships a verbose implementation.
The next iteration gets harder because the agent has to re-parse and reason about its own growing output.
The agent starts contradicting itself about the architecture and adds even more scaffolding to patch those contradictions.
Eventually I step in, refactor to reduce moving pieces, and the next agent iteration improves immediately.
Left unsupervised long enough, the cycle repeats.

This matches what other people have measured and documented:

SonarSource’s analysis of multiple coding models found that some models systematically generate more verbose code and that this correlates with more maintainability and code-quality issues (The Coding Personalities of Leading LLMs, and the follow-up GPT-5 update).
“More tokens in, worse performance out” isn’t just intuition. Chroma’s Context Rot report shows performance degradation as the input context grows (Context Rot).
Anthropic frames this as a core engineering problem: context isn’t free, and systems need deliberate strategies to curate it (Effective context engineering for AI agents).
And in long-horizon, multi-turn software tasks, agents can become confused and hard to debug precisely because the interaction can stretch across dozens of turns (Demystifying LLM-based Software Engineering Agents, FSE 2025).

Even worse: if this unsupervised verbosity becomes common in public repos, it’s reasonable to worry about long-term ecosystem effects, because models learn from what we publish.

A concrete example: I needed a get_packages function to collect dependencies of a software package stored in a compressed tar file. Claude’s first attempt was a classic “just add one more if” approach: two functions, multiple conditionals, and assumptions about paths.

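import re
import tarfile

# FILE is assumed to be defined elsewhere as the path to the .tar.gz archive.

def parse_requirements(content):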
    names = set()
    for line in content.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            continue
        if line.startswith("-"):
            continue
        match = re.match(r"^([a-zA-Z0-9_.-]+)", line)
        if match:
            names.add(match.group(1).lower())
    return names

def get_packages():
    names = set()
    with tarfile.open(FILE, "r:gz") as tar:
        for req_path in ["requirements.txt", "./requirements.txt"]:
            try:
                f = tar.extractfile(tar.getmember(req_path))
                if f:
                    names |= parse_requirements(f.read().decode("utf-8"))
            except KeyError:
                continue
    return names

This implementation suffered from two issues:

It’s fragile toward the location of the requirements.txt file and toward the way the path is recorded in the tar archive.
It does not work with entries pointing to URLs.

But overall it reveals a pattern I see a lot from agents: conditional branches are the default escape hatch. Got a problem? Just add one more if. Each new corner case means one more if.

Syntax Complexity is Cheap, Context Complexity is Not

The version I ended up shipping, which you could argue looks more Pythonic and less beginner-friendly, fixed both bugs and handled the corner cases without splitting the logic into multiple conditional branches.

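import io
import itertools
import pathlib
import re
import tarfile

# FILE is assumed to be defined elsewhere as the path to the tar archive.

def get_packages():
    with tarfile.open(FILE) as tar:
        # Get the lines of any requirements.txt, wherever it lives in the archive
        lines = itertools.chain.from_iterable(
            io.TextIOWrapper(tar.extractfile(fname) or io.BytesIO(b""))
            for fname in tar.getnames()
            if pathlib.PurePath(fname).name == "requirements.txt"
        )
        # Keep only the package name, dropping versions, extras, markers and URLs;
        # blank lines, comments (#) and pip options (-) are skipped
        return {
            re.split(r"[<>=~!@;\[]", package_name, maxsplit=1)[0].strip()
            for line in lines
            if (package_name := line.strip().lower())[:1] not in ("", "#", "-")
        }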

Behind the denser syntax there are only two steps to keep in mind:

Get all lines from any requirements.txt, regardless of where it lives inside the tar.
For each line, extract the package name while skipping comments and pip options.
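
One way to sanity-check those two steps is to run them against a small archive built in memory. The sketch below is an assumption-laden variant, not the shipped code: `get_packages_from` takes a file object instead of reading `FILE`, and `build_archive` plus the sample file contents are hypothetical helpers made up for the check.

```python
import io
import itertools
import pathlib
import re
import tarfile


def get_packages_from(fileobj):
    """Same two steps as above, but reading the archive from a file object."""
    with tarfile.open(fileobj=fileobj) as tar:
        # Step 1: all lines from any requirements.txt, wherever it lives.
        lines = itertools.chain.from_iterable(
            io.TextIOWrapper(tar.extractfile(name) or io.BytesIO(b""))
            for name in tar.getnames()
            if pathlib.PurePath(name).name == "requirements.txt"
        )
        # Step 2: keep only the package name, skipping blanks, comments, pip options.
        return {
            re.split(r"[<>=~!@;\[]", entry, maxsplit=1)[0].strip()
            for line in lines
            if (entry := line.strip().lower())[:1] not in ("", "#", "-")
        }


def build_archive(files):
    """Build a gzip-compressed tar in memory from {path: text} (test helper)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


# Made-up archive layout covering the two issues of the first attempt:
# a nested requirements.txt and an entry pointing to a URL.
archive = build_archive({
    "./requirements.txt": "requests>=2.0\n# a comment\n-r other.txt\n",
    "pkg/sub/requirements.txt": "Flask[async]==3.0\n\nmylib @ https://example.com/mylib.tar.gz\n",
    "README.md": "not a requirements file\n",
})
packages = get_packages_from(archive)
print(sorted(packages))
```

With this sample archive the result contains `flask`, `mylib`, and `requests`: the nested file is picked up, the URL entry is reduced to its name, and the comment and the `-r` option are ignored.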

This is the point where readability becomes tricky, because agents and humans pay different costs.

For humans, the cost is mostly syntax familiarity.
For agents, the cost is mostly context mass: how many tokens, branches, and intermediate artifacts they must keep active while reasoning about the workflow.

There’s evidence that context size itself can hurt model performance even when the relevant information is present: the EMNLP 2025 Findings paper Context Length Alone Hurts LLM Performance reports degradation tied to longer inputs, beyond just retrieval failure (Context Length Alone Hurts LLM Performance Despite…).

And in code generation specifically, recent work suggests that complexity metrics correlate with success: Sepidband et al. analyze standard complexity metrics and show which ones are predictive of LLM code correctness, then use them as feedback to improve outputs (Enhancing LLM-Based Code Generation with Complexity Metrics). In other words, the shape of the code matters to the model, not only the problem statement.
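
To make “the shape of the code” a bit more concrete, here is a rough sketch that counts decision points using Python’s `ast` module. It is not the metric set from the cited paper, only a simplified cyclomatic-style count; the `decision_points` function and the two sample snippets are hypothetical.

```python
import ast

# Node types treated as one decision point each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def decision_points(source):
    """Return 1 + the number of branching constructs found in `source`."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            count += len(node.values) - 1
        elif isinstance(node, ast.comprehension):
            # One for the loop itself, plus one per `if` filter.
            count += 1 + len(node.ifs)
    return count


branchy = '''
def names(lines):
    result = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            continue
        if line.startswith("-"):
            continue
        result.add(line)
    return result
'''

dense = '''
def names(lines):
    return {s for line in lines if (s := line.strip())[:1] not in ("", "#", "-")}
'''

print(decision_points(branchy), decision_points(dense))
```

Under these counting rules the branchy snippet scores 5 and the dense one scores 3. The absolute numbers mean little; the point is that the same behavior can present a noticeably different shape to whoever, or whatever, has to read it next.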

This also matches what we’re seeing at the ecosystem level: PRs are getting bigger as AI tools amplify output. Greptile’s internal data reports a 33% increase in median PR size (57 → 76 lines changed) over 2025 (The State of AI Coding 2025). Bigger PRs aren’t automatically worse, but they are more context for both humans and agents, and they raise the probability of “just add one more if”.

So when I say syntax complexity is cheap, I mean:

Using dense constructs (comprehensions, iterators, declarative helpers) can compress logic into fewer moving parts.
Fewer moving parts usually means fewer branches and fewer opportunities for the agent to get lost in its own output.

That’s why this version ended up being easier for agents to work with in later iterations: not because it’s shorter for its own sake, but because it’s more stable to reason about.

Encapsulation Predicts Agent Success

As discussed, agents get confused by long contexts, yet they naturally tend to produce long code, which pollutes their own context.

In practice, the failure mode is rarely “can the agent follow a function”. It’s “can the agent keep a consistent mental model of the system while jumping across files, layers, and abstractions”. The more modules, classes, and functions it must keep active, the easier it is for it to contradict itself about how the architecture works.

We see similar behavior in humans, too: try to keep too many threads in your head at once and you quickly get lost.

This is also why graph-based approaches keep popping up in recent research: they try to replace a huge raw context with a smaller, structured map of the codebase. For example, LocAgent builds a dependency graph to help agents locate the right entities through multi-hop reasoning and reports significant improvements on real-world localization benchmarks (LocAgent, ACL 2025). Similar repository-graph ideas show up in repository-level code generation work, explicitly acknowledging that repo tasks break when the model can’t reliably recover structure and dependencies (Code Graph Model, OpenReview 2025).

Once you look at it through that lens, it becomes clear why cyclomatic complexity matters less here. Cyclomatic complexity measures local branching inside a unit. But what seems to dominate success with agents is how many entities collaborate to implement a workflow, and how cleanly those responsibilities are isolated.

That’s what I started calling Transitive Collaborator Count: for a given workflow, count the distinct modules and classes you must touch, directly or indirectly, to reason about it. The higher that number, the more likely the agent is to lose the plot.
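
One possible way to compute that number, assuming you already have a map of which entity directly uses which, is a plain graph traversal. This is a sketch of one reading of the metric, not a formal definition; the function name, the dependency maps, and every class name in them are hypothetical.

```python
from collections import deque


def transitive_collaborator_count(dependencies, entry_point):
    """Count distinct modules/classes reachable from `entry_point`.

    `dependencies` maps each entity to the entities it uses directly.
    The entry point itself is not counted.
    """
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        current = queue.popleft()
        for collaborator in dependencies.get(current, ()):
            if collaborator not in seen:
                seen.add(collaborator)
                queue.append(collaborator)
    return len(seen) - 1


# Hypothetical dependency maps for the same "checkout" workflow.
layered = {
    "CheckoutController": ["CheckoutService", "RequestValidator"],
    "CheckoutService": ["CartRepository", "PricingStrategyFactory", "EventBus"],
    "PricingStrategyFactory": ["PricingStrategy", "DiscountPolicy"],
    "CartRepository": ["DatabaseSession"],
    "EventBus": ["EventSerializer"],
}
encapsulated = {
    "Checkout": ["Cart", "Pricing"],
    "Cart": ["DatabaseSession"],
}

print(transitive_collaborator_count(layered, "CheckoutController"))
print(transitive_collaborator_count(encapsulated, "Checkout"))
```

For these made-up maps the layered design yields 9 collaborators and the encapsulated one yields 3. The `seen` set also keeps the traversal from looping forever on circular dependencies.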

This is not just about style. Maintainability-focused generation research is increasingly converging on the same direction: reduce coupling, increase cohesion, enforce clear responsibility boundaries. MaintainCoder (NeurIPS 2025) frames the problem explicitly as maintainable code under evolving requirements and pushes designs that reduce coupling and improve cohesion, because that’s where naive generation tends to break (MaintainCoder, NeurIPS 2025).

So the architectural conclusion is simple:

Fewer, well-encapsulated classes with clear vertical scopes give the agent a smaller, more stable mental model.
Highly decoupled, hyper-extensible designs can be great for human teams, but for agents they often inflate collaborator count, increase cross-file reasoning, and make understanding the system much harder than writing the next line of code.

If we want agents to stay useful across many iterations, we should optimize for responsibility boundaries and low transitive collaborator count, not just for reusability knobs.

Why High Level Languages Matter More in the Age of AI

In discussions about programming languages, low level vs high level is often framed as a tradeoff between control and productivity. That’s true, but in AI-assisted development there’s a different constraint that starts to dominate surprisingly quickly:

Context is scarce.

High level languages and high level APIs act as context compression. They let you express the same intent with fewer lines, fewer moving parts, and fewer opportunities for subtle, accidental divergence. When you work with coding agents over many iterations, that difference compounds.

Less text means less context decay

Every extra line generated today becomes part of tomorrow’s problem. It either ends up directly in the prompt, or it affects what retrieval pulls in, or it simply increases the surface area an agent must keep consistent. That’s why verbosity is not a cosmetic issue: it’s an operational cost you keep paying.

Recent work on context compression exists for a reason: stuffing more context is not a free win, and longer inputs can actively degrade effectiveness. Techniques like extractive compression are being explored specifically to keep only the most relevant pieces in the working set (EXIT: Context-Aware Extractive Compression).

You don’t need a custom RAG stack to benefit from the takeaway: the less text you need to carry around to preserve meaning, the more headroom you keep for reasoning.

Expressive primitives reduce degrees of freedom

There’s another effect that matters even more than concision: high level primitives constrain the solution space.

In low level code, there are many ways to represent the same intent. That flexibility is powerful for humans who want control, but for probabilistic generators it creates a large number of valid-looking paths that are subtly wrong. High level operations tend to have narrower semantics and more predictable composition, which keeps both humans and agents closer to the intended shape of the solution.

This is exactly why constrained languages and DSLs are being explored for reliable generation. One recent paper explores the same idea from a different angle, arguing that much of the brittleness of LLM-generated code comes from the flexibility of general-purpose languages, and shows large reliability gains by generating into a constrained DSL (Anka: A Domain-Specific Language for Reliable LLM Code Generation).

You don’t have to adopt a DSL to get the benefits. Using high level libraries, declarative APIs, and domain-specific constructs is often enough to reduce the number of almost-correct implementations an agent might produce.
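
The earlier requirements.txt example shows this on a small scale. The sketch below contrasts a hand-rolled list of path spellings with a higher-level path primitive from the standard library; the function names and the archive member names are hypothetical, chosen only for illustration.

```python
from pathlib import PurePosixPath

# Low-level approach: enumerate the spellings we happened to think of.
KNOWN_SPELLINGS = ["requirements.txt", "./requirements.txt"]


def is_requirements_manual(member_name):
    return member_name in KNOWN_SPELLINGS


# Higher-level approach: ask the path abstraction for the final component.
def is_requirements(member_name):
    return PurePosixPath(member_name).name == "requirements.txt"


# Made-up archive member names (tar archives use forward slashes).
members = [
    "requirements.txt",
    "./requirements.txt",
    "pkg-1.0/requirements.txt",
    "pkg-1.0/docs/requirements.txt",
    "pkg-1.0/requirements.txt.bak",
]

manual = [name for name in members if is_requirements_manual(name)]
high_level = [name for name in members if is_requirements(name)]
print(len(manual), len(high_level))
```

Here the manual check matches 2 of the 5 names while the `PurePosixPath` check matches 4, correctly excluding only the `.bak` file. The manual version invites yet another list entry for every new layout; the primitive leaves far less room for an almost-correct variant.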

Training-data proximity improves when you stay high level

When you build solutions out of well-known high level building blocks, you increase the chance that the model has seen very similar patterns in public code. Not the same project, but the same shape of solution.

High level code also tends to force more structure into the program itself. That usually reduces ambiguity and makes later iterations more reliable.

What this changes in practice

When someone says low level languages are better, they’re often optimizing for a small local view: fine-grained control, micro performance, minimal abstraction. That can be a perfectly reasonable choice for small codebases, tight kernels, or throwaway scripts.

But if you care about software that evolves through many AI-assisted iterations, the priorities flip. You want:

semantic density
predictable composition
fewer moving parts
less context per unit of progress

High level languages and high level APIs are one of the most effective ways to get there.

Agentic engineering or vibe coding?

None of this is an argument against AI. It’s an argument for treating AI like a real engineering tool with real constraints.

Vibe coding optimizes for speed today. Agentic engineering optimizes for software that keeps improving tomorrow.

And in that world, high level languages aren’t a luxury. They’re an amplifier for both human attention and agent attention.

Fieldnotes and Thoughts by Alessandro
