As agentic code accelerates software delivery, developers risk becoming the bottleneck

Dave Colwell, VP for Artificial Intelligence and Machine Learning at Tricentis, spoke to the Computer Weekly Developer Network during SAP Sapphire in Orlando this year to examine the productivity gains made possible by AI-assisted development.

It’s true, there are real possibilities for advancement on offer and there are real wins to be had… but there’s also the question of the mess they leave behind.

Dude, where’s my workflow?

Colwell has spent the past year watching the same pattern repeat across development teams. AI coding tools arrive, output accelerates and (almost inevitably, perhaps) quality quietly degrades.

He advises us that the problem is not the models; it is the workflow nobody redesigned when the models got good.

“The nature of the work hasn’t fundamentally shifted per se,” Colwell said. “But, with some certainty, we can say that the speed of work has shifted – up, obviously.”

Colwell reminds us that software application developers are now using AI agents and pushing changes (initiating pull requests) every five or ten minutes. This clearly means that the volume of code entering review has surged.

A quality assurance jam, spreading

In turn, we know that the quality assurance (QA) teams responsible for validating the changing codebase have not scaled to match. They are, in his words, getting jammed up.

What surprises him most about what Tricentis sees in real world deployments is the defect data. Colwell had predicted that as models improved, error rates would fall. The opposite happened.

Across the Claude model lineage from Sonnet 3.5 through to the current generation, defect rates per pull request have climbed in a near-linear trend even as raw model intelligence has improved sharply. The figure he cites is stark: roughly 1.7 times more bugs per pull request compared to pre-agentic development.

But it’s important to keep this in balance. Colwell himself has pointed out that the increase in defects is partly “a result of the cultural shift” towards increased model usage across a multiplicity of deployment scenarios. As AI output accelerates, developers can feel pressure to keep pace rather than challenge what the model produces. The fear of “becoming the bottleneck” (bottlenecking may even become an industry term) in an increasingly automated workflow can discourage the sort of scrutiny that would previously have caught defects earlier.

“The better the model, the harder it is to tell when it’s wrong,” he said. “As output becomes more fluent and complete-looking, developers apply less scrutiny. The code passes surface-level checks. It ships. The problem compounds quietly underneath. It’s like ‘the fear of being the bottleneck’ is what pushes developers forward faster than they themselves know they should.”

So developers are reluctant to slow down an AI agent by scrutinising its work too carefully. The same instinct shows up beyond code, he argues, in AI-generated meeting summaries, documents, and knowledge artefacts that look thorough precisely because they are long.

“The illusion of completeness leads us to not apply the same critical rigour,” he said. “AI-generated outputs can look polished and convincing at first glance. It’s a bit like receiving a beautiful bouquet of flowers: it looks complete, but if there’s no water, a fundamental element is missing. The failure we find most instructive is not the model hallucinating. It is the model assuming. AI agents, trained to avoid asking clarifying questions, will compound a single wrong inference across an entire chain of decisions.”

Automated agentic Armageddon


Colwell describes a case from inside Tricentis where an agent, asked to build a feature on a developer’s machine, encountered a missing database connection.

Rather than surface the problem, the agent concluded the user probably wanted a database-free solution, rewrote the back-end architecture accordingly, deleted the database from the staging environment, modified the deployment pipeline to propagate that deletion through subsequent environments and updated the test suite to pass without the database present.

The developer reviewed the surface output. The tests passed. The code went to staging, where an infrastructure health check eventually caught it.

What the team (obviously) took from this was the need for a human in the loop whenever an agent holds this level of control over a working system. Agents always need sufficient and accurate context, but it’s important that we check ourselves here, as humans, and remember that we don’t always provide it. Something gets left out, the code is written without that context and problems follow.
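The incident suggests a simple rule: when a precondition fails, the agent stops and asks rather than guessing. The sketch below illustrates that rule in Python. It is not Tricentis code; the names `Environment`, `NeedsHumanDecision`, `plan_feature_build` and `run_agent` are hypothetical, invented for this example.

```python
from dataclasses import dataclass


class NeedsHumanDecision(Exception):
    """Raised when the agent hits a gap it must not fill with a guess."""


@dataclass
class Environment:
    # Hypothetical description of what the agent can see on the machine.
    name: str
    database_reachable: bool


def plan_feature_build(env: Environment) -> list[str]:
    """Return the build steps, or stop and escalate if a precondition fails."""
    if not env.database_reachable:
        # Surface the problem instead of inferring that the user wants
        # a database-free design.
        raise NeedsHumanDecision(
            f"No database connection in '{env.name}'. "
            "Fix the connection or confirm a change of architecture."
        )
    return ["write code", "run tests against database", "open pull request"]


def run_agent(env: Environment) -> str:
    """Run the plan, pausing for a human whenever the agent must not assume."""
    try:
        steps = plan_feature_build(env)
    except NeedsHumanDecision as question:
        return f"PAUSED: {question}"
    return "DONE: " + ", ".join(steps)
```

Calling `run_agent(Environment("dev-laptop", database_reachable=False))` returns a message beginning `PAUSED:` and no further step runs, so the single wrong inference never gets the chance to compound.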

“It’s like a very introverted, anxious-to-please intern,” Colwell said, “except it’s a million times faster – so you can do intern-level damage at scale.”

Separated functions, distinct agent roles

Tricentis’ response, then, was to change the job description, not the model.

Rather than asking AI to build and then check its own work, the company separated the functions into distinct agent roles: one team builds, a separate team breaks. The QA agents are not sub-agents subordinate to the developer agents. They are peers, given an explicitly adversarial brief – not to assess code coverage, but to determine whether the code did the right thing.
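A toy version of that split might look like the Python sketch below. It is an illustration under stated assumptions, not the Tricentis platform: `Change`, `builder_agent`, `breaker_agent` and `review` are hypothetical names, and the "agents" are plain functions. The point it shows is that the breaker ignores the builder’s self-reported test status and checks behaviour against the stated intent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Change:
    description: str
    implementation: Callable[[int, int], int]
    builder_tests_pass: bool  # self-reported by the builder


def builder_agent() -> Change:
    # Hypothetical builder: asked for addition, it ships a plausible-looking
    # bug (multiplication) and reports its own tests as green.
    return Change("add two numbers", lambda a, b: a * b, builder_tests_pass=True)


def breaker_agent(change: Change) -> list[str]:
    """Check behaviour against the stated intent, ignoring the builder's tests."""
    failures = []
    for a, b in [(2, 2), (2, 3), (0, 5)]:
        got = change.implementation(a, b)
        if got != a + b:
            failures.append(f"{a} + {b} returned {got}")
    return failures


def review(change: Change) -> str:
    """A change is approved only if the independent breaker finds nothing."""
    failures = breaker_agent(change)
    if failures:
        return "BLOCKED: " + "; ".join(failures)
    return "APPROVED"
```

Here `review(builder_agent())` returns `BLOCKED: 2 + 3 returned 6; 0 + 5 returned 0`. The first probe, 2 + 2, passes by coincidence, which is the kind of surface-level check that lets fluent but wrong code through.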

Colwell is candid about the limits.

“No AI QA agent catches everything,” he said. “Anyone claiming otherwise is either running a bad benchmark or training to the test. The honest goal is to chip away at the error rate – from the 1.7-times baseline toward something closer to 95 percent defect capture.”

For tech-native startups optimising for speed, that margin may be acceptable. For clients in healthcare, government, or petrochemicals, it is not… and governance tooling becomes the final layer: giving agents access to the right instruments to surface where failures are occurring and why.

Why agentic code needs context & clarity

The deeper lesson, Colwell suggests, is that, “The AI development problem is essentially a human organisation problem in disguise. Developer writes code, QA checks it, each is given a distinct job and the tools to do it. If you take any one of these problem statements and replace the words ‘AI agent’ with ‘human’ it’s a very similar scenario.”

The bottom line from Colwell and Tricentis is that what the models still need, and what remains one of the least-discussed constraints in enterprise AI deployment, is minimal clean context: not all the data, just the right data.

Software quality has to move to the front of the bus. It was previously in the middle, or at the back of the bus, meaning people started to write code or applications and only then thought about quality. But quality has to come first. The problem inside so many organisations is that they say “let’s use an LLM” and do so without some level of governance and without the ability to coordinate and manage a whole bunch of agents simultaneously. Without that ability, you’re going to get out of control very quickly.

Since the transformer architecture emerged in 2017, two things have driven AI improvement in roughly equal measure – model advancement and the discipline of giving models precisely what they need to reason with and nothing more.

The speed is not going to slow down. The question is whether the organisational structures around it can catch up before something more consequential than a deleted staging database makes it through.

Governance, reporting & auditability

Testing is a big part of software quality, but not all of it.

Part of it is regulatory requirements that we have to comply with, governance, reporting and auditability – all of that is built into the software quality process. We can’t just test, we have to prove we tested, we have to show the results of what we tested, and if we found a problem in the testing, we have to fix it and we have to prove we fixed it. So in addition to testing, software teams have to put all these processes around the testing so they can document it and move forward in the most progressive and innovative way possible.

We can track a lot of this discussion back to the company’s March launch of its unified, agentic quality engineering platform and its new AI Workspace.

By orchestrating a team of intelligent AI agents, the platform is built to allow enterprise software teams to deliver innovation while managing risk and resources. The promise from Tricentis (and it’s a big one) is a chance to fundamentally redefine how high-quality code can be tested, governed and released at the speed of AI.

Errors in even a single application can quickly cascade throughout an organisation’s connected application ecosystem, increasing downtime, introducing risks and derailing business objectives. Generic AI tools may appear smart and fast, but without a complete understanding of specific application context and critical end-to-end application connections, their results can be unreliable and risky.