The decay problem developers need to address in AI security

AI model releases aren’t step changes, says Alan LeFort, CEO of StrongestLayer. He thinks they are points on a predictable curve and wants to explain why software application developers working in security teams need to stop reacting and start building for the next threshold.

Anthropic’s Mythos model triggered the usual reaction: Slack threads, budget questions and a scramble to assess exposure. LeFort is sure that this response misses the point. Mythos does not change the game. It confirms the schedule. AI capabilities have been improving on a steady cadence and security teams are still treating each release like a surprise.

The pattern is clear.

“GPT-4 made scalable phishing real. DeepSeek-R1 delivered strong reasoning at far lower cost and showed up in attacker tooling within days. Now Mythos. Each moment feels like a step change, but it is not. It is a curve. Every nine to fourteen months, attackers get a meaningful upgrade. Defences lag because they are built on slower cycles. Procurement, validation, and deployment stretch over months or years,” said LeFort.

By the time systems go live, their assumptions are already outdated. This is not a tooling gap; it is a mismatch in timelines.

Attackers without constraints

When a better model appears, attackers use it immediately. There is no approval chain or stability requirement. Defenders operate differently. Vendors select a model, train on known threats, validate and ship. Customers expect that system to last. The result is structural lag.

LeFort tells us that detection systems end up reasoning with outdated assumptions against threats that evolve in real time. Not every layer fails at the same rate. Content-based detection breaks first because the attacker’s output improves fastest. Behavioural signals hold up longer because people do not change how they work as quickly as models change how they write.

Most teams do not know which parts of their stack fall into which category. That is a problem in itself.

LeFort argues that better models make attacks more fluent. They do not make them grounded in reality. Attackers still infer how an organisation operates from public data and guesswork. They do not know approval flows, vendor relationships, or executive behaviour.

“That gap matters,” said LeFort. “A phishing email can read perfectly and still request something your company would never do. A fake invoice can look legitimate and still break every real payment pattern. The signal that exposes the attack does not live in the content. It lives in your systems. Identity, communication history, calendars, workflows.”

Attackers can only approximate those patterns. If a detection stack ignores that advantage, the argument goes, the organisation is choosing to compete on the attacker’s terms.
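
As a minimal sketch of that idea in Python, the check below ignores how an invoice reads and compares the request against payment records the attacker cannot see. The `PaymentRequest` type, the `KNOWN_VENDORS` table and its values are hypothetical stand-ins for an organisation's own finance data, not anything LeFort or StrongestLayer describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentRequest:
    vendor: str
    bank_account: str
    amount: float


# Hypothetical internal record: accounts and largest payment per known vendor.
KNOWN_VENDORS = {
    "Acme Supplies": {"accounts": {"GB00ACME0001"}, "max_amount": 5_000.0},
}


def context_inconsistencies(request: PaymentRequest) -> list[str]:
    """Return the ways a request departs from recorded payment patterns."""
    record = KNOWN_VENDORS.get(request.vendor)
    if record is None:
        return ["vendor has no payment history"]
    findings = []
    if request.bank_account not in record["accounts"]:
        findings.append("bank account never used for this vendor")
    if request.amount > record["max_amount"]:
        findings.append("amount exceeds largest previous payment")
    return findings
```

A request can be perfectly worded and still return two findings here, because neither check looks at the wording.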

What breaks at the next threshold?

Most teams ask whether their tools can detect the latest class of attack. That question is short-term. The next model will arrive within a year. What matters is what breaks then.

“Some parts of your stack will hold up. Others will reset. Content matching will continue to reset because it always has. Retraining helps, but it keeps you reactive. Behavioural signals last longer because underlying patterns persist. Organisational context lasts the longest because it operates on facts the attacker cannot access,” explained LeFort.

That durability comes with cost. Deep context requires integration into identity systems, workflows and records. It introduces governance and maintenance overhead. There is no clean solution. There are only tradeoffs, and those tradeoffs decay at different rates.

Security teams tend to chase better models because that is what vendors sell. That approach does not hold up. Models decay quickly. Signal access compounds.

“If your detection logic is mostly content-based, you are exposed. You are relying on the same surface the attacker controls. If your detection logic pulls from identity, behaviour, and workflow context, you gain an advantage that improves over time. Those signals get richer as your organisation operates. They do not reset when a new model ships,” said LeFort.

This is a resource decision. Large enterprises can support deeper integrations and should. Smaller teams cannot and should focus on behavioural signals that deliver value without heavy maintenance. Either way, the goal is to invest where the decay curve is slowest.

Developers, what can you maintain?

Custom models trained on last year’s data will not hold up unless they are retrained constantly. Most teams cannot support that. Deep integrations require ongoing updates as organisations change. Every approach carries a maintenance cost.

The question is not which approach is ideal, but which one fails more slowly under real constraints. Most teams never ask that question and end up overinvesting in layers that degrade quickly.

“You need to know why something was flagged. Not a score or a label. Actual reasoning. Which signals triggered the alert. What context the system used. What inconsistency it identified,” said LeFort. “If your vendor cannot explain that clearly, you cannot evaluate whether their system will survive the next capability jump. Transparency is not optional. It is how you separate real architecture from marketing.”
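
One way to picture the difference between a score and reasoning is an alert record that carries its own evidence. The following Python sketch is illustrative only; the `Signal` and `Alert` types and their field names are assumptions, not any vendor's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    name: str         # which signal fired, e.g. "new_bank_account"
    source: str       # the system it came from, e.g. "finance records"
    observation: str  # what was actually observed


@dataclass
class Alert:
    subject: str
    signals: list[Signal] = field(default_factory=list)
    inconsistency: str = ""

    def explain(self) -> str:
        """Render the signals, their context and the inconsistency as text."""
        lines = [f"Flagged: {self.subject}"]
        for s in self.signals:
            lines.append(f"- {s.name} ({s.source}): {s.observation}")
        lines.append(f"Inconsistency: {self.inconsistency}")
        return "\n".join(lines)
```

An alert that cannot populate these fields is, in effect, only a score.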

He notes that Mythos “feels like a step change” but says it is not: it is another point on a curve that still has years left. Teams that keep reacting to each release will stay behind. Teams that adapt will change how they build. They will design systems that hold up as models improve. They will invest in signals attackers cannot access. They will choose architectures based on how they age, not how they perform today.

The advice for developers: the curve does not slow down for anybody’s roadmap, so build accordingly.

CEO Deep Dive

Computer Weekly Developer Network (CWDN) held an extended deep dive Q&A with LeFort.

CWDN: What does the growing lag between attacker capability and defender tooling mean for the software delivery lifecycle?

LeFort: It points to something concrete. Once finding a vulnerability drops from months to days, annual pen-tests and compliance checkboxes stop keeping you safe. Three moves follow. First, match the clockspeed: move code review, testing and vulnerability checks from once-a-year events to a continuous habit.

Second, prioritise by risk, because you still cannot test everything at once.

Third, put AI to work on your side so you stay in sync as attackers speed up. And remember, your own code is not the only target. The libraries, packages and languages it is built on are attacked constantly, so even internal software is exposed through them. Test as often as you are tested.
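
The second move, prioritising by risk, can be sketched as a simple ranking that decides what the next continuous test run covers. The weighting below is an invented illustration, not a method LeFort proposes, and the `Component` fields and example names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    internet_facing: bool
    handles_sensitive_data: bool
    days_since_last_test: int


def risk_score(c: Component) -> int:
    """Illustrative weighting only: exposure and sensitivity scale staleness."""
    weight = 1 + 2 * int(c.internet_facing) + 2 * int(c.handles_sensitive_data)
    return weight * c.days_since_last_test


def next_to_test(components: list[Component], budget: int) -> list[str]:
    """Pick the highest-scoring components that fit in this test cycle."""
    ranked = sorted(components, key=risk_score, reverse=True)
    return [c.name for c in ranked[:budget]]


inventory = [
    Component("payments-api", True, True, 30),
    Component("internal-wiki", False, False, 100),
    Component("public-site", True, False, 90),
]
print(next_to_test(inventory, budget=2))  # ['public-site', 'payments-api']
```

Any real scoring would use an organisation's own risk model; the point is only that the queue is recomputed every cycle rather than once a year.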

CWDN: Should application security be redesigned around behavioural signals rather than content-based detection, and what does that require from developers?

LeFort: Not rather than. Both, matched to the problem. For known vulnerabilities, the ones tied to specific coding patterns, pattern matching is the right tool: fast, precise, cheap, and worth running continuously. What it cannot do is tell you whether software is being used as intended. That is the job of intent, or behavioural, signals: is this code still doing what it was built to do, and only that?

Intent catches exploitation in progress and changes that quietly shift a program’s purpose, including ones that arrive through a dependency. Defining normal sounds open-ended, so scope it: start with your two or three most sensitive services, baseline from logs you already keep, and treat a deviation you can explain as proof before expanding. That makes it a phased overlay, not a boil-the-ocean project. Pattern matching closes the gaps you can name; intent catches the misuse you could not.
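
A scoped baseline of this kind can be very small to begin with. The Python sketch below assumes logs have already been reduced to `(service, action)` pairs; that format, the function names and the example actions are assumptions made for illustration.

```python
from collections import defaultdict


def build_baseline(log_events):
    """Map each service to the set of actions seen in the baseline window."""
    baseline = defaultdict(set)
    for service, action in log_events:
        baseline[service].add(action)
    return dict(baseline)


def deviations(baseline, new_events):
    """Return events whose action was never seen for that service."""
    return [
        (service, action)
        for service, action in new_events
        if action not in baseline.get(service, set())
    ]


history = [
    ("billing", "db:read"),
    ("billing", "db:write"),
    ("auth", "ldap:query"),
]
today = [
    ("billing", "db:read"),
    ("billing", "net:connect-external"),
]
print(deviations(build_baseline(history), today))
# [('billing', 'net:connect-external')]
```

Set membership is the crudest possible definition of normal, but it is enough to produce a first deviation that someone can explain before the approach is widened to more services.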

CWDN: How do organisations build governance and traceability into software pipelines when the AI models underpinning their tools are constantly changing?

LeFort: You cannot pin the model inside a vendor’s black box, so pin everything around it, by contract and on your own side. In the contract, require model and version pinning, advance notice of changes, and the right to test before a swap reaches you.

On your side, log every decision with the model and version the vendor asserts, and keep that record where it survives the model being replaced. This rides your existing change-control process rather than standing up a parallel one. The decision has to outlive the model that made it. If a swap silently changes behaviour and you cannot trace it, that is hope, not governance.
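
A minimal sketch of such a record, assuming an append-only JSON Lines file on the buyer's side, might look like this in Python. The field names and the file layout are assumptions; the model and version values are whatever the vendor asserts, which the buyer cannot verify independently.

```python
import json
from datetime import datetime, timezone


def decision_record(item_id, verdict, model, version, signals):
    """Build one log entry tying a verdict to the asserted model and version."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "item_id": item_id,
        "verdict": verdict,
        "vendor_asserted_model": model,
        "vendor_asserted_version": version,
        "signals": list(signals),
    }


def append_record(path, record):
    """Append one JSON object per line so the log outlives any model swap."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def verdicts_by_version(path):
    """Count verdicts per (model, version) to surface behaviour changes."""
    counts = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            key = (
                rec["vendor_asserted_model"],
                rec["vendor_asserted_version"],
                rec["verdict"],
            )
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Comparing the counts before and after a version change is one simple way to notice that a swap has altered behaviour, and the stored signals give a starting point for tracing why.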

CWDN: What responsibility do software vendors have to explain why their security tools flagged something, and how should buyers be evaluating that?

LeFort: A score is not an explanation, and a scripted walkthrough is easy to rehearse. So pressure-test it. Bring your own example the vendor has never seen and make them explain it live.

Hand them a known false positive and ask why the system was wrong: a real architecture shows which signal misfired, a rehearsed one deflects. Then ask what signal it would have missed, and what it retrains on when it does. Vendors who understand their own system answer all three.

If the explanation only holds on the examples they chose, the architecture will not survive the next capability jump. Transparency is a durability indicator.