LLM series - Tricentis: Continuous testing for gen-AI systems

This is a guest post for the Computer Weekly Developer Network written by David Colwell in his capacity as VP of AI & Machine Learning (ML) at Tricentis.

For any app on any infrastructure, Tricentis says its AI-augmented quality engineering platform helps developers and teams test smarter, release faster and reduce costs along the way.

Colwell writes in full as follows…

Imagine that you have inherited a wristwatch; it’s a family heirloom of significant sentimental value.

You take it to a horologist every year for maintenance and to ensure that it is kept in good working order. One day, after collecting the watch, you are surprised to find that the watchmaker has tried to thank you for being such a loyal customer by replacing the strap, crown and face. The watch has technically been improved and has fewer defects – but you now value it less.

When building systems and applications in which continuous learning is at the core, there is a similar paradox.

Regular updates are common in ChatGPT and PaLM 2, but the complex nature of neural networks means it is often impossible to quantify exactly what has changed.

What does an ‘improvement in precision while maintaining recall’ mean for your specific scenario? Do you care if the latest version is better at solving the hindsight benchmark questions? These questions are at the core of the problem of testing a system that is built around a learning model.
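To put those terms in context, precision and recall are aggregate scores computed over a labelled test set, so a headline improvement in either says nothing about whether the specific cases your application depends on still behave the same way. A generic illustration of the two measures (not Tricentis code):

```python
def precision(true_positives: int, false_positives: int) -> float:
    # Of everything the model flagged as positive, how much was actually positive?
    return true_positives / (true_positives + false_positives)

def recall(true_positives: int, false_negatives: int) -> float:
    # Of everything that was actually positive, how much did the model catch?
    return true_positives / (true_positives + false_negatives)

# An update can raise these aggregate numbers while still changing the
# answers to the particular inputs your application relies on.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75
```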

The challenge in testing LLMs

When using a system or API with known capabilities, it’s possible to document inputs and expected results.

You can make confident assertions about the results to expect because the system, while not necessarily stable, can be expected to be both deterministic and knowable in advance.
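With a conventional, deterministic API, that paradigm looks like an ordinary unit test: a fixed input and an exact expected output. A minimal illustration, using a hypothetical tax-calculation function:

```python
def calculate_vat(net_amount: float, rate: float = 0.20) -> float:
    """A deterministic function: the same input always yields the same output."""
    return round(net_amount * rate, 2)

def test_calculate_vat():
    # Because the behaviour is deterministic and knowable in advance,
    # we can assert the exact expected result.
    assert calculate_vat(100.00) == 20.00
    assert calculate_vat(19.99) == 4.00

test_calculate_vat()
```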

However, learning systems inherently break this paradigm.

This is not a bug; rather it’s a feature of their very nature. They are trained on vast datasets, but we do not expect them to get every answer right; we just expect them to improve over time. Sometimes an improvement in one area can come at the expense of another and that’s totally acceptable.

AI cannot be interrogated, blamed, or held to account; this is why it’s so important to acknowledge its limitations.

What about Generative AI?

Generative AI is no different.

Improvements in some areas can come at the cost of deteriorations in others, or sometimes the improvements may break production systems. For example, a recent paper by Lingjiao Chen et al. from Stanford University tracked ChatGPT’s ability to answer specific questions over time. The findings revealed that with improved training and exposure to humans, the LLMs significantly changed their answers to some simple mathematical questions.

When asked ‘Is 17077 a prime number?’, GPT-4 significantly degraded in its ability to reliably answer this category of question over a three-month period, whereas GPT-3.5 improved. Furthermore, when asked to generate simple code, GPT-4’s ability to produce directly executable code degraded significantly.
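The ground truth for that first question never moves; a few lines of deterministic code settle it once and for all (17077 is indeed prime), which is exactly what makes the drift in the model’s answers measurable. A quick illustrative check:

```python
def is_prime(n: int) -> bool:
    """Deterministic primality check by trial division."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

print(is_prime(17077))  # True -- the correct answer never changes, but the model's did
```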

The reason for these changes was mostly that ChatGPT altered the way it replies, but in both cases a system that relied on these learning models would be broken by the change. Interestingly, the degradation in the ability to produce executable code came about because ChatGPT was optimised for responding to humans and now offers more commentary alongside its responses.
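To see why extra commentary is enough to break a downstream system, consider a pipeline that expects the model to return nothing but runnable code. A simplified, hypothetical illustration (the response strings below are invented for the example):

```python
# A consumer that naively executes whatever the model returns.
def run_generated_code(response: str) -> None:
    exec(response)  # expects the response to be pure Python

march_response = 'print("hello world")'
june_response = (
    "Sure! Here is a simple example:\n"
    "```python\n"
    'print("hello world")\n'
    "```\n"
    "Let me know if you need anything else."
)

run_generated_code(march_response)   # works
try:
    run_generated_code(june_response)
except SyntaxError:
    # The added commentary and markdown fences are not valid Python,
    # so the same pipeline now fails even though the code inside is fine.
    print("pipeline broken by the formatting change")
```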

The paradox of knowability

This brings us to the crux of the problem: asking AI to solve problems without clearly defined solutions.

In some ways, this shows that engineering is at odds with AI: an engineer’s job is to understand a problem and then posit ways to solve it, whilst generative AI’s job is simply to answer ‘yes’ or ‘no’ to the question of whether it knows how to solve a specified problem.

The latter requires no steps or process. We test AI with a known set of answers before we release products, and we hit ‘go’ when it reaches a certain rate of correct answers. However, we cannot predict with absolute certainty what the exact answer to a question will be in advance.
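In practice, that ‘hit go’ step is usually a benchmark gate: run the model over a labelled evaluation set and only release if the pass rate clears a threshold. A minimal sketch, with a hypothetical ask_model function standing in for whatever the real system calls:

```python
from typing import Callable, List, Tuple

def passes_release_gate(
    ask_model: Callable[[str], str],
    eval_set: List[Tuple[str, str]],   # (question, known correct answer) pairs
    threshold: float = 0.95,
) -> bool:
    """Release only if the model answers enough known questions correctly."""
    correct = sum(1 for question, expected in eval_set
                  if ask_model(question).strip() == expected)
    return correct / len(eval_set) >= threshold

# Example usage with a stubbed model:
eval_set = [("Is 17077 a prime number?", "Yes"), ("What is 2 + 2?", "4")]
print(passes_release_gate(lambda q: "Yes" if "prime" in q else "4", eval_set))  # True
```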

This, combined with a shift in the system’s underpinning, opens the door to unexpected behaviour and restricts the ability to detect it using traditional testing techniques. In short, if you don’t know the real answer, how do you know when the AI starts getting it wrong? And if you do know the correct answer, then why are you using these networks?

Using anchors to detect drift

Imagine you are in a boat in the middle of the ocean with no land in sight. You know the currents are taking you somewhere, but you’re not sure which direction. So how do you detect which way you’re drifting, or if you’re drifting at all? A simple way is to drop an anchor and see which way the boat moves.


In the same way, a testing approach can shift away from a prior ‘known to be correct’ answer and towards a ‘this worked in the past’ baseline. In the boat example, the anchor lets us identify the drift, which allows us to make informed decisions about how to respond to it. This is the shift in testing mentality required to adapt to a generative AI future.

We need to accommodate a form of testing that is closer to monitoring.

This is a shift in approach from validating everything ahead of deployment towards selectively tracking results in production to identify change over time.

To monitor drift effectively, you need two categories of tests: critical scenarios where a change in the results is highly likely to cause significant breakage, and scenarios that are important enough to monitor even if you are unsure what type of breakage might occur. Once these scenarios have been identified, they need to be ‘anchored’. This consists of recording the value used as input and the result produced as output.
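What anchoring those scenarios might look like in code is sketched below; the query_model callable and the similarity threshold are assumptions for illustration, and a real tool might compare structured fields or use semantic similarity rather than simple string matching:

```python
import difflib
import json
from typing import Callable, Dict

# Anchors: inputs we care about, paired with the response that worked in the past.
anchors: Dict[str, str] = {
    "Summarise the refund policy in one sentence.":
        "Customers can return items within 30 days for a full refund.",
    "Is 17077 a prime number?":
        "Yes",
}

def similarity(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1]; real tools may use semantic comparison."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def check_for_drift(query_model: Callable[[str], str],
                    drift_threshold: float = 0.8) -> None:
    """Re-run each anchored input against the live model and flag divergence."""
    for prompt, anchored_response in anchors.items():
        current = query_model(prompt)
        score = similarity(anchored_response, current)
        if score < drift_threshold:
            # The model has drifted from behaviour we previously relied on;
            # a human (or a stricter check) decides whether that matters.
            print(json.dumps({"prompt": prompt, "similarity": round(score, 2),
                              "anchored": anchored_response, "current": current}))

# Example: run on a schedule against production, not just before release.
check_for_drift(lambda prompt: "Yes" if "prime" in prompt else "Returns accepted for 30 days.")
```

Run on a schedule against the production model, a check like this turns testing into the monitoring described above: it does not say whether the new answer is right, only that it has changed from the answer we previously relied on.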

Building and testing applications with generative AI is an emerging discipline, expected to continue to rapidly evolve and change. The world of versioning neural networks and published ‘changelogs’ is an area of intense research.

However, until we can get to a more deterministic answer, techniques such as response anchoring, continuous monitoring, service virtualisation and test automation can provide us with high confidence in the stability of our applications in production.

 
