Accenture on SLMs: Beyond the benchmarks
This is a guest post for the Computer Weekly Developer Network written by Fernando Lucini, data science and machine learning engineering lead at Accenture.
Lucini writes in full as follows…
Businesses have dared to dream.
CEOs are now eager to accomplish great things with gen AI in the workplace.
If we empower people with a wealth of knowledge and creativity at their fingertips, then the potential for productivity is profound.
But while a consumer can ask AI to generate dolls of themselves, the guardrails on what a business can do that is both interesting and useful are different. In the world of AI, ensuring the accuracy and reliability of outputs is paramount.
Small Language Models (SLMs) have emerged as a practical solution for businesses, offering tailored capabilities without the extensive computational demands of larger models, or the dependency on an online service.
SLMs have real promise, but generating long-term value means unpicking their challenges. Enterprises must first work out how this powerful technology can fit safely into their world and remain within ethical, legal and safe boundaries.
Equally important are the benchmarks used to evaluate their accuracy, reasoning and overall success.
Finding the source of truth
It’s important to acknowledge the preparatory work for building a reliable and trustworthy model that generates accurate information. This is currently one of the fundamental challenges, because AI models often generate information based on patterns and data they have been trained on, without necessarily verifying the accuracy or reliability of that information.
Additionally, models undergo further reinforcement learning to adapt them to human preferences, which can mean guiding the model to de-emphasise certain patterns, effectively asking it to suppress some of what it has learned. This brings its own challenges: in trying to refine and improve the model, we introduce human subjectivity and, given the size of these models, it is difficult to ensure that these choices consistently lead to reliable, unbiased outcomes.
It is therefore essential, especially when training specialist SLMs for specific enterprise use, to prioritise curating high-quality datasets and implementing mechanisms to verify the veracity of the information they are generating.
AI models are non-deterministic by design.
They do not retrieve answers from a static database; instead, they sample from learned probability distributions that, depending on the strength of the underlying data, may or may not yield the right answer. It is critical, therefore, that businesses reduce the risk of misinformation by ensuring AI outputs are grounded in factual, high-quality data.
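To make that concrete, the sketch below shows one common grounding pattern: retrieve passages from a curated, verified document store and instruct the model to answer only from them. This is a minimal illustration, not a production design; the keyword scoring in retrieve and the injected generate callable (a stand-in for whatever locally hosted SLM a business runs) are assumptions made for the example.

```python
# Minimal grounding sketch: answer only from a curated, verified store.
# `generate` is a hypothetical stand-in for a locally hosted SLM;
# real systems would typically use embedding-based retrieval.

def retrieve(query: str, store: dict[str, str], top_k: int = 3) -> list[str]:
    """Naive keyword-overlap retrieval over curated documents."""
    terms = set(query.lower().split())
    ranked = sorted(
        store.values(),
        key=lambda text: len(terms & set(text.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]

def grounded_answer(query: str, store: dict[str, str], generate) -> str:
    """Constrain the model to verified sources to reduce misinformation."""
    sources = retrieve(query, store)
    prompt = (
        "Answer ONLY using the sources below. If they do not contain "
        "the answer, say you do not know.\n\n"
        + "\n---\n".join(sources)
        + f"\n\nQuestion: {query}"
    )
    return generate(prompt)
```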
Evaluating reasoning & trust
Another consideration is the tendency of language models to rely on “memorised correlations” rather than genuine understanding. Businesses therefore need to focus on measurable trust in outputs, emphasising language understanding and reasoning.
This involves several components, including ensuring that the AI model’s decision-making process is transparent and understandable to users, and that the model can be adjusted as needed. To measure this, businesses must have benchmarks in place and consider how well those benchmarks reflect the real-life situations in which the models will be deployed. For example, where SLMs can memorise the right answers, rather than actually understanding why they are right, this can lead to impressive performance on tests such as Stanford’s HELM benchmark.
In fact, many recent studies conclude that even when a model produces the correct answer – and performs well against a benchmark – it has reached it through educated guesses rather than sound reasoning. This brings us to the point that while benchmarks are a valuable foundation for evaluating models, they may not fully capture the unique reasoning requirements of specific businesses.
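One illustrative way to probe for this is to perturb benchmark questions – renaming entities or rewording scenarios while preserving the underlying logic – and compare accuracy before and after. The ask_model callable and the test-case format below are assumptions made for the sketch, not part of any published benchmark.

```python
# Sketch: estimate how much of a model's benchmark score rests on
# memorised surface patterns rather than reasoning. `ask_model` is a
# hypothetical wrapper around the SLM under evaluation.

def perturb(question: str, swaps: dict[str, str]) -> str:
    """Swap surface details (names, figures) while keeping the logic."""
    for old, new in swaps.items():
        question = question.replace(old, new)
    return question

def memorisation_gap(cases: list[dict], ask_model) -> float:
    """Accuracy drop on perturbed questions; a large gap is a warning sign."""
    original = sum(ask_model(c["question"]) == c["answer"] for c in cases)
    perturbed = sum(
        ask_model(perturb(c["question"], c["swaps"])) == c["perturbed_answer"]
        for c in cases
    )
    return (original - perturbed) / len(cases)
```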
Creating custom benchmarks
Custom benchmarks, therefore, allow businesses to measure the performance of AI models against criteria that directly impact their operations. For example, a financial institution might create benchmarks that assess the model’s ability to detect fraudulent transactions, while a healthcare provider might focus on the model’s accuracy in supporting the diagnosis process for a medical professional.
Creating custom benchmarks involves several steps. Businesses must first determine the metrics most relevant to their goals and objectives, then design test scenarios that reflect the real-world applications and challenges the AI model is likely to encounter. Finally, they must evaluate the model’s performance against those benchmarks and identify areas for improvement.
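As a rough illustration of those steps, the harness below scores a model against domain-specific test cases and reports where it falls short. The TestCase fields, the model.predict interface and the 0.9 threshold are assumptions chosen for the example; a real deployment would define all three with domain experts.

```python
# Sketch of a custom benchmark harness: domain test cases, a metric
# and a report of weak spots. `model.predict` is an assumed interface.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str    # a real-world scenario the model will face
    expected: str  # the answer agreed with domain experts

def run_benchmark(model, cases: list[TestCase], threshold: float = 0.9) -> dict:
    """Evaluate against business-defined criteria, not a public leaderboard."""
    failures = [
        case for case in cases
        if model.predict(case.prompt).strip().lower() != case.expected.lower()
    ]
    accuracy = 1 - len(failures) / len(cases)
    return {
        "accuracy": accuracy,
        "passed": accuracy >= threshold,
        "areas_for_improvement": [case.prompt for case in failures],
    }

# A bank might populate `cases` with fraud scenarios, e.g.
# TestCase("Card used in three countries within an hour", "fraud"),
# while a healthcare provider would use diagnostic-support prompts.
```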
The benchmarking process is only going to become more important as businesses deploy SLMs.
Businesses can continue to dream – but there are practical steps to build the trust needed to bring SLM-powered innovation closer to reality.