This article is part of our Conference Coverage: A conference guide to AWS re:Invent 2023

AWS debuts model evaluation tool in Bedrock

The Model Evaluation tool in Bedrock will help organisations evaluate and select the large language models best suited to their needs, based on criteria such as accuracy, toxicity and cost

Amazon Web Services (AWS) is making it easier for organisations to evaluate, compare and choose the large language models (LLMs) best suited to their needs through a new tool in its Amazon Bedrock service for building generative AI (GenAI) applications.

Dubbed Model Evaluation, the tool lets organisations pick the models available through Bedrock that they want to compare for a given task, select from a list of metrics, including accuracy, robustness and toxicity of the models, and upload testing datasets.
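Programmatically, an evaluation job of this kind might be assembled along the following lines. This is a hedged sketch only: the request shape, field names and task type below are illustrative assumptions based on the workflow described, not the definitive Bedrock API.

```python
# Hypothetical sketch of configuring an automatic model-evaluation job in
# Amazon Bedrock: pick a model, pick built-in metrics, point at a test dataset.
# Field names below are illustrative assumptions, not the exact API contract.

AUTOMATIC_METRICS = ["Accuracy", "Robustness", "Toxicity"]  # built-in metrics

def build_evaluation_job(job_name, model_id, dataset_s3_uri, metrics, output_s3_uri):
    """Assemble a request that evaluates one candidate model on a custom dataset."""
    return {
        "jobName": job_name,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "Summarization",  # the task the metrics apply to
                        "dataset": {
                            "name": "custom-test-set",
                            "datasetLocation": {"s3Uri": dataset_s3_uri},
                        },
                        "metricNames": metrics,
                    }
                ]
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }

job = build_evaluation_job(
    "summarisation-eval",
    "anthropic.claude-v2",                # any model available through Bedrock
    "s3://my-bucket/test-prompts.jsonl",  # hypothetical uploaded test dataset
    AUTOMATIC_METRICS,
    "s3://my-bucket/eval-results/",
)

# With AWS credentials in place, the job would be submitted roughly like this:
# import boto3
# bedrock = boto3.client("bedrock")
# bedrock.create_evaluation_job(**job)
```

Rerunning the same job definition against a newer model version, as Wood suggests, would then only require swapping the model identifier.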

“Each of those models has a different sweet spot in terms of capability – some are good at reasoning and integration while others are good at summarisation and copywriting,” said Matt Wood, vice-president of products at AWS. “But they also have different characteristics around how long you have to wait for the answer and cost.”

Wood said organisations can also perform evaluations using their own metrics for use cases involving sensitive data that they want to expose only internally, or to trusted service providers, which can also carry out manual testing in Bedrock.

“This allows customers to get a better sense of which model will work well for their use case and as new models become available, they can rerun the tests to see if new versions of the models perform better than the previous versions, so they can make better decisions around price-performance trade-offs,” he said.

Wood cited Amazon Q, AWS’s new generative AI assistant that uses multiple models to help users troubleshoot networking issues in their AWS environment and improve employee productivity, among other things, as a type of use case that Model Evaluation can help with.

“Customers will be able to mix and match to find the right model for their use case. And over time, they’re going to want to combine those models – as we do with Amazon Q – so that they are able to meet their latency, cost and capability requirements with different models,” he added.

Wood said model evaluation capabilities are also available in Amazon SageMaker, a fully managed service that helps data scientists and developers build, train and deploy machine learning models. “You can do bias checks, for example, but that’s more geared towards managing the dataset that’s used to train the models,” he added.

To help organisations manage the costs of running generative AI workloads, Wood said AWS has also added new ways to improve efficiency in SageMaker: the new HyperPod capability for model training, along with inference improvements that load multiple models onto the same instances to increase resource utilisation.
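The many-models-per-instance pattern Wood describes resembles SageMaker’s multi-model hosting, where a single container lazily serves model artefacts from a shared S3 prefix. The sketch below is illustrative only; the container image, role and S3 URIs are placeholder assumptions.

```python
# Illustrative sketch of the "many models per instance" hosting pattern,
# in the style of SageMaker multi-model endpoints: one container with
# Mode="MultiModel" serving artefacts from a shared S3 prefix. All names
# and URIs below are placeholder assumptions.

def build_multi_model_config(model_name, image_uri, model_prefix_s3, role_arn):
    """Describe one container that can load and unload many models on demand."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            "Mode": "MultiModel",            # serve many models from one instance
            "ModelDataUrl": model_prefix_s3,  # prefix holding many model.tar.gz files
        },
    }

config = build_multi_model_config(
    "shared-llm-host",
    "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-serving-image:latest",
    "s3://my-bucket/models/",
    "arn:aws:iam::123456789012:role/SageMakerRole",
)

# At invocation time the caller routes to a specific artefact, e.g.:
# runtime = boto3.client("sagemaker-runtime")
# runtime.invoke_endpoint(EndpointName="shared-llm-host",
#                         TargetModel="summariser.tar.gz", Body=payload)
```

Packing several models behind one endpoint raises utilisation because idle models consume no dedicated instances, which is the cost lever the article points to.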

Bhargs Srivathsan, partner and co-lead of McKinsey’s cloud operations and optimisation work, said at this year’s Cloud Expo Asia conference that many organisations were shocked by the bills they got for their initial GenAI experiments and wanted more control and visibility over the cost of their workloads.

“Many enterprises think they need a Lamborghini to deliver a pizza,” she said. “You probably don’t need as complex and as big a model like a 65-billion-parameter model to generate customer support scripts.

“You need to identify the right model so you can be cost-effective and efficient in generating what you really need. That’s going to make or break your business case from the get-go,” she said.
