SodaGPT pops no-code self-serve into data quality testing

Data quality company Soda has launched SodaGPT, a generative artificial intelligence (gen-AI) powered tool for data quality.

It enables a no-code self-serve approach for users of all backgrounds to naturally express and define data quality expectations.

SodaGPT combines the domain-specific language capabilities of SodaCL with the Natural Language Processing (NLP) power of gen-AI to provide a platform on which data consumers and data engineers can work together to produce data that can be trusted.

SodaGPT uses an open source Large Language Model (LLM) to translate natural-language queries written in English into production-ready data quality tests in SodaCL, the human-readable, domain-specific language for data quality.
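To give a feel for what such a translation targets, here is a minimal sketch. The SodaCL fragment in the string is an illustration of the check syntax, not actual SodaGPT output, and the dataset name `orders` and column name `customer_email` are hypothetical. The plain-Python functions below it are our own stand-ins that show what the two checks mean; they are not how Soda evaluates checks.

```python
# Illustration only: the kind of SodaCL a request such as
# "orders must not be empty and every order needs a customer email"
# might map to. Dataset and column names are hypothetical.
SODACL_SKETCH = """\
checks for orders:
  - row_count > 0
  - missing_count(customer_email) = 0
"""


def row_count(rows):
    """Number of rows in the dataset."""
    return len(rows)


def missing_count(rows, column):
    """Number of rows where the column is absent or None."""
    return sum(1 for row in rows if row.get(column) is None)


def run_checks(rows):
    """Evaluate both illustrative checks and return pass/fail per check."""
    return {
        "row_count > 0": row_count(rows) > 0,
        "missing_count(customer_email) = 0": missing_count(rows, "customer_email") == 0,
    }


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "customer_email": "a@example.com"},
        {"order_id": 2, "customer_email": None},
    ]
    for name, passed in run_checks(sample).items():
        print(f"{name}: {'pass' if passed else 'fail'}")
```

In this sample the row-count check passes and the missing-email check fails, which is the sort of result a data engineer would review before a check goes into a pipeline.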

The tool provides a simple way for data consumers to become involved in data quality management by enabling them to express and define their own data quality expectations, so that data is fit for purpose. This collaboration is intended to lessen the load on data engineers, who spend much of their time fighting data issues.

“SodaGPT is a huge step forward for the democratisation of data, providing a no-code GenAI-powered tool that ensures everyone can work with data more confidently and, as a result, make data-informed decisions and build customer experiences that are powered by reliable data,” said Maarten Masschelein, CEO, Soda. “LLMs trained on terabytes of data are one of the many trends reshaping our world and transforming the way we interact with information – the challenge is finding the tools to extract value from that data. With SodaGPT, we are ripping up the antiquated approach to data quality checks built exclusively for a technology audience that can read and write in SQL, simplifying the process for data consumers in order to free up data engineers to focus on building new data products.”

The introduction of a new self-serve ‘contribution’ model is said to empower data consumers to express, contribute and then collaborate on data quality expectations that meet their own business requirements.

Natural language code ‘contributions’

Natural language code ‘contributions’, made using SodaGPT and automatically translated into SodaCL, facilitate collaboration between data consumers, who can now define data quality expectations in their own words, and data engineers, who provide critical human oversight to ensure that checks are correctly defined before being embedded into the data pipeline.

Soda claims that SodaGPT ‘shifts left’ the management of data quality and enables data to be tested as early as possible in the development lifecycle to avoid issues that might impact data products or wreak havoc on the business down the line. Soda research recently found that 60% of data engineers are still spending almost half their time dealing with data issues.

With SodaGPT, problems can be caught before they enter production, creating a more robust and reliable data pipeline. That means data consumers can be more productive using data they can trust, while data engineers spend less time reactively fixing problems and more time proactively adding value back into the business.

Soda’s security-first approach to software development ensures that SodaGPT has been entirely trained using the company’s own internal LLM, with no dependency on OpenAI.

This means that proprietary data shared with the model through prompt-writing never leaves Soda’s SOC 2 Type II accredited platform, guaranteeing the same high level of internal control, systems and policy privacy and protection as all other Soda products.