This is a guest blogpost by Steve Neat, General Manager, EMEA, at Alation
For a long time, the overall objective of any data transformation has been to get all of an organisation’s data into one data warehouse that can provide a single source of truth. The thinking behind this was that in order to deliver trustworthy data to all employees across an organisation, it needed to be tightly governed and, to do that efficiently, it needed to all be in the same place.
But as volumes of data from an ever-greater variety of sources have exploded, and with them the demand for using data for decision making, this fabled single source of truth has become all but impossible to achieve. It’s very hard to decide what data to centralise, analyse, and use if you don’t know how to find it, and as a result many organisations are still reliant on fragmented data landscapes that are in a constant state of flux.
Is it time to abandon the quest for a single source of data truth and, if so, what can organisations do to ensure trustworthy data from multiple different sources?
Why data needs to be trustworthy
Before looking at how to ensure organisations can rely on trustworthy data, let’s first recap why it’s important for employees at every level of a business to trust the data they are using.
For starters, it’s worth mentioning that many people still see data as implicitly trustworthy regardless of its source. Among employees without proper data literacy training, this often leads to the assumption that all data can be trusted, so when underlying systems change or databases become outdated, data ends up being used erroneously to make important business decisions.
Using potentially flawed data in the decision-making process not only leads to incorrect decisions, but can also have a negative impact on future data operations. If there isn’t real clarity about where the data comes from, what its quality is and what it really means, how can employees really trust that data? And if they can’t trust it, the consequences can be serious, with executives developing a negative view of data-driven decision making and underinvesting in future data projects.
It’s a vicious data circle that can end in a business not fully realising the true value from arguably its most important asset.
Trusting data from different sources
It is crucial, therefore, that data is trusted and accurate, but ensuring data is reliable across multiple different sources is another challenge entirely. The key is giving employees a single pane of glass through which to see all of the available data. This not only provides a single point of reference for employees that allows them to search for data on a reliable platform, but also gives them access to data from a wide range of different sources such as CRM or ERP systems.
This single pane of glass often takes the form of something called a data catalog, which has become one of the most critical parts of the modern data stack. This is because a data catalog helps organisations to better understand what data they have, what it means, how they can use it and what needs to be managed and governed.
In the early days, a data catalog was largely meant for data or business analysts, providing a platform that would make them more productive. This grew to include data scientists, data governance, privacy, and compliance employees. Business and non-technical users soon followed as the importance of understanding and using data flowed through organisations in almost every industry sector. As this audience has grown, so too has the diversity and volume of information that has needed to be included in a catalog to cover each of these roles.
Analysts didn’t just want to catalog data sources; they wanted dashboards, queries, reports, and visualisations too. Data scientists wanted to go beyond database tables to data lakes, data lakehouses, and cloud data stores, and to catalog not just information sources, but models. And data engineers wanted to catalog data pipelines.
In short, the humble data catalog that started out with one clear audience has evolved to cover multiple audiences across multiple different use cases, replacing the need for a single source of truth with a single window into a complex world of trusted data for everyone.
But what is data without context?
One of the primary reasons why organisations often struggle with their data is that there is simply too much of it. The question, therefore, of getting relevant and reliable data comes down to finding a way to empower employees with the tools to search all available data and contextualise that data with information about its validity and reliability.
Given the huge volumes of data and variety of different sources, the important bit here is to avoid simply listing all of the available data, and instead to provide information on what kind of data it is. Once again, this is where a data catalog can help.
Think of a data catalog like an Argos or a VERY catalogue. If you were trying to use these retailers to find a camera, you wouldn’t feel particularly empowered to make a good choice if you were just given a list of different makes and models. Instead, you look for the specifications and the user reviews, which give important contextual information to inform a purchasing decision. Similarly, you don’t want a data catalog which just tells you where a specific bit of data is, but rather one that provides reliable information about its meaning, its popularity, any trust or quality flags that you should be aware of, who else in the organisation is using it, and so on.
This offers a window into the context of, or the metadata behind, each data item, allowing users to make decisions on the reliability of the data they are using. In a digital age when data is being created at a record pace, organisations need to understand that pushing all of this data into a single governable source is unachievable and, to an extent, ineffective.
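To make the idea of metadata-as-context a little more concrete, here is a minimal Python sketch of what a catalog entry and a search across sources might look like. It is purely illustrative: the `CatalogEntry` fields, the `search` function, the flag values and the sample entries are all hypothetical, and do not describe how any particular data catalog product works.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record: a pointer to the data plus context about it."""
    name: str
    source: str                     # e.g. "CRM" or "ERP"; the data itself stays there
    description: str                # what the data means
    query_count: int = 0            # popularity: how often the data is queried
    trust_flag: str = "unreviewed"  # assumed values: "endorsed", "deprecated", ...
    top_users: list[str] = field(default_factory=list)  # who else uses it


def search(catalog: list[CatalogEntry], term: str) -> list[CatalogEntry]:
    """Return matching entries, most popular first, with deprecated ones last."""
    term = term.lower()
    hits = [
        e for e in catalog
        if term in e.name.lower() or term in e.description.lower()
    ]
    return sorted(hits, key=lambda e: (e.trust_flag == "deprecated", -e.query_count))


# Sample entries from two different source systems (invented for illustration).
catalog = [
    CatalogEntry("customer_accounts", "CRM", "Active customer records",
                 420, "endorsed", ["sales-ops"]),
    CatalogEntry("customer_accounts_old", "ERP", "Legacy customer export",
                 900, "deprecated"),
    CatalogEntry("invoices", "ERP", "Issued invoices",
                 130, "endorsed", ["finance"]),
]

results = search(catalog, "customer")
for entry in results:
    print(f"{entry.name} [{entry.source}] "
          f"flag={entry.trust_flag} queries={entry.query_count}")
```

The point of the sketch is that the data never moves: each entry only records where the data lives and what is known about it, so an employee searching for "customer" sees the endorsed CRM table ahead of the heavily queried but deprecated legacy export.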
If we agree that keeping all the data in one place is a pipedream that isn’t worth pursuing, then we should also agree that cataloging it all in one place is both a smart and an entirely viable alternative. Data doesn’t need to come from a single source of truth; what it needs is a single source of reference where everyone can discover the data they need. The whole purpose of a data catalog is that it is for everyone.