It’s time to tackle your dirty data

Enterprise Applications Editor

This is a guest blogpost by Francois Ajenstat, Chief Product Officer, Tableau

Data analysis is only as good as the quality of your data. Anyone who has ever analysed data knows that the information is rarely clean. You may find that your data is poorly structured, full of inaccuracies, or just plain incomplete. You find yourself stuck fixing the data in Excel or writing complex calculations before you can answer even the simplest of questions.

Data preparation is the process of getting data ready for analysis, including combining data sets, transforming, and filtering, and it is a crucial part of the analytics workflow. People spend 80% of their time prepping data, and only 20% of their time analysing it, according to a recent article from Harvard Business Review.

Most analysts in an organisation feel the pain of dirty data. The amount of time and energy it takes to go from disjointed data to actionable insights leads to inefficient or incomplete analyses and declining trust in data and decisions. These slower processes can ultimately lead to missed opportunities and lost revenue.

Organisations spend a lot of time building curated data sets and data warehouses to support the needs of the business. But even with these practices, it is likely that some level of dirty data will seep through the cracks of day-to-day operations. So, how can we solve the common data prep issues?

Issue #1: More data creates more problems

We live in a world where a variety of data is constantly being generated. With this windfall of data now flowing faster than many business processes can handle, organisations are struggling to keep up. We also can’t be certain about how this information will be used in the future.

Solution: Enable self-service data preparation

Visual, self-service data prep tools allow analysts to dig deeper into the data to understand its structure and see relationships between datasets. This enables the user to easily spot unexpected values that need cleaning. Although this technology brings clarity to the data, people will still need support to understand details like field definitions.
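To make "spotting unexpected values" concrete, here is a minimal sketch of the kind of column profiling a prep tool performs behind its visual interface. It is an illustration only, not how any particular product works; the function names and the sample records are hypothetical.

```python
from collections import Counter


def profile_column(rows, field):
    """Count how often each distinct value appears in one field."""
    # Treat missing keys, None and empty strings alike so gaps form one bucket.
    values = [(row.get(field) or "").strip() for row in rows]
    return Counter(values)


def rare_values(counts, max_count=1):
    """Return values seen at most max_count times, sorted for stable output."""
    return sorted(value for value, n in counts.items() if n <= max_count)


# Hypothetical sample records, not taken from any real system.
rows = [
    {"country": "UK"},
    {"country": "UK"},
    {"country": "U.K."},
    {"country": "France"},
    {"country": "France"},
    {"country": ""},
]

counts = profile_column(rows, "country")
suspects = rare_values(counts)
print(suspects)
```

Values that appear only once, such as a stray spelling or a blank, are often the ones worth a second look, although a rare value is not necessarily a wrong one.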

Issue #2: Dirty data requires a waterfall approach

Analysts report that the majority of their job is not analysis, but cleaning and reshaping data. Every time new data is received, analysts need to repeat manual data preparation tasks to adjust the structure and clean the data for analysis. This ultimately leads to wasted resources and an increased risk of human error.

Solution: Develop agile processes with the right tools to support them

Every organisation has specific needs and there is no ‘one-size-fits-all’ approach to data preparation. When selecting a self-service data preparation tool, however, organisations should consider whether the tool will move their processes towards an iterative, agile approach or simply create new barriers to entry.
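One way to picture an iterative approach is to capture each cleaning step once and replay the whole sequence whenever new data arrives, rather than repeating the work by hand. The sketch below assumes records arrive as simple dictionaries; the step names, the field names (customer_id, country) and the alias table are all hypothetical.

```python
def strip_whitespace(row):
    """Trim stray spaces from every text value."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}


def normalise_country(row):
    """Map known spelling variants onto one agreed code."""
    if "country" not in row:
        return row
    aliases = {"U.K.": "UK", "United Kingdom": "UK"}  # hypothetical alias table
    return {**row, "country": aliases.get(row["country"], row["country"])}


def drop_incomplete(row):
    """Discard records that lack an identifier by returning None."""
    return row if row.get("customer_id") else None


# The steps run in order; adding or changing a rule is a one-line edit.
PIPELINE = [strip_whitespace, normalise_country, drop_incomplete]


def prepare(rows, steps=PIPELINE):
    """Apply every step to every row, skipping rows that a step rejects."""
    cleaned = []
    for row in rows:
        for step in steps:
            row = step(row)
            if row is None:
                break
        if row is not None:
            cleaned.append(row)
    return cleaned


batch = [
    {"customer_id": "C1", "country": " U.K. "},
    {"customer_id": "", "country": "France"},
]
result = prepare(batch)
print(result)
```

Because the rules live in one list, the same preparation can be rerun on next month's file without anyone repeating the manual work, and a new rule can be added without disturbing the others.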

Issue #3: Data preparation requires deep knowledge of organisational data

Before preparing data, it is crucial to understand its location, structure, and composition, along with granular details like field definitions. Some people refer to this process as “data discovery” and it is a fundamental element of data preparation. You would not start a long journey without a basic understanding of where you’re going, and the same logic applies to data prep.

But often, because of information silos, users have limited insight into the entire data landscape of their organisation: what data exists, where it lives, and how it is defined. Confusion around data definitions can hinder analysis or, worse, lead to inaccurate analyses across the company.

Self-service data preparation tools put the power in the hands of the people who know the data best—democratising the data prep process and reducing the burden on IT.

Solution: Develop a data dictionary

One way to standardise data definitions across a company is to create a data dictionary. A data dictionary helps analysts understand how terms are used within each business application, showing which fields are relevant for analysis and which are strictly system-based.

Developing a data dictionary is no small task. Data stewards and subject matter experts need to commit to ongoing iteration, checking in as requirements change. If a dictionary is out of date, it can actually do harm to your data strategy.
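As a rough illustration, a data dictionary can start life as a small structured list that records, for each field, what it means, where it comes from, and whether it is meant for analysis. The sketch below is one possible shape, not a standard; every field name, description and source system in it is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDefinition:
    name: str
    description: str
    source_system: str
    for_analysis: bool  # False for fields that exist only for the system


# Hypothetical entries; a real dictionary would be maintained by data stewards.
DATA_DICTIONARY = {
    d.name: d
    for d in [
        FieldDefinition("order_total", "Order value in GBP, including tax", "billing", True),
        FieldDefinition("region", "Sales region assigned to the account", "crm", True),
        FieldDefinition("row_checksum", "Integrity hash used by the loader", "warehouse", False),
    ]
}


def analysis_fields(dictionary):
    """List the fields that analysts are expected to use."""
    return sorted(name for name, d in dictionary.items() if d.for_analysis)


def undocumented_fields(columns, dictionary):
    """List columns in a data set that the dictionary does not yet describe."""
    return sorted(set(columns) - set(dictionary))


usable = analysis_fields(DATA_DICTIONARY)
missing = undocumented_fields(["order_total", "region", "discount_code"], DATA_DICTIONARY)
print(usable, missing)
```

A check like undocumented_fields is one simple way to notice when the dictionary has fallen behind the data it is supposed to describe.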

Issue #4: “Clean data” is a matter of perspective

Different teams have different requirements and preferences regarding what makes “well-structured” data. For example, database administrators and data engineers prioritise how data is stored and accessed—and columns may be added that are strictly for databases to leverage, not humans.

If the information that data analysts need is not already in the data set, they may need to adjust aggregations or bring in outside sources. This can lead to silos or inaccuracies in the data.

Solution: Put the power in the hands of the data experts

Self-service data prep gives analysts the power to polish data sets in a way that matches their analysis, leading to faster, ad-hoc analyses and allowing them to answer questions as they appear. It also reduces the burden on IT to restructure the data whenever an unanticipated question arises, and it cuts down on duplicated effort because other analysts can reuse these models. If the datasets are valuable on a wide scale, you can combine them into a canonical set in the future.

Issue #5: The hidden reality of data silos

Self-service business intelligence tools have opened up data analysis capabilities to every level of user, but to get insights into their data, users still need to wait and rely on IT for well-structured data. If people are not willing to wait, they will simply extract data into Excel. This often leads to data littered with calculation errors that have not been vetted, and eventually results in inconsistent analysis. Repeating this process leads to an abundance of data silos, which are not efficient, scalable, or governed.

Solution: Create consistency and collaboration within the data prep process

Combatting silos starts with collaboration. Scheduling regular check-ins or establishing a standardised workflow for questions allows engineers to share the most up-to-date way to query and work with valid data, while empowering analysts to prepare data faster and with greater confidence.

Companies should enable employees to do the prep themselves in an easy-to-use tool whose output they can share with others. Organisations can then see what data employees use and where it is duplicated, so they can create processes that will drive consistency.
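As a simple illustration of seeing where data is duplicated, the sketch below fingerprints each data set by its content and groups the ones that match. It assumes small data sets held as lists of dictionaries, and the data set names are hypothetical; a real tool would more likely track lineage and usage than compare raw content.

```python
import hashlib


def fingerprint(rows):
    """Hash a data set's content so identical copies get identical digests."""
    # Sort keys within each row, then sort rows, so ordering does not matter.
    canonical = sorted(repr(sorted(row.items())) for row in rows)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def find_duplicates(datasets):
    """Group data set names by fingerprint; keep groups with several names."""
    groups = {}
    for name, rows in datasets.items():
        groups.setdefault(fingerprint(rows), []).append(name)
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)


# Hypothetical extracts held by different teams.
datasets = {
    "finance_extract": [{"id": 1, "total": 10}, {"id": 2, "total": 20}],
    "sales_copy": [{"id": 2, "total": 20}, {"id": 1, "total": 10}],
    "marketing_leads": [{"id": 7, "total": 5}],
}

duplicates = find_duplicates(datasets)
print(duplicates)
```

Here the finance and sales extracts hold the same records in a different order, so they are reported as one group that could be replaced by a single shared source.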