Most UK companies are now heavily reliant on the data they hold in their computer systems. To work out how reliant your company is, try this quick test: Imagine that tomorrow morning you arrive at work to find that all the data held by your company has become mangled, completely and irrevocably. What would you do?
Such reliance inevitably means that data quality must be of crucial importance. So it is an odd paradox that, in my experience, most companies never carry out a formal data quality audit. This is probably because, on a daily basis, the data works. Most orders are dispatched on time and most contain the correct components, so we muddle along in the time-honoured British way. The problem is that significant staff time is wasted dealing with irate customers or tracking missing components when things go wrong because of data errors. Poor data quality does cost money.
Developing a data quality strategy
Developing a data quality strategy isn’t easy – but it can, and should, be done. What follows is a five-step process for identifying your organization’s data issues and creating a data quality framework to help you address them.
Step 1: Determine the data items that are causing problems.
In a perfect world, we take a highly structured approach to data quality. That may well involve a full data audit, inspecting every data item (EmployeeFirstName, CustomerLastName, etc.) in every transactional system. In such audits, each data item is assessed for:
- Its quality level
- The damage that poor quality data is causing
- The cost of fixing bad data
In practice, most companies only begin to take an interest in data quality when they encounter specific, and usually expensive, problems. So a less structured but more pragmatic approach is to start by identifying the data items that are actually causing problems and then fix them.
Step 2: Profile the offending data.
Data profiling is simply a way of examining the diversity of data in a given data item; many extract, transform and load (ETL) tools have built-in profilers. For numerical items, profiling will tell us the smallest and largest values, the mean, the mode, the standard deviation, the distribution and so on. For text data, it will tell us things such as the mean length, the distribution and frequency of values.
Step 3: Determine what the data should be
Profiling is easy; working out what the data should be like is more difficult. This will involve determining the acceptable domain of values – essentially a list of all acceptable values. For example, for a Gender data item, only “Male” or “Female” might be deemed acceptable, or “M” or “F” if you favour abbreviation. We can also determine the rules that can be used to verify the data (e.g., those entitled “Mr.” can only be male, whereas a “Professor” may be of either gender). This step usually involves a great deal of discussion with users of the systems to identify what constitutes acceptable data.
Step 4: Determine how to clean the data.
Again, a challenging step. What updates can be run to clean existing data? And what changes should be made to the applications to ensure that poor-quality data cannot be entered? (If the applications have been bought from a third party, such changes may be impossible to make in practice – so the only alternative may be to run cleaning processes every night.)
Step 5: Decide if the changes are worth it.
Data quality is all about return on investment (ROI). Assuming (in a rather arbitrary fashion) that your data is currently 90% clean overall, we can estimate the cost to the company of having errors in the data; this clearly decreases as data quality improves. We can also estimate how much it costs to increase the quality to 91%, 92% and so on up to 100%. The charts below are purely for illustration but give some idea of the overall pattern.
Based on some 20 years of experience, I have found that really bad data (and 10% incorrect is very bad) is almost unusable. If you are using it, then it is highly likely that it is costing you a great deal. So, cleaning up such data should be very cost-effective. As the data quality improves, the cost of the bad data decreases but there are still residual costs – for example, you need to have mechanisms or people or both in place to deal with the effects of the remaining errors. When you finally get the data quality level to 100%, the cost of bad data drops to zero, as shown on the red curve in Figure 1.
The cost of cleaning data is different. If you start with really bad data, it is relatively easy to spot the gross errors and fix them, hence the initial relatively low cost of fixing the data. As you get closer to data quality perfection, you tend to be chasing much more obscure errors, which are much harder to fix. As a result, the cost of fixing the errors increases.
Figure 1 Cost of fixing errors increases as perfection approaches
Given the cost figures, we can plot the incremental gain from cleaning the data.
Figure 2. Incremental gain decreases as data approaches 100% cleanliness
The highest point on the curve in Figure 2 represents the optimal point of data cleansing in terms of ROI. Of course, this whole scheme is complicated by the fact that the cost of fixing the data varies with different data items. In addition, the cleansing may incur a one-off cost while the savings may be yearly; if so, that will push the optimal point further to the right.
The bottom line is that data quality is an important but complex area. All too often it is avoided because it is so complicated. However, once data quality problems have a significant effect on the actual bottom line, it’s highly likely that taking action – using formal data quality strategy – will provide a significant ROI.
About the author: Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling, data warehousing and business intelligence (BI). Based in the U.K., he works as a consultant for a number of national and international companies, designing databases and BI systems. In addition to his consultancy practice, he is a well-recognized commentator on the computer world, publishing about 150,000 words a year, which appear in the form of articles, in publications such as PCW and Server Management Magazine, white papers and books.
He has written nine books on database and BI technology. The first one "Inside Relational Databases" (1997) is now in its third edition and has been translated into three other languages. The most recent is about MDX (a language for manipulating multi-dimensional data structures) and was co-written with the original architect of the language – Mosha Pasumansky. Mark has also worked as an associate with QA-IQ since 2000. He developed the company's database analysis and design course as well as its data warehousing course.