It is easy to think of data quality analysis in terms of absolutes – a date of birth is, after all, either right or wrong. But in many other cases, data quality is not clear-cut and it can vary, depending on what use you are making of the data.
This difference is highlighted when we compare the use of data in transactional and analytical systems. In a transactional human resources (HR) system, for example, we tend to examine one record at a time. In the gender field we might see an entry that reads ‘Male’, in the next record ‘F’ and in a third, ‘Femail’. While we may question the spelling ability of the person who entered the data, we would not be in any doubt about the gender of the people whose records we were inspecting. We could argue that the data quality here is acceptable because it is fit for use. However, suppose we pull that data into an analytical system in order to report on the number of males and females employed. The same data is clearly not fit for this analytical purpose.
Not all men are hunks: a data quality issue!
I recently looked at the data collected by a very respectable company and performed just this analysis. We found all the usual suspects plus one I’d never seen before: ‘Hunk’. Querying revealed that all hunks were males under 30 and that all the records using the unusual terminology were entered during the period when a young female temporary secretary was in post. Draw your own conclusions here; we certainly did. Happily, an update query quickly calmed the horrified head of HR.
Data is typically collected in the transactional database systems that underpin the applications (stock control, sales and so on) that run the business. This is the data collected day by day as the business moves forward. For analytical purposes, copies of this data are often extracted from these transactional systems (also described as source systems), transformed and loaded into a data warehouse.
So, given that background, we can try to answer some common questions about bad data.
Where does bad data come from? Typically from applications that feed data into the source systems. The usual culprits are open text fields into which anything can be typed. One excellent way to fix bad data is to modify the user interface (UI) so that only valid data can be entered.
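As a minimal sketch of that idea (the field name and the list of valid values are assumptions for illustration, not taken from any real system), the entry routine simply refuses anything outside an agreed list, so free text such as ‘Femail’ or ‘Hunk’ can never reach the database:

```python
# Illustrative only: in a real UI this would be a drop-down list rather
# than free text, but the principle is the same -- the interface, not the
# user, decides what values are possible.
VALID_GENDERS = ["Male", "Female", "Not Known"]

def accept_gender(choice):
    """Return the value to store, or refuse anything outside the domain."""
    if choice not in VALID_GENDERS:
        raise ValueError(f"{choice!r} is not a valid entry for the gender field")
    return choice
```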
It is time-consuming and expensive to fix bad data. Why? It is not expensive to run update queries to fix bad data; the time and expense lie in defining exactly what is good and what is bad. To create that definition we need to work closely with users of the data to find, for example, the acceptable domain of values for every field. For instance, the users might tell us that the gender field should contain either ‘Male’ or ‘Female’. (The UI might allow ‘M’ or ‘F’ to be entered but would use a look-up table to substitute the acceptable value in the field.) Is ‘Not Known’ an acceptable entry? The users will know. A domain doesn’t have to be a list of valid entries; it can be a range of values, from zero to 100, for instance. In addition, we need to find the business rules that further define the difference between good and bad data. For example:
- All orders marked ‘Priority’ must be for more than £10,000.
- All members of staff must be over 18 years of age.
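Once the users have agreed the domains and rules, the checks themselves are straightforward to encode. A hedged sketch, in which the field names, domains and thresholds are all assumptions made for illustration:

```python
# Illustrative only: domains and rules would come from the business users,
# not from the programmer.
GENDER_DOMAIN = {"Male", "Female", "Not Known"}  # a list-style domain

def check_staff_record(record):
    """Return a list of the data quality problems found in one HR record."""
    problems = []
    if record["gender"] not in GENDER_DOMAIN:
        problems.append(f"gender {record['gender']!r} outside domain")
    # Business rule: all members of staff must be over 18.
    if record["age"] < 18:
        problems.append("staff must be over 18")
    return problems

def check_order(order):
    """Business rule: all priority orders must be for more than 10,000."""
    if order["priority"] and order["value"] <= 10_000:
        return ["priority order not over 10,000"]
    return []
```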
Do we always have to fix it? Bearing in mind that it is expensive to fix bad data and that, as discussed above, some ‘bad’ data can still be fit for purpose, we may decide to leave it alone. This can sound like heresy, but data quality is only an issue when it affects the bottom line of the business. If it costs more to fix it than to leave it alone, a simple return on investment calculation tells us to leave it. Having said all of that, if the balance of cost vs. benefit is about even, I would always lean towards fixing it. Bad data has a tendency to bite you later, so if the opportunity arises, and it isn’t too expensive, I sleep better if it is fixed.
Where is the best place to fix it? Broadly speaking, there are three approaches to consider:
1. Fix the existing data in the transactional systems.
2. Improve the applications so no more bad data can be entered.
3. Tolerate bad data in the source systems and fix it when we perform the extract, transform and load (ETL) process to move it into the data warehouse.
Armed with an understanding of what constitutes good and bad data, the first approach is to fix existing data. We can run a series of data quality analysis queries to identify bad data and run update queries to make it good. That’ll sort out the data already in the system but won’t stop more bad data arriving in future.
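This ‘find it, then fix it’ pattern can be sketched in a few lines. Here the table, column names and substitution mappings are invented purely for illustration, using an in-memory SQLite database:

```python
# A toy version of the first approach: an analysis query identifies values
# outside the agreed domain, then update queries map each bad value to its
# agreed replacement.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, gender TEXT)")
conn.executemany("INSERT INTO staff VALUES (?, ?)",
                 [("Ann", "F"), ("Bob", "Male"), ("Cal", "Hunk")])

# Data quality analysis query: which values fall outside the domain?
bad = conn.execute(
    "SELECT DISTINCT gender FROM staff "
    "WHERE gender NOT IN ('Male', 'Female')").fetchall()

# Update queries: the mapping itself comes from the users of the data.
fixes = {"F": "Female", "Hunk": "Male"}
for old, new in fixes.items():
    conn.execute("UPDATE staff SET gender = ? WHERE gender = ?", (new, old))
```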
This leads us to consider the second approach: we can rewrite the applications to keep out bad data. Realistically, this is often impossible. With older applications the source code may be lost, and with third-party software, the manufacturer may be slow to apply the necessary changes (or at worst uninterested in doing so). In some cases the best option may be to run nightly batch jobs that identify bad data and correct it. You can also train the staff who input the data to enter it more carefully.
If you decide to fix just for analytical accuracy, you can either use the methods above or write data cleansing into the ETL process. ETL is typically undertaken using a tool, and many have good built-in mechanisms for cleansing data.
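In a real project this cleansing is usually configured in the ETL tool rather than hand-written, but the underlying transform step amounts to something like the following hand-rolled sketch (the substitution table and field name are assumptions for illustration):

```python
# A hedged sketch of cleansing inside the 'T' of ETL: each row is copied
# towards the warehouse with its gender value normalised via a look-up
# table, exactly the substitution idea described in the text.
SUBSTITUTIONS = {"m": "Male", "male": "Male",
                 "f": "Female", "female": "Female", "femail": "Female"}

def transform(rows):
    """Yield cleansed copies of the extracted rows."""
    for row in rows:
        cleaned = dict(row)
        cleaned["gender"] = SUBSTITUTIONS.get(
            row["gender"].strip().lower(), "Not Known")
        yield cleaned
```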
Is there anything else we can do? It can also be worth considering data mining techniques for spotting bad data. A system can be trained to recognise the ‘shape’ of correct data and flag any significant deviations. This is relatively new technology, but such systems are already in use, either as part of the ETL process to inspect data coming in from source systems, or as part of a nightly batch job running against data in transactional systems.
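Real systems use far richer models, but a toy, frequency-based version conveys the principle: learn the ‘shape’ of correct data from a training sample, then flag values that deviate from it. The threshold and field values below are assumptions, not a recommendation:

```python
# Illustrative only: a value-frequency profile stands in for the trained
# model; anything unseen or rare relative to the training data is flagged
# for human inspection rather than silently corrected.
from collections import Counter

def train(values):
    """Learn the 'shape' of correct data as a value-frequency profile."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def flag_deviations(profile, values, threshold=0.01):
    """Flag values that are unseen or rare in the trained profile."""
    return [v for v in values if profile.get(v, 0.0) < threshold]
```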
The whole area of data quality analysis and management is a complex one, and in recent years it has been brought into sharp relief by the growing requirement for business intelligence and analytics. The ETL layer in any data warehouse system is itself complex, and there exists a school of thought (to which I subscribe) that says it is impossible to write a set of rules for cleaning data that is 100% effective. Whatever rules you write, if the input isn’t controlled – and often it can’t be – eventually new (and imaginative) forms of bad data will be typed in. This is where the use of data mining in cleansing data is so helpful. It is pattern-based, not rule-based, so it has a much better chance of finding those new variants that we will never write a rule for because we’ve never seen them before. Who would have predicted ‘Hunk’?
Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling and business intelligence (BI). Based in the UK, Whitehorn works as a consultant for a number of national and international companies and is a mentor with Solid Quality Mentors. In addition, he is a well-recognized commentator on the computer world, publishing articles, white papers and books. Whitehorn is also a senior lecturer in the School of Computing at the University of Dundee, where he teaches the master's course in BI. His academic interests include the application of BI to scientific research.