“There’s no need to clean the data – just extract it from the source systems and load it into the warehouse. Our data is already clean.”
I wonder how many times some poor extract, transform, load (ETL) consultant has heard those words, smiled diplomatically and then scanned the room for a sturdy wall to bang their head against repeatedly.
It sounds so straightforward: Pull data from the source systems – relational database, flat-file, spreadsheet, and so on – then load it into a data warehouse.
The problem is that within a transactional system, data rarely appears to be anything other than squeaky clean. This is because transactional systems typically touch just a single record or a handful of records at a time.
Contrast that to an analytical system, which aggregates vast numbers of records. Like a white carpet, when data is analysed like this, the dirt really shows.
Common data quality problems
Data within source systems is invariably of a lesser quality than expected, often due to lack of enforcement mechanisms:
Referential integrity – for example, an orders inventory with an order allocated to customer number CB3742, where that customer does not appear within the customer inventory.
Use of NULL – for example, a customer’s telephone number is given as NULL. It is unclear whether the customer has no telephone, is withholding their number, or has never been asked to give their number.
Read more on data quality
Value checking for reasonableness – for example, a customer’s date of birth is apparently 07/01/1869.
Values constrained to a pre-defined domain – for example, a user adding a record can supply a customer title of “Reverund”, instead of selecting from a list of pre-defined values such as Mr, Mrs, Miss, Ms, Dr, Rev.
Business logic – for example, customers with an AAA credit rating must have held an account for at least five years, but there is no mechanism in place to enforce this rule.
The data within source systems should be profiled for two reasons.
First, a realistic estimate can be made of the effort involved in cleansing each system’s data. The dirtier the data, the more effort – time and cost – required.
Second, for those in denial, evidence can be presented that the data is not of the expected quality.
Using tools no more exotic than SQL queries, data can be profiled within a relational source system and its quality established.
On a customer table we could run the query:
Which might yield:
The previously unseen dirt on the carpet is suddenly visible.
Another example on the customer table finds customers with unreasonable ages:
This example finds addresses where the first line is unreasonably short:
This final example finds non-existent customers:
Although data can be profiled with SQL queries, tools exist specifically for this purpose. Several suppliers produce such data profiling tools, and in the second part of this article we will look at the tool provided with SQL Server.
When extracting data from multiple sources, data conformity needs to be exercised even if the original data is clean within each individual source system.
The conformity process has three stages.
Extracting data from three sources, gender might be represented by a bit flag (1=Male, 0=Female) on the first system, a single character value of “M” or “F” on the second, and “Male” or “Female” on the third.
This data may well be clean within the separate sources, but to merge the three sources into one in the warehouse, the data must be standardised.
We might decide that gender within the warehouse will be represented as values of “Male”, “Female” and “Unknown”.
From the first system, the bits will be transformed to these values. From the second system the single character values similarly so.
The third system would be checked to ensure that only “Male” and “Female” values exist, with misspellings corrected. NULL values would be converted to “Unknown” labels within the warehouse.
Sometimes standardisation will be more difficult than this straightforward example, such as for addresses where specialised tools may be required.
2. Matching duplicate records
The next stage is to identify which standardised records from different sources relate to the same entity. These records can then be grouped together for the final stage.
Sometimes this process will be easy, such as with records of employees from different sources which all contain an employee number, but sometimes it will be much more difficult.
Survivorship is the final stage whereby matched records are merged into a single record. The best-quality values from each of the matched records are combined, resulting in a single record of the highest possible quality.
In this article we discussed common data quality misconceptions – and why ETL consultants get through a lot of Aspirin. We looked at why data is dirty, and how to see the dirt. We also talked about data conformity.
In part two, we will explore the data profiling tool provided with SQL Server, and see how it can help us identify data quality issues.
Andy Hogg is a freelance consultant specialising in Microsoft SQL Server and associated technologies.
This was first published in January 2013