Data quality everywhere

This is a guest blog by Jean Michel Franco, Talend

Data quality follows the same principles as other well-defined, quality-related processes. It is all about creating an improvement cycle to define and detect, measure, analyse, improve and control.

This should be an ongoing effort – not just a one-off. Think about big ERP, CRM or IT consolidation projects where data quality is a top priority during the roll out, and then attention fades away once the project is delivered.

A car manufacturer, for example, makes many quality checks across its manufacturing and supply chain and needs to identify the problems and root causes in the processes as early as possible. It is costly to recall a vehicle at the end of the chain, once the product has been shipped – as Toyota experienced recently when it recalled six million vehicles at an estimated cost of $600 million.

Quality should be a moving picture too. While working through the quality cycle, there is the opportunity to move upstream in the process. Take the example of General Electric, known for years as best-in-class for putting quality methodologies such as Six Sigma at the heart of its business strategy. Now it is pioneering the use of big data for the maintenance process in manufacturing. Through this initiative, it has moved beyond detecting quality defects as they happen. It is now able to predict them and do the maintenance needed in order to avoid them.

What has been experienced in the physical world of manufacturing applies in the digital world of information management as well. This means positioning data quality controls and corrections everywhere in the information supply chain. And I see six usage scenarios for this.

Six data quality scenarios

The first one is applying quality when data needs to be repurposed. This scenario is not new; it was the first principle of data quality in IT systems. Most companies adopted it in the context of their business intelligence initiatives. It consolidates data from multiple sources, typically operational systems and gets it ready for analysis. To support this scenario, data quality tools can be provided as stand-alone tools with their own data marts or, even better, tightly bundled with data integration tools.

A similar usage scenario, but “with steroids”, happens in the context of big data. In this context, the role of data quality is to add a fourth V, for Veracity, to the well-known 3 Vs defining big data; Volume, Variety and Velocity. Managing extreme Volumes mandates new approaches for processing data quality; controls have to move where the data is, rather than the opposite way. Technically speaking, this means that data quality should run natively on big data environments such as Hadoop, and leverage its native distributing processing capabilities, rather than operate on top as a separate processing engine. Variety is also an important consideration. Data may come in different forms such as files, logs, databases, documents, or data interchange formats such as XML or JSON messages. Data quality would then need to turn the “oddly” structured data often seen in big data environments into something that is more structured and can be connected to the traditional enterprise business objects, like customers, products, employees and organisations. Data quality solutions should then provide strong capabilities in terms of profiling, Parsing, standardisation, entity and resolution. These capabilities can be provided before the data is stored and designed by IT professionals. This is the traditional way to deal with data quality. Or, data preparation can be delivered on an ad-hoc basis at run time by data scientists or business users. This is sometimes referred to as data wrangling or data blending.

The third usage scenario lies in the ability to create data quality services. Data quality services allow the application of data quality controls on demand. An example could be a web site with a web form to catch customer contacts information. Instead of letting a web visitor type in any data they want in a web form, a data quality service could apply checks in real time. This then gives the opportunity of checking information such as emails, address, name of the company, phone number, etc. Even better, it can automatically identify the customer without requiring them to explicitly logon and/or provide contact information, as social networks, best-in-class websites or mobile applications such as already do.

Going back to the automotive example, this case provides a way to cut the costs of data quality. Such controls can be applied at the earliest steps of the information chain, even before erroneous data enters into the system. Marketing managers may be the best people to understand the value of such a usage scenario; they struggle with the poor quality of the contact data they get through the internet. Once it has entered into the marketing database, poor data quality becomes very costly and badly impacts key activities such as segmenting, targeting, calculating customer value. Of course, the data can be cleansed at later stages but this requires significant effort to resolve, and the related cost much higher.

Then, there is quality for data in motion. This applies to data that flows from one application to another; for example, to an order that goes from sales to finance and then to logistics. As explained in the third usage scenario, it is best practice that each system implements gatekeepers at the point of entry, in order to reject data that doesn’t match its data quality standards. Data quality then needs to be applied in real time, under the control of an Enterprise Service Bus. This fourth scenario can happen inside the enterprise and behind its firewall. Alternatively, data quality may also run on the cloud, and this is the fifth scenario.

The last scenario is data quality for Master Data Management (MDM). In this context, data is standardised into a golden record, while the MDM acts as a single point of control. Applications and business users share a common view of the data related to entities such as customers, employees, products, chart of accounts, etc. The data quality then needs to be fully embedded in the master data environment and to provide deep capabilities in terms of matching and entity resolution.

Designing data quality solutions so that they can run across these scenarios is a driver for my company. Because one of the things about our unified platform is that it generates code that can run everywhere, our data quality processing can run in any context, which we believe is a key differentiator. Data quality is delivered as a core component in all our platforms; it can be embedded into a data integration process, deployed natively in Hadoop as a Map Reduce job and be exposed as a data quality service to any application that needs to consume it in real time.

Even more importantly, data quality controls can move up into the information chain over time. Think about customer data that can be initially quality proofed in the context of a data warehouse through data integration capabilities. Then, later, through MDM, this unified customer data can be shared across applications. In this context, data stewards can learn more about the data and be alerted that they are erroneous. This will help then to identify the root cause of bad data quality, for example a web form that brings junk emails into the customer database. Data services can then come to the rescue to avoid erroneous data inputs on the web form, and reconcile this entered data with the MDM through real time matching. And, finally big data could provide an innovative approach for identity resolution so that the customer can be automatically recognised by a cookie after they opt-in, making the web form redundant.

Such a process doesn’t happen overnight. Continuous improvement is the target.