Cloud data quality, integration open new vistas for managing data

Find out what opportunities the cloud could open up for data quality and integration. Advantages: scalability, reduced admin and costs. ETL, as a data quality sweet spot, lends itself to the cloud.

The cloud has been drifting around for a while and offers three fundamental advantages for IT applications: improved scalability, reduced administration and lower costs. Clearly, all of those benefits can be generally applied to cloud data quality and integration (DQI) deployments; the more interesting question is what makes DQI an especially good candidate for cloud computing.

The answer is that most of the time, we only cleanse and integrate data when we are moving it from the transactional systems that run the day-to-day business to the data warehouses that allow us to do analytical work. In most cases, that is done through extract, transform and load (ETL) processes.

ETL is typically performed as an overnight batch process -- in other words, the required servers are used very intensively but maybe only for four hours out of 24. That alone makes DQI a prime candidate for the cloud: if you keep it in-house and run your ETL jobs on dedicated systems, you spend good money on servers that do processing work only about 17% of the time. And of course, you need systems administrators for them as well. In the cloud (as long as you strike the correct deal with your service provider), you pay only for the CPU cycles and disk space that you use.
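The arithmetic behind that argument can be sketched in a few lines of Python. This is only an illustration: the function names, the server count and the hourly rates are hypothetical, and real cloud pricing rarely reduces to a single per-hour figure.

```python
# Illustrative sketch: all names and prices are assumptions, not figures from any provider.
HOURS_PER_DAY = 24


def utilisation(busy_hours: float) -> float:
    """Fraction of the day a dedicated server spends doing ETL work."""
    return busy_hours / HOURS_PER_DAY


def daily_cost_dedicated(servers: int, cost_per_server_hour: float) -> float:
    """Dedicated servers cost money around the clock, busy or idle."""
    return servers * cost_per_server_hour * HOURS_PER_DAY


def daily_cost_pay_per_use(servers: int, cost_per_server_hour: float,
                           busy_hours: float) -> float:
    """Pay-per-use capacity is billed only for the hours actually consumed."""
    return servers * cost_per_server_hour * busy_hours


etl_hours = 4  # the overnight batch window described above
print(f"Utilisation: {utilisation(etl_hours):.1%}")
print("Dedicated:  ", daily_cost_dedicated(servers=2, cost_per_server_hour=1.0))
print("Pay-per-use:", daily_cost_pay_per_use(servers=2, cost_per_server_hour=1.5,
                                             busy_hours=etl_hours))
```

Even with the hypothetical cloud rate set 50% higher per hour, the pay-per-use figure comes out well below the dedicated one, because the servers are busy for only about one-sixth of the day.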

Of course, if you currently do run ETL batch processes in-house, you may well already have strategies in place to minimise the financial damage. You may virtualise the ETL servers within a larger system to even out the workload. Or you may move some of the complex data cleansing and integration issues down to data marts (which are often idle during ETL), thus reducing the number of dedicated servers you need. Ultimately, the cloud is never the sole solution to any data management problem, but it is one of several viable options and should be considered along with the others.

Advantage: Data cleansing

Data cleansing falls into two main categories. One is applying internal business rules to data to assess and improve its quality. These rules can be simple -- checking to make sure that the employment start date for workers is at least 18 years after their date of birth, for example -- or considerably more complex.
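A simple rule of that kind might look like the following Python sketch. The function and constant names are hypothetical, and treating someone born on 29 February as reaching the threshold on 28 February is an assumption made here for illustration.

```python
from datetime import date

MIN_WORKING_AGE_YEARS = 18  # threshold taken from the example rule above


def add_years(d: date, years: int) -> date:
    """Shift a date by whole years; 29 Feb falls back to 28 Feb in non-leap years."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reached for 29 February when the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)


def start_date_is_plausible(date_of_birth: date, employment_start: date) -> bool:
    """True if employment began at least 18 years after the date of birth."""
    return employment_start >= add_years(date_of_birth, MIN_WORKING_AGE_YEARS)


# A record that passes the rule and one that should be flagged for review.
print(start_date_is_plausible(date(1980, 5, 1), date(2001, 9, 3)))
print(start_date_is_plausible(date(1980, 5, 1), date(1990, 1, 1)))
```

In an ETL job, a check like this would typically run in the transform step, with failing rows routed to an exceptions table rather than silently dropped.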

The second category is checking your existing data against external data sources, which is also ideal cloud data quality territory. For example, suppose that your company is being hampered by poor-quality address data in its customer systems. You can buy in postcode data from outside and perform lookups in-house to improve the quality of the address data. Or perhaps one of the analytical requirements of a data warehouse is that sales data be plotted against weather data. You don't collect the latter information in-house, so you purchase and import it from an external source. Alternatively, in either case, you could turn to a company offering DQI services in the cloud to provide such lookup services for you. Users taking advantage of such services would have to pay for them, of course, but consolidating the work at the service provider level means that it can be done much more cost-effectively.
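The lookup idea can be sketched as follows. Everything here is an assumption for illustration: the reference table is a tiny in-memory stand-in for a purchased postcode file (its entries have not been verified against real postcode data), and the record layout and status labels are hypothetical. A cloud DQI service would expose something similar behind its own interface.

```python
# Illustrative stand-in for an external postcode reference file (entries are assumptions).
REFERENCE_POSTCODES = {
    "DD1 4HN": "Dundee",
    "EH1 1YZ": "Edinburgh",
}


def normalise_postcode(raw: str) -> str:
    """Upper-case the postcode and put a single space before the last three characters."""
    compact = "".join(raw.split()).upper()
    return f"{compact[:-3]} {compact[-3:]}" if len(compact) > 3 else compact


def check_address(record: dict) -> dict:
    """Return a copy of the record with a cleaned postcode and a quality status."""
    postcode = normalise_postcode(record.get("postcode", ""))
    expected_town = REFERENCE_POSTCODES.get(postcode)
    if expected_town is None:
        status = "unknown_postcode"
    elif expected_town.lower() != record.get("town", "").strip().lower():
        status = "town_mismatch"
    else:
        status = "ok"
    return {**record, "postcode": postcode, "status": status}


for customer in (
    {"name": "A. Smith", "postcode": "dd14hn", "town": "Dundee"},
    {"name": "B. Jones", "postcode": "EH1 1YZ", "town": "Glasgow"},
):
    print(check_address(customer))
```

The point of the sketch is the division of labour: the reference data and the matching logic live in one place, so each customer of the service does not have to buy, load and maintain the postcode file separately.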

It is easy to see the cloud as just an extra-large server farm in the sky, and there are potential benefits in that scenario alone: being able to quickly add data processing capacity while letting somebody else keep the servers up and running and patched and protected, reducing your need for both hardware and technical staff. But it is worth looking at how your business operates to see if any internal processes lend themselves to becoming cloud-based, including data quality and integration.

About the author
Mark Whitehorn works as a consultant for national and international companies. He specializes in the areas of databases, data analysis, data modeling, data warehousing and business intelligence (BI). Professor Whitehorn also holds the chair of Analytics at the University of Dundee, where he works as an academic researcher and lecturer and runs a Masters programme in Business Intelligence.
