Get more out of your data mining models

Advances in both hardware and the capabilities of database management systems make data mining a more compelling proposition today.

The average large enterprise has terabytes of data on hand: customer information, supplier exchanges, and internal company records. Within this mountain of data lie the golden nuggets that can help solve business problems and propel new strategic initiatives. By putting on a miner's hat, you can better analyse the data you already have and improve your ability to increase revenues and reduce costs.

The plummeting cost of disc storage has enabled enterprises to store more and more data. Likewise, microprocessors keep getting more powerful, while advances in symmetric multiprocessing technology have removed much of the overhead that once limited data mining.

Data mining is not a magic potion or a replacement for good business analysts. Data mining doesn't just hang out on servers watching data for interesting trends, paging a database analyst with the results.

An extension of traditional statistical analysis, data mining is a process in which an organisation uses analytical tools to uncover hidden patterns and relationships in data. Those patterns can then be used to validate the predictions an organisation makes when trying to solve business problems.

Data mining has broad applicability across a large number of industries. Some enterprises use data mining to drive customer interaction.

Tom Brady, president of The Destination Group Digital, uses data mining to identify and sell properties to customers who have stayed in holiday rentals in South Beach, Florida. "We filtered our prospect file down to 7,000 target leads using data mining. We then designed a newsletter to cater to these customers."

Digging for the golden nugget

Successful data mining is best viewed as a cycle rather than a linear path. Several major steps are core to any data-mining strategy.

The first step, defining the business problem, sounds straightforward enough. However, using data-mining technology to your advantage requires that the business problem be stated as precisely as possible.

For example, a business problem stated as "the need to increase sales in the east" will yield inferior results to one stated as "the need to determine how to increase order volume for a line of fishing products in the east".

Likewise, asking, "How will offshoring company resources negatively affect the bottom line?" will net a different answer than asking, "How will offshoring company resources affect customer retention?"

David Lease, chief architect at WAM!NET Government Services, notes, "If the data-mining question is too broad, it won't work. The query needs to be narrowed and you have to have a specific goal in mind when asking business questions."

Constructing the data-mining database itself can take the bulk of the time in a data-mining process, depending on the condition and complexity of the data involved. First, you must determine the location of the data you'll need to construct the data-mining database. Is the data in one or more operational or transactional databases or already contained in a data warehouse?

Once you have identified appropriate sources, describe the data elements available from the sources you choose. You'll want to create a report that outlines the attributes of the data, for example, data type and range of possible values. Then, identify which subset of this data is needed to solve the business problem.
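
The attribute report described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `profile` function, the field names, and the sample customer records are hypothetical, and a real project would read from the operational databases or warehouse rather than an in-memory list.

```python
from collections import Counter

def profile(rows):
    """Summarise each field: observed types, missing count, numeric range, distinct values."""
    report = {}
    fields = sorted({name for row in rows for name in row})
    for name in fields:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        numeric = [v for v in present
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        report[name] = {
            "types": dict(Counter(type(v).__name__ for v in present)),
            "missing": len(values) - len(present),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "distinct": len(set(present)),
        }
    return report

# Hypothetical sample records, standing in for a real source table.
customers = [
    {"id": 1, "region": "east", "orders": 12},
    {"id": 2, "region": "east", "orders": None},
    {"id": 3, "region": "west", "orders": 7},
]

report = profile(customers)
for field, summary in report.items():
    print(field, summary)
```

A report like this makes it easier to decide which subset of fields is worth carrying into the data-mining database and which fields will need attention during cleansing.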

After subsetting the data, analysts will need to explore it for quality to determine what (if any) cleansing will be needed. Cleansing is essential for accurate data-mining results.

The cleansing process accounts for fields that might be missing data, fields that contain incorrect data, and fields with syntactical problems. You may not be able to resolve all issues with your data, but making an attempt to clean it well before mining will improve the chances for a successful outcome.
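
As a rough sketch of those three kinds of problem, the following Python handles one record at a time. The field names, the plausible age range, and the five-digit postcode format are all assumptions made for illustration; real cleansing rules depend entirely on the data and the business problem.

```python
import re

# Assumed format for illustration: a five-digit postal code.
POSTCODE = re.compile(r"\d{5}")

def clean(record):
    """Return (cleaned_record, issues) for one raw record."""
    issues = []
    cleaned = dict(record)

    # Missing data: record the gap rather than guessing a value.
    if not cleaned.get("email"):
        issues.append("missing email")
        cleaned["email"] = None

    # Incorrect data: an age outside a plausible range is treated as unknown.
    age = cleaned.get("age")
    if age is not None and not (0 < age < 120):
        issues.append("implausible age")
        cleaned["age"] = None

    # Syntactical problems: strip stray whitespace, then check the format.
    postcode = str(cleaned.get("postcode") or "").strip()
    if POSTCODE.fullmatch(postcode):
        cleaned["postcode"] = postcode
    else:
        issues.append("malformed postcode")
        cleaned["postcode"] = None

    return cleaned, issues

# Hypothetical raw record with one problem of each kind.
raw = {"email": "", "age": 430, "postcode": " 33139 "}
cleaned, issues = clean(raw)
print(cleaned)
print(issues)
```

Returning the list of issues alongside the cleaned record keeps a trail of what could not be resolved, which is useful when not every problem in the data can be fixed.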

Analysts next need to determine what metadata (if any) will be needed for mining and then define and execute a process to load the data-mining database. This process should be built to be repeatable, rather than treated as an ad hoc or one-time event, because data changes rapidly.

Once the data-mining database has been constructed, the data must be explored in preparation for modelling. Analysts will need to use Olap, data-mining exploration aids, and other tools to select variables and rows, and to create derivative variables. This initial data exploration helps determine the best type of model to use for data mining.

A model that fits

Several different types of models can be used to mine data. Initial data exploration may, at first, point toward one type of model. However, it is worth applying several different models to the business problem to find the one that yields the most reliable results.

Once a model has been constructed, verify that it is the best one for the project at hand. This is likely to require a first pass of data mining with a small subset of data from the data-mining database. Examining error rates and the mining results will provide a good indicator of whether the model will solve the business problem accurately.
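
A first-pass comparison of error rates might look like the sketch below. Everything in it is hypothetical: the two candidate rules, the 0.4 debt-to-income threshold, the ten-year tenure cut-off, and the four labelled sample rows are invented to show the mechanics, not drawn from any real model.

```python
def error_rate(model, rows):
    """Fraction of rows whose predicted label differs from the actual label."""
    wrong = sum(1 for features, label in rows if model(features) != label)
    return wrong / len(rows)

# Two hypothetical candidate models for creditworthiness.
def ratio_model(f):
    return f["debt"] / f["income"] < 0.4

def ratio_and_tenure_model(f):
    return f["debt"] / f["income"] < 0.4 or f["years_employed"] >= 10

# Small labelled subset (features, actual outcome), invented for illustration.
sample = [
    ({"income": 50, "debt": 10, "years_employed": 2}, True),
    ({"income": 40, "debt": 30, "years_employed": 1}, False),
    ({"income": 60, "debt": 30, "years_employed": 12}, True),
    ({"income": 30, "debt": 20, "years_employed": 3}, False),
]

candidates = [ratio_model, ratio_and_tenure_model]
for model in candidates:
    print(model.__name__, error_rate(model, sample))

# Keep the candidate with the lowest error rate on the subset.
best = min(candidates, key=lambda m: error_rate(m, sample))
print("selected:", best.__name__)
```

A subset this small only shows the mechanics. In practice the subset should be large enough, and separate enough from the data used to build the model, for the error rate to mean something.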

Another helpful approach is to execute the model against a small subset of live data and compare that with the results from the data-mining database. This is particularly useful when some data elements, say, interest rates, may trigger a different data-mining outcome.

Once the model has been validated and executed, you'll want to view the results and identify actions to be taken, or use the model to add more business rules to existing data sets. This could take the form of a flag, which is set when a particular data set matches the model (credit worthiness, for instance). You'll also need to consider how to maintain your model over time given changes in business and data elements.
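
Setting such a flag can be as simple as the following sketch. The `creditworthy` rule stands in for whatever validated model the project produced, and the applicant records and the 0.4 threshold are hypothetical.

```python
def flag_records(records, model, flag_name):
    """Return copies of the records with a boolean flag holding the model's verdict."""
    return [{**record, flag_name: bool(model(record))} for record in records]

# Hypothetical stand-in for a validated model.
def creditworthy(record):
    return record["debt"] / record["income"] < 0.4

# Hypothetical existing data set.
applicants = [
    {"id": 101, "income": 50, "debt": 10},
    {"id": 102, "income": 40, "debt": 30},
]

flagged = flag_records(applicants, creditworthy, "creditworthy")
for row in flagged:
    print(row)
```

Because the function returns copies rather than changing the source records, the flags can be regenerated whenever the model is revised.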

Are you sure that's mining?

Confusion exists over how data mining relates to data warehousing, data marts, and Olap. 

"Olap is all about what happened in the past, as it just shows you a view of tables you already have. Only data mining (uses data) to help you predict the future," says David Smith, product manager of Insightful.

Data mining complements technologies such as data warehouses and Olap rather than replacing them. For example, users with a data warehouse are likely to have already performed data cleansing. Extracting a subset of that data to a data mart for mining is then a fairly simple task.

Many business analysts already use Olap tools to examine data. If you use traditional query or reporting tools, you can see what your data contains. Olap tools allow analysts to go further to gain an understanding of certain data pattern outcomes.

Examining the income-versus-debt ratio to determine creditworthiness is an example of this capacity, but it requires that the analyst develop a theory and then use Olap tools to query the data to validate or invalidate it.

By contrast, data mining does not rely on a hypothesis to uncover patterns in data. The data itself is used to identify patterns that may address a business problem. Using data mining to determine creditworthiness, for example, may link income and debt, but it may also identify years of employment as a contributing factor.
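
To make the contrast concrete, the sketch below starts with no theory about which attribute matters. It tries every attribute and every threshold and keeps the single rule that misclassifies the fewest records. This exhaustive one-rule search is a deliberately simple stand-in for what data-mining tools do, not a description of any particular product, and the loan records are invented for illustration.

```python
def best_rule(rows, label_key):
    """Try every attribute and threshold; return the single rule with the fewest errors.

    The result is (errors, attribute, threshold, above_means), where
    above_means=True reads "predict a positive outcome when value >= threshold".
    """
    best = None
    attributes = [key for key in rows[0] if key != label_key]
    for attr in attributes:
        for threshold in sorted({row[attr] for row in rows}):
            for above_means in (True, False):
                errors = sum(
                    1 for row in rows
                    if ((row[attr] >= threshold) == above_means) != row[label_key]
                )
                if best is None or errors < best[0]:
                    best = (errors, attr, threshold, above_means)
    return best

# Hypothetical loan history: did the customer repay?
loans = [
    {"income": 50, "debt": 10, "years_employed": 8, "repaid": True},
    {"income": 40, "debt": 40, "years_employed": 1, "repaid": False},
    {"income": 30, "debt": 20, "years_employed": 6, "repaid": True},
    {"income": 70, "debt": 15, "years_employed": 2, "repaid": False},
    {"income": 75, "debt": 35, "years_employed": 9, "repaid": True},
    {"income": 35, "debt": 12, "years_employed": 3, "repaid": False},
]

errors, attribute, threshold, above_means = best_rule(loans, "repaid")
print(attribute, threshold, above_means, errors)
```

In this invented data set the search settles on years of employment, an attribute that an analyst focused on income and debt might never have thought to test.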

Olap can be used to help theorise the effects of the data-mining outcome (say, of creditworthiness) on the corporate bottom line. Likewise, Olap technologies can help analysts explore and better understand enterprise data before mining. In this regard, Olap and data mining can work hand in hand.

Selecting data-mining tools

There is no shortage of solutions to enable a successful data-mining strategy (see for an exhaustive list of commercial tools). You can also find many equally effective open-source products. Whichever tools you choose, it's key to implement data mining as an ongoing process.

As a core business and technology strategy, data mining can increase revenue and reduce costs, offering a competitive edge in good times or bad.

Maggie Biggs writes for InfoWorld.
