Advances in both hardware and the capabilities of
database management systems make data mining a more compelling
proposition today.
The average large enterprise has terabytes of data on hand:
customer information, supplier exchanges, and internal company
records. Within this mountain of data lie the golden nuggets that
can help solve business problems and propel new strategic
initiatives. By putting on a miner's hat, you can better analyse
the data you already have on hand and enrich your ability to
increase revenues and reduce costs.
The plummeting cost of disc storage has enabled enterprises to
store more and more data. Likewise, microprocessors keep getting
more powerful, while advances in symmetric multiprocessing
technology have removed much of the overhead that once limited data
mining.
Data mining is not a magic potion or a replacement for good
business analysts. Nor does it sit unattended on a server, watching
data for interesting trends and paging a database analyst with the
results.
An extension of traditional statistical analysis, data mining is
a process in which an organisation uses analytical tools to uncover
hidden patterns and relationships in data, which can then be used
to validate predictions made in order to solve business
problems.
Data mining has broad applicability across a large number of
industries. Some enterprises use data mining to drive customer
interaction.
Tom Brady, president of The Destination Group Digital, uses data
mining to identify customers who have stayed in holiday rentals in
South Beach, Florida, and to sell properties to them. "We filtered our
prospect file down to 7,000 target leads using data mining. We then
designed a newsletter to cater to these customers."
Digging for the golden nugget
The steps you take to mine data successfully should be viewed as
a cycle rather than as a linear path. Several major steps are core
to any data-mining strategy.
The first step, defining the business problem, sounds
straightforward enough. However, using data-mining technology to
your advantage requires that the business problem be stated as
precisely as possible.
For example, a business problem stated as "the need to increase
sales in the east" will yield inferior results to one stated as
"the need to determine how to increase order volume for a line of
fishing products in the east".
Likewise, asking, "How will offshoring company resources
negatively affect the bottom line?" will net a different answer
than asking, "How will offshoring company resources affect customer
retention?"
David Lease, chief architect at WAM!NET Government Services,
notes, "If the data-mining question is too broad, it won't work.
The query needs to be narrowed and you have to have a specific goal
in mind when asking business questions."
Constructing the data-mining database itself can take the bulk
of the time in a data-mining process, depending on the condition
and complexity of the data involved. First, you must determine the
location of the data you'll need to construct the data-mining
database. Is the data in one or more operational or transactional
databases or already contained in a data warehouse?
Once you have identified appropriate sources, describe the data
elements available from the sources you choose. You'll want to
create a report that outlines the attributes of the data, for
example, data type and range of possible values. Then, identify
which subset of this data is needed to solve the business
problem.
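A report of this kind can be sketched in a few lines of Python.
The example below is only an illustration: it assumes the source
rows have already been extracted into a list of dictionaries, and
the field names and sample records are invented.

```python
from collections import Counter

def profile(rows):
    """Summarise each field: observed types, missing count and value range."""
    fields = sorted({name for row in rows for name in row})
    report = {}
    for name in fields:
        values = [row.get(name) for row in rows]
        # Treat None and empty strings as missing.
        present = [v for v in values if v is not None and v != ""]
        numeric = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        report[name] = {
            "types": dict(Counter(type(v).__name__ for v in present)),
            "missing": len(values) - len(present),
            "min": min(numeric) if numeric else None,
            "max": max(numeric) if numeric else None,
            "distinct": len(set(present)),
        }
    return report

# Hypothetical sample records, for illustration only.
customers = [
    {"customer_id": 1, "region": "east", "order_total": 120.0},
    {"customer_id": 2, "region": "", "order_total": 75.5},
    {"customer_id": 3, "region": "west", "order_total": None},
]

customer_report = profile(customers)
for field, summary in customer_report.items():
    print(field, summary)
```

The missing counts and value ranges in such a report give an early
indication of how much cleansing the chosen subset is likely to
need.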
After subsetting the data, analysts will need to explore it for
quality to determine what (if any) cleansing will be needed.
Cleansing is essential for accurate data-mining results.
The cleansing process accounts for fields that might be missing
data, fields that contain incorrect data, and fields with
syntactical problems. You may not be able to resolve all issues
with your data, but making an attempt to clean it well before
mining will improve the chances for a successful outcome.
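The three kinds of problem described above (missing data,
incorrect data and syntactical problems) can each be checked in
code. The following is a minimal Python sketch, not a complete
cleansing routine: the field names, the plausible age range and the
five-digit postcode format are all assumptions made for the
example.

```python
import re

# Assumed format for the example: exactly five digits.
POSTCODE_PATTERN = re.compile(r"\d{5}")
REQUIRED_FIELDS = ("customer_id", "age", "postcode")

def check_record(record):
    """Return a list of the issues found in one record."""
    issues = []
    # Missing data: required fields that are absent or empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    # Incorrect data: a value outside a plausible range.
    age = record.get("age")
    if isinstance(age, int) and not 0 <= age <= 120:
        issues.append("age out of range")
    # Syntactical problems: a value that does not match the expected format.
    postcode = record.get("postcode")
    if postcode and not POSTCODE_PATTERN.fullmatch(str(postcode).strip()):
        issues.append("postcode malformed")
    return issues

def split_clean(records):
    """Separate records with no issues from those needing attention."""
    clean, rejected = [], []
    for record in records:
        issues = check_record(record)
        if issues:
            rejected.append((record, issues))
        else:
            clean.append(record)
    return clean, rejected

# Hypothetical records, for illustration only.
records = [
    {"customer_id": 1, "age": 34, "postcode": "33139"},
    {"customer_id": 2, "age": 240, "postcode": "33139"},
    {"customer_id": 3, "age": None, "postcode": "3313"},
]

clean, rejected = split_clean(records)
for record, issues in rejected:
    print(record["customer_id"], issues)
```

Keeping the rejected records alongside the reasons, rather than
silently dropping them, lets an analyst decide which issues can be
repaired and which rows must be left out.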
Analysts next need to determine what (if any) metadata will be
needed for mining and then define and execute
a process to load the data-mining database. This process should be
implemented as repeatable, rather than viewed as an ad-hoc or
one-time event, because data changes rapidly.
Once the data-mining database has been constructed, the data
must be explored in preparation for modelling. Analysts will need
to use Olap, data-mining exploration aids, and other tools to
select variables and rows, and to create derivative variables. This
initial data exploration helps determine the best type of model to
use for data mining.
A model that fits
Several different types of models can be used to mine data.
Initial data exploration may, at first, lead toward one type of
model. However, it is worth applying several different models to
the business problem to find the one that yields the
most reliable results.
Once a data model has been constructed, verify that it is the
best model for the project at hand. This is
likely to require a first pass of data mining with a small subset
of data from the data-mining database. Examining error rates and
the mining results will provide a good indicator of whether the
model will solve the business problem accurately.
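As a rough illustration of such a first pass, the Python sketch
below scores two candidate models on a small random subset of
records whose outcomes are already known. Everything in it is
hypothetical: the two rules, the field names and the records are
invented for the example, and real mining tools will report error
rates in their own way.

```python
import random

def error_rate(model, rows):
    """Fraction of rows where the prediction differs from the known outcome."""
    wrong = sum(1 for row in rows if model(row) != row["defaulted"])
    return wrong / len(rows)

# Two hypothetical candidate models, each predicting whether a customer defaults.
def debt_ratio_model(row):
    return row["debt"] / row["income"] > 0.5

def employment_model(row):
    return row["years_employed"] < 2

# Invented records standing in for the data-mining database.
history = [
    {"income": 50000, "debt": 10000, "years_employed": 6, "defaulted": False},
    {"income": 30000, "debt": 20000, "years_employed": 1, "defaulted": True},
    {"income": 40000, "debt": 25000, "years_employed": 8, "defaulted": False},
    {"income": 60000, "debt": 12000, "years_employed": 1, "defaulted": True},
    {"income": 35000, "debt": 5000, "years_employed": 3, "defaulted": False},
    {"income": 45000, "debt": 30000, "years_employed": 0, "defaulted": True},
]

# First pass: score each candidate on a small random subset.
subset = random.Random(42).sample(history, k=4)
candidates = {"debt_ratio": debt_ratio_model, "employment": employment_model}
rates = {name: error_rate(model, subset) for name, model in candidates.items()}

for name, rate in rates.items():
    print(f"{name}: error rate {rate:.2f}")
```

A subset this small only shows the mechanics. In practice the
subset must be large enough for the error rates to be meaningful.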
Another helpful approach is to execute the model against a small
subset of live data and compare that with the results from the
data-mining database. This is particularly useful when some data
elements, say, interest rates, may trigger a different data-mining
outcome.
Once the model has been validated and executed, you'll want to
view the results and identify actions to be taken, or use the model
to add more business rules to existing data sets. This could take
the form of a flag, which is set when a particular data set matches
the model (credit worthiness, for instance). You'll also need to
consider how to maintain your model over time given changes in
business and data elements.
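Setting such a flag can be as simple as the following Python
sketch. The creditworthiness rule here is a hypothetical stand-in
for a validated model, and the field names and applicant records
are invented.

```python
def creditworthy(row):
    """Hypothetical stand-in for a validated model's decision."""
    return row["years_employed"] >= 2 and row["debt"] / row["income"] <= 0.5

def add_flag(rows, model, flag_name):
    """Return copies of the rows with a boolean flag set where the model matches."""
    return [{**row, flag_name: model(row)} for row in rows]

# Invented applicant records, for illustration only.
applicants = [
    {"applicant_id": 101, "income": 50000, "debt": 10000, "years_employed": 6},
    {"applicant_id": 102, "income": 30000, "debt": 20000, "years_employed": 1},
    {"applicant_id": 103, "income": 40000, "debt": 25000, "years_employed": 8},
]

flagged = add_flag(applicants, creditworthy, "creditworthy")
for row in flagged:
    print(row["applicant_id"], row["creditworthy"])
```

Because the function returns copies, the original records are left
untouched and the flag can be recomputed whenever the model is
revised.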
Are you sure that's mining?
Confusion exists over how data mining relates to data
warehousing, data marts, and Olap.
"Olap is all about what happened in the past, as it just shows
you a view of tables you already have. Only data mining (uses data)
to help you predict the future," says David Smith, product
manager of Insightful.
Data mining complements technologies, such as data warehouses
and Olap, rather than replacing them. For example, users with a
data warehouse are likely to have already performed data cleansing.
Extracting a subset of that data to a data mart for mining is then
a fairly simple task.
Many business analysts already use Olap tools to examine data.
If you use traditional query or reporting tools, you can see what
your data contains. Olap tools allow analysts to go further and gain
an understanding of why certain patterns occur in the data.
Examining the income-versus-debt ratio to determine
creditworthiness is an example of this capacity, but it requires
that the analyst develop a theory and then use Olap tools to query
the data to validate or invalidate it.
By contrast, data mining does not rely on a hypothesis to
uncover patterns in data. The data itself is used to identify
patterns that may address a business problem. Using data mining to
determine creditworthiness, for example, may link income and debt,
but it may also identify years of employment as a contributing
factor.
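The difference can be illustrated with a deliberately simple
Python sketch. Rather than testing one theory, it tries a
single-threshold rule (sometimes called a decision stump) on every
attribute and ranks the attributes by how few records each rule
misclassifies. This is one basic technique chosen for the example,
not the method any particular tool uses, and the records and field
names are invented.

```python
def best_split(rows, attribute, target):
    """Return (errors, threshold) for the best single-threshold rule."""
    best = (len(rows) + 1, None)
    for threshold in sorted({row[attribute] for row in rows}):
        # Try both directions: high values mean True, or high values mean False.
        for above_means_true in (True, False):
            errors = sum(
                1 for row in rows
                if ((row[attribute] >= threshold) == above_means_true) != row[target]
            )
            if errors < best[0]:
                best = (errors, threshold)
    return best

def rank_attributes(rows, attributes, target):
    """Order attributes by how few rows their best rule misclassifies."""
    return sorted((best_split(rows, a, target)[0], a) for a in attributes)

# Invented records, for illustration only.
records = [
    {"income": 50000, "debt": 10000, "years_employed": 6, "good_credit": True},
    {"income": 30000, "debt": 20000, "years_employed": 1, "good_credit": False},
    {"income": 40000, "debt": 25000, "years_employed": 8, "good_credit": True},
    {"income": 60000, "debt": 12000, "years_employed": 1, "good_credit": False},
    {"income": 35000, "debt": 5000, "years_employed": 3, "good_credit": True},
    {"income": 45000, "debt": 30000, "years_employed": 0, "good_credit": False},
]

ranking = rank_attributes(
    records, ["income", "debt", "years_employed"], "good_credit"
)
for errors, attribute in ranking:
    print(attribute, errors)
```

On this toy data, years of employment separates the two groups
without error, even though no one proposed it as a factor in
advance. Real data is rarely so tidy, and a pattern found this way
still needs validating before it is used.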
Olap can be used to help theorise the effects of the data-mining
outcome (say, of creditworthiness) on the corporate bottom line.
Likewise, Olap technologies can help analysts explore and better
understand enterprise data before mining. In this regard, Olap and
data mining can work hand in hand.
Selecting data-mining tools
There is no shortage of solutions to enable a successful
data-mining strategy (see
www.kdnuggets.com for an
exhaustive list of commercial tools). You can also find many
equally effective open-source products. Whichever tools you choose,
it's key to implement data mining as an ongoing process.
As a core business and technology strategy, data mining can
increase revenue and reduce costs, offering a competitive edge in
good times or bad.
Maggie Biggs writes for InfoWorld