In a world of big data, insight is king. And whether it’s detecting the likelihood of fraud from millions of credit card transactions, giving businesses useful insights into their customers, or pinpointing distant galaxies from a mass of astronomical data, one IT discipline is becoming an increasingly bright star: data mining.
The concept first came to commercial prominence about 15 years ago. For several years, you couldn’t attend an IT conference without at least one of the speakers wheeling out the urban myth of giant US retailer Walmart’s use of the technology to uncover previously unnoticed patterns in shoppers’ purchasing habits.
Most famously, it is said, the company found it could sell more alcohol in the evenings by displaying a selection of beer next to its baby products. The reason: fathers of young children were routinely asked by their wives to pick up nappies on the way home from work. If dad spotted a crate of beer while he was in the supermarket, he was far more likely to grab one.
Back then, data mining was largely the preserve of very large corporations with huge stores of data (indeed, data mining went hand in hand with the concept of data warehousing). But with the exponential growth of mobile technology, social networking and cloud computing in the intervening years, not only is there a lot more unstructured data out there that could potentially be hiding valuable insights, but a lot more businesses can afford to make use of it.
Today, many organisations are trying to develop ‘big data’ strategies. And while there’s less hype around the concept of data mining than there once was, it remains a critical – and growing – discipline for any data scientist. But what exactly is it, and how does it differ from business analytics?
Essentially, data mining is a subset of analytics that uses mathematical algorithms – including machine learning and artificial intelligence techniques – to examine vast datasets and uncover previously unseen patterns and correlations. It differs from other types of statistical analysis in that, rather than testing a set hypothesis, it slices and dices data in many different ways until it spots something interesting.
Laurie Miles, head of analytics for SAS in the UK and Ireland, says: “If I’m building a predictive model to ascertain the likelihood of a card transaction being fraudulent, I’m doing data mining. But if I switched the focus of my analysis and used time series forecasting [where you take various snapshots of data over time and use that to project forward] to find out how many cans of beans Waitrose will need in its Henley store at a given time, all of a sudden I’m an econometrician or a forecaster.
“As a statistician, I’m given a hypothesis to test and prove. As a data miner, there’s a much looser brief – for example, ‘find some interesting patterns that help me sell beer’.”
While many different algorithms have been developed for data mining, they generally fall into one of a handful of categories [see panel below]. Which ones data scientists choose to use depends on the type of business problem they are solving. “Typically, I’d throw the data at all of them and keep the one that gives me the best statistical results, unless there’s a business reason I’d want to do something else,” says Miles.
Correlation and causality
Yet while data mining techniques can throw up new and fascinating patterns in data, it still takes the skills of an experienced data scientist to weed out what’s genuinely useful from what’s merely interesting. Automated pattern recognition can identify a bunch of things in data that are correlated, but that doesn’t necessarily mean there is any causal link between X and Y.
As Miles says: “If you discover that a lot of people with ginger hair buy [UK] size 8 shoes, for example, that’s irrelevant to the manufacturer. You have to uncover insights that a business can use to its advantage. So while throwing data at a model can take you from 1,000 interesting things to, say, 100, a creative analyst still needs to look at those 100 things and work out what’s really of value.”
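The distinction between correlation and causation that Miles draws can be made concrete with a few lines of code. The sketch below (pure Python, invented toy data) computes a Pearson correlation coefficient: a value near 1 or -1 flags a strong relationship, but, as with ginger hair and size 8 shoes, says nothing about whether one thing causes the other.

```python
# Pearson correlation: measures how strongly two variables move together.
# A high value flags a pattern worth a human analyst's attention; it does
# not establish any causal link. Data below are invented for illustration.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Classic toy example: ice-cream sales and drowning incidents both rise
# in summer - perfectly correlated here, yet neither causes the other.
ice_cream = [10, 20, 30, 40, 50]
drownings = [1, 2, 3, 4, 5]
r = pearson(ice_cream, drownings)
```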
Data mining algorithms
Regression
The oldest, but still widely used, type of data mining algorithm. Miles says: “It’s a bit like drawing a ‘best fit’ line on a graph of scattered points, except over many more dimensions than the two of a graph, and using squiggly as well as straight lines. It comes up with a simple equation that you can plug numbers into, but it doesn’t allow you to explore the data as you’re building it. You throw data at it and it comes up with a model.”
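The “best fit line” Miles describes can be sketched in a few lines. This is a minimal ordinary least-squares fit in pure Python, over the two dimensions of a graph rather than many; the data points are invented for illustration.

```python
# Ordinary least-squares regression: fit a straight line y = a*x + b
# through scattered points, producing a simple equation you can plug
# new numbers into. Toy data only.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return slope, intercept

# Invented points, e.g. marketing spend (x) against sales (y).
a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
predicted = a * 5 + b  # plug a new number into the fitted equation
```

Real data mining tools do the same thing over many more dimensions, and with curved (“squiggly”) as well as straight fits, but the principle is identical.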
Decision trees
These split data into various groups to develop rules that will predict the likelihood of something happening. For example, if you were looking at whether someone was likely to default on a loan, you might split a set of data on individuals into (variously) men and women, under and over 30s, married and single, where they live, and so on. The algorithm then looks at the proportion of loan defaulters in each category to make useful future predictions. Miles says: “It’s just based on rules, so there’s no need for equations. We use it a lot for marketing models, since you can explain it to marketing people really easily.”
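The splitting step described above can be sketched directly. This toy example (invented records, not a full tree learner) groups loan records by a category and compares the default rate in each group, which is exactly the raw material a tree algorithm uses to choose its splits.

```python
# Core of the decision-tree idea: split records by a category and compare
# outcome rates in each group. A real tree algorithm picks the split whose
# groups differ most, then recurses. Records below are invented.

records = [
    {"age": "under30", "defaulted": True},
    {"age": "under30", "defaulted": True},
    {"age": "under30", "defaulted": False},
    {"age": "over30", "defaulted": False},
    {"age": "over30", "defaulted": False},
    {"age": "over30", "defaulted": True},
]

def default_rate_by(records, field):
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r["defaulted"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

rates = default_rate_by(records, "age")
```

Because the end result is a set of plain if/then rules (“under-30s default at twice the rate of over-30s”), the model is easy to explain to non-statisticians, which is the appeal Miles points to.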
Neural networks
A branch of AI, neural networks are a machine-learning technique that uses complicated mathematics, such as hyperbolic tangents, to automatically create in minutes accurate predictive models that could take a human mathematician many months or even years to build. Miles says: “It’s a bit of a ‘black box’, but it draws really squiggly lines through data and is excellent where you’re after speed and accuracy. We use it to predict the likelihood of fraud at the point of sale for HSBC, for example.”
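The hyperbolic tangent Miles mentions is the “squiggle” at the heart of the network. Below is a hedged sketch of a single forward pass through a tiny two-layer network, with fixed toy weights rather than trained ones, and no relation to any real fraud model; in practice the weights are learned from data.

```python
# Forward pass through a minimal neural network: each hidden unit takes a
# weighted sum of the inputs and squashes it through tanh (the hyperbolic
# tangent), and the output layer combines the hidden values. Weights here
# are invented toy values, not the result of training.
import math

def forward(x, hidden_weights, output_weights):
    hidden = [math.tanh(sum(w * xi for w, xi in zip(ws, x)))
              for ws in hidden_weights]
    return sum(w * h for w, h in zip(output_weights, hidden))

score = forward([0.5, -1.0],
                hidden_weights=[[1.0, 0.5], [-0.5, 1.0]],
                output_weights=[0.7, -0.3])
```

Stacking many such tanh units is what lets the model trace “really squiggly lines” through data, and also why the result is hard to interpret: the learned weights have no simple business meaning, hence the “black box” label.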
Support vector machines
These are the next generation of machine-learning algorithms, and are likely to grow in importance to data scientists as the technology develops. Technically speaking, they detect patterns in large datasets by constructing a set of hyperplanes in multiple (or even infinite) dimensions. Miles says: “They’re another ‘black box’ and, depending on the shape of your data, they can make the most accurate automated predictions of all.”
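The hyperplane idea can be illustrated with toy numbers. Once trained, a linear support vector machine classifies a point simply by which side of the hyperplane w·x + b = 0 it falls on; the sketch below uses invented, untrained weights purely to show that decision rule.

```python
# Decision rule of a linear SVM: compute the signed distance-like score
# w.x + b and classify by its sign. The weights and bias below are toy
# values chosen by hand, not learned from data.

def classify(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# In two dimensions the "hyperplane" is just a line; kernel methods let
# SVMs build the equivalent boundary in far higher dimensions.
w, b = [2.0, -1.0], 0.5
labels = [classify(p, w, b) for p in [(1.0, 1.0), (-1.0, 2.0)]]
```

Training is the hard part, which is to say choosing the hyperplane that separates the classes with the widest margin, and it is that optimisation, hidden from the user, that earns SVMs their own “black box” reputation.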
This was first published in July 2014