Big data analytics and the end of sampling as we know it

To sample or not to sample? It didn't use to be a question. Data sets were so huge, and compute resources so overwhelmed, that BI professionals were resigned to sampling. Enter big data analytics.

To sample or not to sample? It didn't use to be a question. Data sets were so huge, and compute resources so inadequate, that most BI professionals simply accepted sampling as a pragmatic (albeit imperfect) necessity.

The good news, according to experts, is that BI professionals now have a better choice: big data analytics.

“If you really want the lowdown on what’s happening in your business, you need large volumes of highly detailed data,” wrote Philip Russom, research director for data warehousing with The Data Warehousing Institute (TDWI), in Big Data Analytics, a recent TDWI report. “If you truly want to see something you’ve never seen before, it helps to tap into data that’s never been tapped for business intelligence or analytics.”


That's the radical raison d’être of big data analytics, and it’s radical because it is unprecedented. Not the notion of big data itself, which -- as Russom reminds us -- dates back at least to “the early 2000s, [when] storage and CPU technologies were overwhelmed by the numerous terabytes of big data … to the point that IT faced a data scalability crisis.” What's unprecedented is the application of advanced analytics technologies (such as data mining) to massive and diverse data sets. That's what's meant by big data analytics, the advent of which, Russom said, signals the end of this data scalability crisis.

It used to be that organisations couldn't meaningfully process -- i.e., mine, analyse and, in some cases, report against -- all of the data that they were collecting. That's why practices such as sampling came to be viewed as pragmatic necessities -- even if almost everyone conceded that they were inherently problematic, to say nothing of capricious.

“You don’t throw the whole data set in [to your data mining programme]. You have to choose which data you need, and you have to make sure that it’s [the right data], because if you don’t have [the right] data, your techniques might not work,” TDWI instructor Mark Madsen tells attendees of his predictive analytics seminars.  

“You can go out there and pull a very small percentage of your data ... and sample it for the probability of an event occurring,” he continued, “but where it will break down is [with] very rare and very infrequent events, and [that's where] it gets very hard to sample.”
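A quick simulation sketches the arithmetic behind Madsen's point. The figures below (a 0.1% event rate, a 1% sample) are illustrative assumptions, not from the seminar, but they show how few rare events a small sample actually captures:

```python
import random

random.seed(42)

# Hypothetical scenario: a rare event (say, a fraud flag) occurring in
# roughly 0.1% of 1,000,000 records -- the kind of infrequent signal
# Madsen warns that sampling struggles with.
population = [1 if random.random() < 0.001 else 0 for _ in range(1_000_000)]
total_events = sum(population)

# Draw a 1% sample, the way a sampling-based workflow would.
sample = random.sample(population, 10_000)
sampled_events = sum(sample)

print(f"rare events in the full data set: {total_events}")
print(f"rare events in a 1% sample:       {sampled_events}")
```

With roughly a thousand events in the full data set, the sample sees only a handful -- and a model trained on it has correspondingly little to learn from.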

Ideally, you'd want to identify all of these “rare” or “infrequent” events; they're the signatures of anomalies such as fraud, customer-specific churn and potential supply chain disruption. They're the high-value needles hiding in the undifferentiated haystacks of your data, and they've always been difficult to find.

Until now.

On the hardware side, there's plenty of horsepower: vendors have been shipping systems stuffed full of random-access memory -- and bristling with processors -- for half a decade or more.

True, the software technologies that today enable big data analytics took a bit longer to mature. And in the case of the open source Hadoop software framework -- which bundles a distributed file system, an implementation of the MapReduce algorithm, SQL- or DBMS-like access amenities and a variety of programming tools -- they're still maturing. But they're mature enough, much like sampling was once understood to be good enough.
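For readers unfamiliar with the MapReduce model Hadoop implements, a minimal word-count sketch shows the idea. This is a single-process illustration only; on a real cluster the map and reduce phases run in parallel across many nodes:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # The "map" step: emit a (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # The framework's "shuffle" step: group values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # The "reduce" step: aggregate the grouped values per key.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data analytics", "big data big results"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'analytics': 1, 'results': 1}
```

Because each map task and each reduce task touches only its own slice of the data, the same pattern scales from two toy documents to the terabytes Russom describes.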

IBM, Microsoft, Oracle and Teradata, along with most other prominent BI and data warehousing (DW) vendors, tout integration of some kind with Hadoop. Some even trumpet their own implementations of the ubiquitous MapReduce algorithm.

These vendors aren't just talking about big data, they're talking about big data in conjunction with advanced analytic technologies such as data mining, statistical analysis and predictive analytics. They're talking, in other words, about big data analytics.

According to TDWI Research, big data analytics hasn't just arrived; it's closing in on mainstream acceptance. More than one-third (34%) of respondents to a recent TDWI survey said they practiced some form of advanced analytics in conjunction with big data. In most cases, they're starting with low-hanging fruit -- for example, by eliminating practices like sampling.

“Big data provides gigantic statistical samples, which enhance analytic tool results,” Russom wrote. “Most tools designed for data mining or statistical analysis tend to be optimized for large data sets. In fact, the general rule is that the larger the data sample, the more accurate are the statistics and other products of the analysis,” he continued. “Instead of using mining and statistical tools, many users generate or hand-code complex SQL, which parses big data in search of just the right customer segment, churn profile or excessive operational cost. The newest generation of data visualization tools and in-database analytic functions likewise operate on big data.”
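Russom's "larger sample, more accurate statistics" rule is standard sampling theory: the error of a sample mean shrinks in proportion to one over the square root of the sample size. A small sketch (with made-up figures, not from the report) makes the point, including the limiting case the article argues for -- use all the data and the sampling error disappears entirely:

```python
import random
import statistics

random.seed(0)

# Synthetic "population": 1,000,000 values with mean ~100, spread ~15.
population = [random.gauss(100, 15) for _ in range(1_000_000)]
true_mean = statistics.fmean(population)

# Measure how far the sample mean lands from the true mean as the
# sample grows from 0.01% of the data to all of it.
errors = []
for n in (100, 10_000, 1_000_000):
    sample = random.sample(population, n)
    err = abs(statistics.fmean(sample) - true_mean)
    errors.append(err)
    print(f"n={n:>9}: |sample mean - true mean| = {err:.6f}")
```

At n = 1,000,000 the "sample" is the whole data set, so the error collapses to (effectively) zero -- which is precisely the advantage of analysing everything rather than a slice.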

Dave Inbar, senior director of big data products with data integration specialist Pervasive Software, concurs. In fact, he said, if organisations aren't already thinking about phasing out sampling and other “artifacts” of past best practices, they're behind the curve.

“Data science is inherently diminished if you continue to make the compromise of sampling when you could actually process all of the data,” he said. “In a world of Hadoop, commodity hardware, really smart software, there's no reason [not to do this]. There were good economic reasons for it in the past, [and] prior to that, there were good technical [reasons]. Today, none of [those reasons] exists. [Sampling] is an artifact of past best practices; I think its time has passed.”

TDWI instructor Madsen, a veteran data warehouse architect in his own right, agrees.

“Needle-in-a-haystack problems don't lend themselves well to samples, so you do things like overemphasize outliers in your training set, which can lead to problems,” he said. Madsen, who heads his own information management consultancy, notes that, “In the end, it's easier to run [the entire data set] than it is to work on [statistical] algorithms and sample and worry. There are techniques for [handling] data with the problematic distributions that arise and can trip up statistical methods.”

TDWI (The Data Warehousing Institute), in partnership with IRM UK, will present the TDWI BI Symposium at the Radisson Blu Portman, in London, 10-12 September 2012.
