Pentaho: don't get blinded with (data) science

At the Hadoop Summit in San Jose this week, open source Business Intelligence company Pentaho is announcing what it calls ‘Data Science Packs’ for developers and data scientists.

The Data Science Pack aims to help productivity by executing advanced descriptive statistics and machine learning algorithms (at scale) inside of data flow transformations.

What are descriptive statistics?

According to a University of Leicester paper, “Descriptive statistics are used simply to describe the sample you are concerned with — they are used in the first instance to get a feel for the data, in the second for use in the statistical tests themselves… and in the third to indicate the error associated with results and graphical output.”
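To make the definition concrete, here is a minimal Python sketch of the kind of summary the paper describes: a feel for the data (mean, median), the spread that statistical tests rely on (standard deviation), and the error associated with the result (standard error of the mean). This is an illustration only, not how Pentaho, R or Weka implement it; the `describe` helper and the sample values are made up for the example.

```python
import math
import statistics


def describe(sample):
    """Return basic descriptive statistics for a list of numbers.

    Hypothetical helper for illustration; not part of any Pentaho API.
    """
    n = len(sample)
    stdev = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    return {
        "n": n,
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": stdev,
        # Standard error of the mean: how far the sample mean is
        # likely to sit from the true population mean.
        "std_error": stdev / math.sqrt(n),
    }


# Made-up measurements, used purely as example input.
sample = [4.1, 5.0, 5.2, 4.8, 6.3, 5.6]
summary = describe(sample)

for name, value in summary.items():
    print(f"{name}: {value}")
```

The standard error shrinks as the sample grows, which is why the same summary computed at scale gives a tighter estimate of the underlying mean.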

Pentaho says that the Packs streamline the hard, time-consuming process of using R and Weka to prepare, clean and orchestrate data in Pentaho Data Integration (PDI) for analysis.

So what? – and anyway, what are R and Weka?

The “R” programming language and “Weka” machine learning algorithms for predictive analytics are two of the most popular data science tools out there (source: O’Reilly Data Scientist Salary Survey).

Unfortunately, they are time-consuming to use and require specialist technical skills that many companies outside Silicon Valley don’t have in house.

Ventana Research just estimated that a whopping 60-80 percent of time spent on a big data analytics project is spent on preparing data using tools like R and Weka.


According to Pentaho, “By slashing that time, those responsible for data analysis can devote more time to the ‘value added’ stuff and less time on boring (but important) administrative hygiene tasks and just get things done a lot faster.”

But why do this anyway?

(1.) Mainly because Pentaho tells us that its customers have been asking for this.

(2.) The company says it isn’t focused on ‘eye candy’, so delivering these kinds of tools is part of Pentaho’s strategy to make the hardest, least sexy and most important aspects of data analytics fast and easy.

The Data Science Pack is essentially a toolkit for operationalising the commonly used R and Weka technologies.

According to the Ventana Research Big Data Analytics Benchmark Research, the top two time-consuming big data tasks are solving data quality and consistency issues (46%) and preparing data for integration (52%).


“Having built blueprints for the four most popular big data use cases, we know advanced and predictive analytics are core ingredients for success,” said Christopher Dziekan, EVP and chief product officer at Pentaho.

“The highest value of insight comes from having foresight blended with hindsight to drive insight and action. The Pentaho Data Science Pack allows organizations to apply their deep domain expertise and improve their customer analytics and predictions,” he added.