HP Haven Predictive Analytics: operationalising large-scale machine learning

HP wants its new Haven Predictive Analytics product to be viewed as a route to operationalising large-scale machine learning and statistical analysis for today’s big data volumes. The technology is powered by HP’s Distributed R offering.

But isn’t that all a bit of a mouthful?

Let’s break it down.

Why predictive analytics and predictive modeling?

Because determining future outcomes and trends from existing data sets (potentially) allows firms to predict everything from customer buying behaviour to fraud to machine downtime in industrial plants.
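To make "determining future outcomes from existing data" a little more concrete, here is a minimal sketch of the idea in plain Python. It assumes the simplest possible model, a straight line fitted by least squares, and the monthly sales figures are made up purely for illustration. The names `fit_line` and `predict` are hypothetical helpers, not part of any HP product.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept to the data by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Made-up history: month number and units sold in that month.
    months = [1, 2, 3, 4, 5]
    units_sold = [10, 12, 14, 16, 18]

    model = fit_line(months, units_sold)
    # Use the existing data to estimate a month that has not happened yet.
    print(predict(model, 6))
```

Real predictive models are far richer than a single straight line, but the pattern is the same: learn from the records you already have, then apply what was learned to a case you have not seen.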

Why Distributed R?

Because Distributed R is R itself, with new language extensions and a runtime to manage distributed execution, i.e. the kind needed in bigger enterprise environments.
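For readers wondering what "distributed execution" means in practice, the general pattern can be sketched in a few lines of Python. This is not Distributed R and does not use its API. It is only an illustration of the split, compute and combine idea, with a thread pool standing in for the worker machines of a real cluster. All function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, n_partitions):
    """Split a list into n_partitions contiguous, roughly equal chunks."""
    size, remainder = divmod(len(values), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `remainder` partitions each take one extra element.
        end = start + size + (1 if i < remainder else 0)
        partitions.append(values[start:end])
        start = end
    return partitions


def partial_sum_and_count(partition):
    """Work done independently on one partition (one 'worker')."""
    return sum(partition), len(partition)


def distributed_mean(values, n_partitions=4):
    """Compute a mean by combining per-partition partial results."""
    partitions = split_into_partitions(values, n_partitions)
    # Threads stand in for worker nodes; a real runtime would ship each
    # partition to a different machine.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


if __name__ == "__main__":
    data = list(range(1, 101))
    print(distributed_mean(data))
```

The point of the pattern is that no single worker ever needs to hold the whole data set. Each one returns a small partial result, and only those partial results are combined at the end.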

Why is operationalising big data volumes a big deal?

Because none of this technology is easy from the get-go. HP is trying to kick-start its use with out-of-the-box algorithms (yes, sorry, that is a thing): a set of proven parallel algorithms that, so says HP, produce accurate results consistent with those of mature standard R algorithms.

The software itself enjoys native integration with the HP Vertica columnar massively parallel processing (MPP) database, which is supposed to increase overall data access performance and allow developers to start building applications with predictive analytics inside.

ODBC parallel data loaders for dummies

Shilpa Lawande, GM of platform at HP’s Software Big Data Business Unit, suggests that when HP Distributed R is deployed with HP Vertica, overall data access performance is boosted by as much as five times over standard R ODBC (Open Database Connectivity) parallel data loaders. According to a press statement, “Since Vertica fully supports industry-standard SQL queries, it enables a much broader community of developers and DBAs to employ the power of predictive analytics without the burden of learning an entirely new technology or tool.”

HP reminds us that the open source R language is used by “millions of data scientists around the globe” to interpret, interact with and visualize data. It has been a powerful tool in tackling predictive modeling tasks such as drug discovery and financial modeling.

“Unfortunately, due to its inherent design, it has been challenged to process large data sets. HP worked out of HP Labs and HP Software to create its Distributed R extension and the result of this strategic initiative is the industry’s first open source version of a distributed platform for R that is explicitly designed to address today’s demanding Big Data predictive analytic tasks,” said the company, in a press statement.

Comfortable warm R feelings

Now the global developer community can employ R to scale predictive analysis to more than a billion records of data, and this is said to be ‘an order of magnitude improvement’ over traditional R-based performance. This offering from HP also retains consistency with R and enables data scientists to use their familiar R console and RStudio to work with Distributed R, and this could indeed be important for R converts.