Why Linux is the powerhouse for big data

This is a contributed posting for the Computer Weekly Open Source Insider blog by Peter Linnell, Linux Engineer at SUSE.

As the hype and competition around big data analysis continue to grow, today’s data scientist has a vast array of tools and technologies at their disposal.

Linnell argues that many of these tools share a little-known common thread: they’re powered by open source solutions and communities.

From the engineer’s workbench…

Along with the Hadoop community, big data solution providers such as SAP HANA, Hortonworks, WANDisco, Cloudera, Intel, InterSystems Corporation and Teradata use Linux as the underlying platform of their big data solutions.

Why should a data scientist care?

Each scientist has highly specialised needs that demand an open and powerful environment. Big data analysis needs computing that’s scalable, flexible and reliable – at a cost that won’t strain IT budgets.

Economically, big data spreads massive amounts of data across a cluster of hardware so that compute resources can scale out rather than up.

Linux’s low barrier to entry allows these clusters to be built at a fraction of the cost. It’s a familiar combination that made Linux the leader in high performance computing and high availability years ago.

NOTE: Some 97 percent of the TOP500 supercomputers in the world run on Linux, as do both the world’s fastest (Tianhe-2) and most famous (IBM Watson) supercomputers.

Structurally, the open design of Linux lets computing power scale as needed, while open source architectures provide the flexibility for systems to work together, so that computing resources can be pooled to handle large data intake. Big data systems need computing tools that can work together, and Linux allows a variety of tools to do so harmoniously.

Historically, the presence of a strong and free operating system made it possible to build the initial tools for virtualisation, cloud, and big data.

As adoption increased, it was a natural continuation to run those same tools on the Linux platform. History also shows that community-based projects are tough to out-innovate. A community-developed solution comes with world-class support and stability, and the open source community will continue to lead the cutting edge of big data projects.