Hadoop - is the elephant packing its trunk for a trip into the mainstream?

This is a guest blog by Zubin Dowlaty, head of innovation and development at Mu Sigma.


Hadoop, the open-source software platform for distributed big data computing, has been making waves recently. The IPO of Hortonworks in December 2014 contributed to that, and the stock market ambitions of the other two main distributors of Hadoop, Cloudera and MapR, have also been fanning the flames. Getting these big data technology companies trading as public companies will create greater confidence in the technology. The increased funding levels will signal that these technologies are now proven, boosting their uptake.

A Schumpeterian wave of creative destruction is sweeping through the analytics space right now, triggered by Hadoop. It is quite amazing to witness the speed with which this is occurring. Larger enterprises are now eyeing it up for their corporate infrastructure, and the technology is on its way to becoming accessible to business users rather than just data scientists. However, to exploit this opportunity, enterprises need to be willing to adopt a different mindset.

En route to an enterprise-scale solution?

Over the last year, the industry has seen widespread deployment of Hadoop and associated technologies across many verticals. Furthermore, significant momentum has started building in the enterprise segment, with Fortune 500 companies taking Hadoop more seriously as a data-operating platform for enterprise-scale and -grade applications. Companies of this size have the muscle to take the technology from the ‘early adopter’ to ‘early majority’ stage and beyond, creating a network effect: as more – and more significant – companies implement Hadoop, others follow.

Within the Hadoop stack, Hadoop 2.0 and YARN are the critical components that have enabled Hadoop to become more of a general operating system or computing platform for an analytics group, rather than just a niche computing tool.

Technologies such as Apache Spark, Impala, Solr, and Storm, plugged into the YARN component model, have accelerated adoption for running real-time queries and computation. Technologies such as ParAccel, Hive on Tez, Spark SQL, and Apache Drill, from a range of vendors, have been created to support data exploration and discovery applications. SQL on Hadoop is another area that has seen a lot of traction in terms of development.

Spark stands out because it has given the data science community a programming framework for creating algorithms that run more quickly than on other technologies. It has come a long way towards being considered the new open standard in Hadoop, and with robust developer support it is expected to become the de facto execution engine for batch processing. Batch MapReduce is slow for iterative computation but great for handling big data. With Spark, data scientists will have fast in-memory capabilities for running algorithms on Hadoop clusters.
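As a rough illustration of the batch pattern Spark accelerates, here is a minimal word-count sketch in plain Python that mimics MapReduce's map, shuffle, and reduce phases. This is illustrative only – no Hadoop or Spark dependency, and the function names are invented for the example:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group all emitted counts by key (the word)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data on Hadoop", "big data with Spark"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # 2
```

On a real cluster each phase runs distributed across nodes, and classic MapReduce writes intermediate results to disk between jobs; Spark's key contribution is keeping that intermediate data in memory across stages, which is why iterative algorithms run so much faster on it.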

Governance and security for Hadoop clusters are still evolving, but these areas have progressed, and the main vendors have recognized them as weaknesses, so they can be expected to improve in the short to medium term.

Wringing more ROI out of big data for business people

In 2015, apart from scaling their Hadoop initiatives, companies will also be looking for the return on their data and infrastructure investments.

From a technology perspective, YARN will continue to gain momentum because it can support execution and programming engines beyond MapReduce. Given the flexibility it brings to the table, it will help build more big data applications suited to consumption by business users rather than just data scientists.

Analytical applications leveraging concurrent analysis will push analysts to adopt real-time or near real-time computation over the traditional batch mode.

Adoption of scalable technologies in storage, computing and parallelization will increase as more and more machine-generated data becomes available for analysis. Current BI, hardware and analytics-led software architectures are not suitable for scale. They will need to be revisited and carefully thought through. The industry is looking for standards in this area, and for a unified platform that offers an end-to-end solution.

Toolset, skillset, mindset

When it comes to the adoption of advanced technologies such as Hadoop, an organization can acquire toolsets and skillsets over a period of time, but the biggest challenge lies in changing the mindset of the enterprise community, which is deeply ingrained.

For example, large organizations are still struggling with the need to shift from central Enterprise Data Warehouse frameworks towards more distributed data management structures. Similarly, deep-seated trust in paid solutions needs to give way to greater adoption of open source models and technologies, which are now very mature.

It is important to move away from the current 1980s technology and application mindset, and truly scale up, in order for enterprise end users to reap the full benefits of Big Data insights and make better decisions. A holistic approach bringing math, business and technology together within a ‘Man-Machine’ ecosystem will be the key to achieving it.

Think scale, think agility, think continuous organizational learning – that is what technologies like Hadoop can make possible.