Over the past few years, the open source technology Apache Hadoop has become quite popular amongst BI and DW professionals. In this tutorial, we will explain the concept by answering some of the frequently asked questions (FAQ) on Hadoop.
In this guide:
- What is Apache Hadoop?
- Who supports and funds Hadoop?
- Why is Hadoop popular?
- Where does Hadoop find applicability in business?
- What are the enterprise adoption challenges associated with Hadoop?
- How has Hadoop evolved over the years?
- Companion projects to Hadoop
- Further reading
Apache Hadoop is a free, Java-based programming framework designed for parallel processing of large amounts of data in a distributed computing environment. Hadoop can scale, in a fault-tolerant manner, from one machine to thousands. Because the framework itself tolerates machine failures, the individual machines in the processing cluster can be inexpensive while the cluster as a whole stays resilient. With Hadoop, applications can process petabytes of data on thousands of processing nodes.
Hadoop is a project of the Apache Software Foundation. A global network of developers contributes to the main Hadoop project. Sub-projects of Hadoop are supported by some of the world's largest Web companies, including Facebook and Yahoo.
Hadoop's popularity is partly due to the fact that some of the world's largest Internet businesses use it to analyze unstructured data. Hadoop enables distributed applications to handle data volumes on the order of petabytes.
Hadoop, as a scalable system for parallel data processing, is useful for analyzing large data sets. Examples are search algorithms, market risk analysis, data mining on online retail data, and analytics on user behavior data.
Hadoop's scalability makes it attractive to businesses because the volume of data they handle keeps growing rapidly. Another core strength of Hadoop is that it can handle structured as well as unstructured data, from a variable number of sources.
- To many enterprises, the Hadoop framework is attractive because it gives them the power to analyze their data, regardless of volume. Not all enterprises, however, have the expertise to drive that analysis such that it delivers business value.
- Scaling up and optimizing Hadoop computing clusters involves custom coding, which can mean a steep learning curve for data analytics developers.
- Hadoop was not originally designed with the security functionalities typically required for sensitive enterprise data.
- Other potential problem areas for enterprise adoption of Hadoop include integration with existing databases and applications, and the absence of industry-wide best practices.
Hadoop originally derives from Google's implementation of a programming model called MapReduce. Google's MapReduce framework could break down a program into many parallel computations, and run them on very large data sets, across a large number of computing nodes. An example use for such a framework is search algorithms running on Web data.
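To make the MapReduce model described above more concrete, here is a minimal single-machine sketch in Python that counts words. It is not Hadoop's API and not Google's implementation. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen for illustration, and a thread pool stands in for the many computing nodes a real cluster would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks, workers=4):
    # Each chunk is mapped independently of the others, which is what
    # lets the map step be spread across many workers (hypothetical
    # stand-ins here for cluster nodes).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    # Each key is reduced independently as well.
    return dict(reduce_phase(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Illustrative input only: three "chunks" of a larger data set.
    sample_chunks = ["big data big clusters", "data nodes", "big nodes"]
    print(word_count(sample_chunks))
```

The point of the sketch is the shape of the computation: the map and reduce steps share no state between chunks or keys, so a framework can run them on separate machines and rerun any piece that fails.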
Hadoop, initially associated only with web indexing, evolved rapidly to become a leading platform for analyzing big data. Cloudera, an enterprise software company, began providing Hadoop-based software and services in 2008.
In 2012, GoGrid, a cloud infrastructure company, partnered with Cloudera to accelerate the adoption of Hadoop-based business applications. Also in 2012, Dataguise, a data security company, launched a data protection and risk assessment tool for Hadoop.
The Apache Software Foundation maintains several companion projects to Hadoop:
- Apache Cassandra is a database management system designed for large amounts of data. Key aspects are fault tolerance, scalability, Hadoop integration, and replication support.
- HBase is a non-relational, fault-tolerant distributed database designed for storing large amounts of sparse data.
- Hive is a data warehousing system for Hadoop that enables easy data summarization.
- Apache Pig consists of a high-level language to create data analysis programs, along with the infrastructure for evaluating those programs.
- Apache ZooKeeper is a centralized service for distributed applications. It maintains configuration information and provides a naming registry, distributed synchronization, and group services.
- Chukwa is a data collection system that monitors large distributed systems; it includes a toolkit for analyzing the results.
- The Apache Mahout project aims to produce free implementations, on the Hadoop platform, of scalable machine learning algorithms.