Computerworld

Comparing the top Hadoop distributions

Hadoop introduced a new way to simplify the analysis of large data sets, and in a very short time reshaped the big data market. In fact, today Hadoop is often synonymous with the term big data.

Since Hadoop is an open source project, a number of vendors have developed their own distributions, adding new functionality or improving the code base. This article by Altoros, a big data specialist, provides an overview of the major distributions, describing how they differ from the standard edition.

A standard open source Hadoop distribution (Apache Hadoop) includes:

The Hadoop MapReduce framework for running computations in parallel

The Hadoop Distributed File System (HDFS)

Hadoop Common, a set of libraries and utilities used by other Hadoop modules

This is only a basic set of Hadoop components; other solutions -- such as Apache Hive, Apache Pig, and Apache ZooKeeper -- are widely used to solve specific tasks, speed up computations, and optimize routine work.
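To make the MapReduce model listed above more concrete, here is a minimal single-process sketch of a word count in plain Python. It does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration. On a real cluster the map and reduce calls would be distributed across nodes, while here they run sequentially.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Assumption: everything runs in one process. In Hadoop, each map call
    # could run on a different node, close to the data it reads.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data nodes"]
    print(word_count(sample))
```

The three functions mirror the stages of the paradigm: independent map calls, a grouping step, and independent reduce calls per key. That independence is what allows the framework to run the computation in parallel.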

Vendor distributions are, of course, designed to overcome issues with the open source edition and provide additional value to customers, with a focus on things such as:

Reliability. The vendors react faster when bugs are detected. They promptly deliver fixes and patches, which makes their solutions more stable.

Support. A variety of companies provide technical assistance, which makes it possible to adopt the platforms for mission-critical and enterprise-grade tasks.

Completeness. Very often Hadoop distributions are supplemented with other tools to address specific tasks.

In addition, vendors participate in improving the standard Hadoop distribution by giving back updated code to the open source repository, fostering the growth of the overall community.

Three of the top Hadoop distributions are provided by Cloudera, MapR and Hortonworks. The chart below illustrates the results of the market research Big Data Vendor Revenue and Market Forecast 2012-2017. It compares the revenue of these major Hadoop vendors in 2012.

While Cloudera and Hortonworks claim they are 100% open source, MapR adds some proprietary components to the M3, M5, and M7 Hadoop distributions to improve the framework's stability and performance.

Along with Cloudera, MapR and Hortonworks, Hadoop distributions are available from IBM, Intel, Pivotal Software, and others. These distributions may even be shipped as a part of a software suite (e.g., IBM's distribution), or designed to solve specific tasks (e.g., Intel's distribution, optimized for the Xeon microprocessor).

Key features of three popular Hadoop distributions

The values in the cells of the table refer to the versions of the corresponding components available in a particular Hadoop distribution. For performance comparisons, see our Hadoop Distributions: Cloudera vs. Hortonworks vs. MapR study.

In Hadoop 1.0, a single NameNode managed the whole cluster. It handled all metadata operations and stored the metadata in RAM. With scalability limited to approximately 4,000 nodes and 40,000 tasks, this node was also a single point of failure.

It was impossible to update Hadoop components on some of the nodes.

The MapReduce paradigm could be applied to only a limited range of tasks.

There were no other models (other than MapReduce) of data processing.

Resources of a cluster were not utilized in the most effective way.

While most distributions were developed to address these limitations, they did not introduce any significant architectural changes compared to the open source version. That's what made Hadoop 2.0 a real breakthrough when it emerged in 2013. In particular, it features YARN (Yet Another Resource Negotiator), a new cluster management system that turns Hadoop from a batch data processing solution into a real multi-application platform. The updated version brought the following improvements:

The vulnerability of a system with a single NameNode (a single point of failure) was eliminated.

The possible number of nodes in a cluster was greatly increased.

The range of tasks that can be successfully solved with Hadoop was extended by YARN.

The figure below illustrates the multi-application principle implemented in Hadoop 2.0, and shows that YARN is actually a layer between HDFS and data processing applications.

The main idea of YARN is to split two major tasks -- resource management and scheduling -- into separate components. YARN has a central ResourceManager and an ApplicationMaster, which is created for each application separately. This approach allows for running batch, interactive, in-memory, streaming, online, graph, and other types of applications simultaneously. Figures 3 and 4 below demonstrate the architectural differences between the two Hadoop versions.

Hadoop 1.0 had a single JobTracker, which had to deal with thousands of TaskTrackers and MapReduce tasks. This architecture limited scalability and restricted a cluster to running a single application at a time.

Hadoop 2.0 has a single ResourceManager and multiple ApplicationMasters. Since each application is managed by a separate ApplicationMaster, application management is no longer a bottleneck in a cluster. According to notes from the Hortonworks development team, they were able to simulate 10,000-node clusters composed of modern hardware without significant issues. Separating cluster management tasks from the application life cycle resulted in greatly improved cluster scalability.
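The division of responsibilities described above can be sketched with a toy model in Python. This is not the YARN API: the ResourceManager and ApplicationMaster classes below are hypothetical stand-ins, and the sketch assumes all containers are the same size. The point is only that the central component counts free containers, while each application tracks its own tasks.

```python
class ResourceManager:
    """Toy cluster-wide arbiter: it only knows how many containers are free."""

    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, requested):
        # Grant as many containers as are available, up to the request.
        granted = min(requested, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


class ApplicationMaster:
    """Toy per-application coordinator: it tracks its own tasks only."""

    def __init__(self, name, tasks, resource_manager):
        self.name = name
        self.pending = tasks
        self.running = 0
        self.resource_manager = resource_manager

    def request_containers(self):
        granted = self.resource_manager.allocate(self.pending)
        self.pending -= granted
        self.running += granted
        return granted

    def finish_running_tasks(self):
        # Hand the containers back so other applications can use them.
        self.resource_manager.release(self.running)
        self.running = 0


if __name__ == "__main__":
    rm = ResourceManager(total_containers=10)
    batch = ApplicationMaster("batch", tasks=8, resource_manager=rm)
    streaming = ApplicationMaster("streaming", tasks=4, resource_manager=rm)
    print(batch.request_containers(), streaming.request_containers(), rm.free)
```

In this sketch two different applications share one cluster, and the bookkeeping for each application's tasks lives in its own ApplicationMaster rather than in a single central tracker.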

At the same time, with a global ResourceManager, YARN provides much better resource utilization, which also helps to speed up a cluster. YARN allows for running different applications that share a common pool of resources. There are no pre-defined Map and Reduce slots, which helps to better utilize resources inside a cluster.
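A small back-of-the-envelope comparison illustrates why dropping pre-defined Map and Reduce slots can improve utilization. The two functions below are hypothetical and deliberately simplified: they assume every task needs exactly one slot or one equally sized container, which real clusters do not guarantee.

```python
def running_with_fixed_slots(map_slots, reduce_slots, pending_maps, pending_reduces):
    """Hadoop 1.0 style: map tasks use only map slots, reduce tasks only reduce slots."""
    return min(map_slots, pending_maps) + min(reduce_slots, pending_reduces)


def running_with_shared_pool(total_containers, pending_maps, pending_reduces):
    """YARN style (simplified): any pending task may take any free container."""
    return min(total_containers, pending_maps + pending_reduces)


if __name__ == "__main__":
    # Assumed cluster: 8 map slots + 4 reduce slots, or 12 interchangeable containers.
    # Assumed workload: 12 map tasks waiting and no reduce tasks yet.
    print(running_with_fixed_slots(8, 4, pending_maps=12, pending_reduces=0))
    print(running_with_shared_pool(12, pending_maps=12, pending_reduces=0))
```

Under these assumptions the fixed-slot cluster runs 8 tasks while 4 reduce slots sit idle, whereas the shared pool runs all 12. When the workload happens to match the slot split, both models run the same number of tasks.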

The ability to run non-MapReduce tasks inside Hadoop turned YARN into a next-generation data processing tool. Hadoop 2.0 features additional programming models, such as graph processing and iterative modeling, which extended the range of tasks that can be solved using this tool.

In addition, we're expecting to see rapid growth of applications that rest on YARN in the near future. Apache Giraph (for analyzing graphs, e.g., social connections on Facebook), Spark (machine learning and data mining), Apache HAMA (machine learning and graph algorithms), Storm (unbounded streams of data in real time), and others are adjusting to the new architecture.

Hadoop distributions tomorrow

There are several trends shaping the evolution of Hadoop distributions:

* YARN adoption. Hadoop 2.0 supports larger clusters, which enables running more computations simultaneously. It received a new cluster management system that fits a broader range of tasks, including support for more flexible data processing and consolidation algorithms. Therefore, Cloudera and Hortonworks were actively adopting it through 2013. Since MapR used some proprietary components in its distribution, it needed a bit more time. The release of MapR 2.0 that supports YARN is scheduled for March 2014. While MapR still uses its own file system instead of the default HDFS, the vendor seems to be shifting to wider usage of open source Hadoop code, since it now offers more support for different open Hadoop components.

* Third-party integration for data consolidation. Hadoop distributions are integrating with third-party solutions for analyzing data. Cloudera, for instance, added connectors for binding CDH (Cloudera's Distribution Including Apache Hadoop) with data analysis and reporting systems, such as Oracle, Tableau, and Teradata. CDH supports Talend Open Studio for Big Data, an easy-to-use graphical environment that allows developers to visually map big data sources and targets without the need to learn and write complicated code. This tool contains 450+ connectors for getting data from a variety of data sources.

* Significant performance improvements. Cloudera recently announced Spark support. With in-memory computations, this model can greatly speed up data processing, up to 100x in some cases. Hortonworks is also working on improving computing speed. The company initiated Stinger, a project that is aimed at making Apache Hive queries up to 100x faster. It is also working to optimize stored data to speed its processing.

Apache Drill, a project backed by MapR, aims at solving similar tasks. It is based on the model published by Google in the white paper Dremel: Interactive Analysis of Web-Scale Datasets. However, the project is quite new and may not be ready for production deployments.

Pivotal Software delivered PivotalHD, a Hadoop distribution that features HAWQ, a proprietary component able to process SQL-like queries 318x faster than Hive. Unfortunately, there is no independent evaluation that can prove these results.

If you are interested in third-party performance benchmarks of similar systems with massively parallel processing (MPP) architectures, you can check the figures from the AMPLab at UC Berkeley.

* Data security. Obviously, Hadoop vendors will be working harder to improve security of data access, restrict permissions, and address a wider range of data protection issues.

* Extending functionality for specific tasks. Companies that offer Hadoop distributions are always looking to add modules that introduce new capabilities to the framework. Cloudera's distribution, for instance, contains full-text search and Impala, an engine for real-time processing of data stored in HDFS using SQL queries. Hortonworks has added support for SQL semantics in the Stinger Initiative and is developing Apache Tez, a new architecture that would help to accelerate iterative tasks by eliminating unnecessary tasks and improving reads from and writes to HDFS. WANdisco provides cross-data center replication with its Non-Stop Hadoop technology.

Conclusions

Today, Hadoop is not only an integral part of the big data ecosystem but is a central force that gave a new start to the set of related tools. Though the adoption of Hadoop 1.0 for enterprise systems was limited to particular types of workloads, the situation will change with YARN.

The new architecture extends the range of cases that can be addressed with Hadoop. If used together with Storm, for instance, it would accelerate the processing of unbounded streams of data; in combination with Spark, it would foster data analytics initiatives; and with Tez, it would make iterative algorithms work much faster.

This article provides an overview of ecosystem trends only and does not compare performance. It is still hard to find performance results for real-life YARN clusters based on Hadoop distributions. Exhaustive figures are available for Hadoop 1.0 only: here (already updated with Hive on HDP 2.0) and here (Hortonworks vs. Cloudera vs. MapR).

The reason is simple. In the case of Cloudera, the new architecture is still in beta; MapR has scheduled its 2.0 release for March 2014. Most other vendors are also still in the development process. So, it will be interesting to compare the performance of Hadoop 1.0 vs. 2.0 in action and find out how the difference affects the overall cluster built on top of a Hadoop distribution.

Grigorchuk is a Director of R&D at Altoros, a company that focuses on accelerating big data projects and platform-as-a-service enablement. He is an author of multiple research projects on big data, distributed computing, mathematical modeling, and cloud technologies.