Hadoop on Windows Azure: Hive vs. JavaScript for processing big data

Comments

For some time Microsoft didn't offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop, which was designed for this purpose, is written in Java and was not available to .NET developers. So, Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute the load and speed up big data computations.

But it is hard to find guides explaining how to work with Hadoop on Windows Azure, so here we present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows Azure and compare their performance.

When the R&D department at Altoros Systems Inc. started this research, we only had access to a community technology preview (CTP) release of Apache Hadoop-based Service on Windows Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop Connection. We analyzed two ways of querying with Hadoop that were available from the Web panel: HiveQL querying and a JavaScript implementation of MapReduce jobs.

HOW-TO: Get Hadoop certified ... fast

IN PICTURES: 'The Human Face of Big Data'

We created eight types of queries in both languages and measured how fast they were processed.

A data set was generated based on US Air Carrier Flight Delays information downloaded from Windows Azure Marketplace. It was used to test how the system would handle big data. Here, we present the results of the following four queries:

Count the number of flight delays by year

Count the number of flight delays and display information by year, month, and day of month

Calculate the average flight delay time by year

Calculate the average flight delay time and display information by year, month, and day of month

From this analysis you will see performance results tests and observe how the throughput varies depending on the block size. The research contains a table and three diagrams that demonstrate the findings.

Testing environment

As a testing environment we used a Windows Azure cluster. The capacities of its physical CPU were divided among three virtual machines that served as nodes. Obviously, this could introduce some errors into performance measurements. Therefore we launched each query several times and used the average value for our benchmark. The cluster consisted of three nodes (a small cluster). The data we used for the tests consisted of five CSV files of 1.83GB each. In total, we processed 9.15GB of data. The replication factor was equal to three. This means that each data set had a synchronized replica on each node in the cluster.

The speed of data processing varied depending on the block size -- therefore, we compared results achieved with 8MB, 64MB and 256MB blocks.

The results of the research

The table below contains test results for the four queries. (The information on processing other queries depending on the size of HDFS block is available in the full version of the research.)

Brief summary

As you can see, it took us seven minutes to process the first query created with Hive, while processing the same query based on JavaScript took 50 minutes and 29 seconds. The rest of the Hive queries were also processed several times faster than queries based on JavaScript.

To provide a more detailed picture, we indicated the total number of Map and Reduce tasks. As you can see in Figure 2, the first Hive query produced 37 Map tasks and 10 Reduce tasks. The JavaScript query generated 150 Map tasks and a Reduce task.

This can be explained by the fact that JavaScript is not a native language for Hadoop, which was written in Java. Hive features a task manager that analyzes the load, divides the data set into a number of Map and Reduce tasks, and chooses a certain ratio of Map tasks to Reduce tasks to ensure the fastest computation speed. Unfortunately, it is not clear yet how to optimize the JavaScript code and configure the task manager, so that it uses available resources in a more efficient manner.

Such a great difference in performance may also have another explanation. The results of the JavaScript query are written to the outputFile of the runJs command ("codeFile," "inputFiles," "outputFile") using a single Reduce task, as indicated in the table above.

Dependency between the block size and the number of Map tasks

We have also analyzed how the size of a block in a distributed file system influenced the number of Map tasks triggered in Hive and JavaScript queries.

For a 64MB block, the HQL query ran 37 Map tasks and 10 Reduce tasks. When a JavaScript query was processed, the task manager divided the total load into 150 Map tasks and a Reduce task.

Referring to the table, we can conclude that the number of Reduce tasks does not depend on the block size and is equal to 10 for Hive queries and to 1 for JavaScript queries.

Dependency between performance and the number of Map/Reduce tasks

We also analyzed how the number of Map and Reduce tasks influenced the speed of processing Hive and JavaScript queries.

From this diagram, you can see that Hive queries were properly optimized and the block size had almost no impact on execution time. In JavaScript, on the contrary, the processing speed depended directly on the number of Map tasks.

Dependency between performance and the type of a query

Below you can see the diagram that shows how the processing speed depends on the query type for a data set of 64MB.

The difference between the first and the second, as well as between the third and the fourth, queries was in the number of grouping parameters. The first query calculated flight delay times by year. In the second query, we added such parameters as month and day. The third query returned the average flight delay times by year, which is a different arithmetic operation. The fourth query calculated the average flight delay times by year, month and day.

Judging by the diagram, additional grouping parameters had much greater influence on JavaScript queries than the performed arithmetic operations. In case of Hive, such operations as transforming, converting and computing data caused the processing speed to degrade significantly, which can be seen from the difference in processing time between the first and the third queries. The first query calculated average values and included four grouping parameters, which resulted in the slowest processing speed.

Conclusion

When we started this analysis, only the community technology preview release was available. Hadoop on Windows Azure had no documentation and there were no manuals showing how to optimize JavaScript queries. On the other hand, HiveQL had been built on top of Apache Hadoop long before Microsoft offered its solution. That is why Hive is much faster when performing basic operations, such as various selections or doing various data manipulations like creating/updating/deleting, random data sampling, statistics, etc. However, you would have to opt for JavaScript for more complex algorithms, such as data mining or machine learning, anyway, since they cannot be implemented with Hive.

In short, this article demonstrates the results of processing just four queries with the size of an HDFS block equal to 64MB. The full version of the research includes two additional tables with performance figures for HDFS blocks of 8 and 256MB. It also features time spent on processing other four queries. The full version of the research is available at www.altoros.com/hadoop-on-azure.

In October 2012 Microsoft released a new CTP version of this service, which is now called Windows Azure HDInsight. Some of the issues we mentioned before were fixed, since the improvements included:

updated versions of Hive, Pig, Hadoop Common, etc.

an updated version of the Hive ODBC driver

a local version of the HDInsight community technology preview (for Windows Server)

guides and documentation describing how to use the service

Now Microsoft offers a browser-based console that serves as an interface for running MapReduce jobs on Azure. The implementation of Hadoop on Windows Azure also simplifies installation, configuration and management of a cloud cluster. In addition, the updated platform can be used with such tools as Excel, PowerPivot, Powerview, SQL Server Analysis Services and Reporting Services. There is also the ODBC driver that connects Hive to Microsoft's products and helps to deal with business intelligence tasks. Such a solution that would enable .NET developers to process huge amounts of data fast was long awaited.

Although this article describes the out-of-the-box querying methods, the .NET community is contributing to .NET SDK for Hadoop. Currently, the 0.1.0.0 version is available to the public at CodePlex. This library already enables developers to implement MapReduce jobs using any of the CLI Languages -- the solution comes with examples written in C# and F# -- and provides tools for building Hive queries using LINQ to Hive.

Therefore, soon .NET developers will be able to build native Hadoop-based applications, employing other libraries that conform to the Common Language Infrastructure. This SDK will be an even more efficient tool for in-depth data analysis, data mining, machine learning, and creating recommendation systems with .NET.

About the authors:

Andrei Paleyes has 5+ years of experience in MS .NET-related technologies applied in large-scale international projects. Having a master's degree in mathematics, he is interested in big data analysis and implementation of mathematical methods used in data mining. He is a knowledge discovery enthusiast and presented a number of sessions on data science at local conferences. Recently, Andrei participated in architecture development of the analytical cloud-based platforms for genome sequencing and energy consumption solutions.

Sergey Klimov is proficient in developing large-scale applications and corporate systems, as well as processes automation using MS .NET and cloud technologies. He has degrees in software engineering and technical automatics. Sergey focuses on projects that require processing large volumes of data using Hadoop and cloud technologies, in particular Windows Azure.

Read more about data center in Network World's Data Center section.