Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

Comments

It's been a big year for Apache Hadoop, the open source project that helps you split your workload among a rack of computers. The buzzword is now well known to your boss but still just a vague and hazy concept for your boss's boss. That puts it in the sweet spot when there's plenty of room for experimentation. The list of companies using Hadoop in production work grows longer each day, and it probably won't be long before "Hadoop cluster" takes over the role that the words "crazy supercomputer" used to play in thriller movies. The next version of the WOPR is bound to run Hadoop.

The area is flourishing as the core project attracts a wide collection of helper projects that organize the workload and make it simpler to manage a collection of jobs to run at particular times. There's HDFS, a standard file system that can organize the data spread out around the cluster; Hive, a data warehousing layer for making sense of this data; Mahout, a collection of routines for trying to learn something from said data; and ZooKeeper, a tool for keeping all of the balls in the air. At least a half-dozen or more other open source tools live in a stable orbit around Hadoop.

[ Explore the current trends and solutions in BI with InfoWorld's interactive Business Intelligence iGuide. | Read about InfoWorld's 2012 Technology of the Year Award winners. | Read about InfoWorld's top 10 emerging enterprise technologies. | Discover what's new in business applications with InfoWorld's Technology: Applications newsletter. ]

The open source projects are just the beginning -- a surprisingly large number of companies are emerging with the plan of helping people actually use Hadoop. Some are just selling support, and others are building their own tools that sit alongside Hadoop and make it easier to use.

This kind of competition is usually seen as open source at its best. There is a core collection of packages that serve like a standard to keep everyone in synchrony. Each of the groups is competing to add the right sauce that will attract customers, both paying and nonpaying. There continues to be controversy over just how much is rolled into the central collection, as there can be in any major open source project, but the amount of experimentation is so large that it's hard to be too focused on the amount of sharing.

To get a feel for the excitement, I took four major collections out for a test-drive. I powered up a cluster of nodes on Rackspace, installed the tools, pushed the buttons, and ran some sample jobs. It's getting to be surprisingly easy to spend a few pennies for an hour or two of machine time -- so much so that I found myself debating whether it was worth leaving my cluster idling over lunchtime. Lest anyone doubt the efficiency of cloud computing, I noticed that the rate for my cluster of relatively fat machines with 4GB of RAM was less than the cost to park a car around the corner. The parking meters spin faster.

The not-so-good news is that these collections are far from perfect. None of the tools I tried worked exactly as promised. There were always small glitches. I often found myself reading the log files and paging through endless lists of Java stack dumps. (Someone is going to have to apply Hadoop to analyzing the endless stack dumps. They're getting so involved that I doubt a single machine will be able to parse them for much longer.) After a few seconds, I could usually get things on track again. These tools may not require someone with much experience to use once they're running, but they can't be installed unless you're fairly adept with the ways that the Java stack is organized.

Despite these impediments, I spent most of my time churning through data. The good news is that all of these tools make it pretty easy to get a cluster of computers working together to solve problems. Using these tools is much easier than downloading and configuring the source code yourself. They're designed to be one-button applications, and they come close to achieving that goal.

Amazon Elastic MapReduce It should be no surprise that Amazon, one of the pioneers of cloud computing, offers a mechanism for spinning up Hadoop clusters on its EC2 cloud. Elastic MapReduce is tightly integrated with all of Amazon's other elastic offerings, and it sits as another tab on the Amazon Web Services main page. You store your data in S3, then fire up a job to churn through it.

The integration is nicely done. Amazon provides a Java-based Web interface that does a great job of hand-holding, taking care of many of the glitches that often occur when you're first trying software. When it wanted to store data in an S3 bucket, it flipped me over to a page for creating the bucket.

If the Web GUI is a bit too babyish, there's also a classic Web service API that's been wrapped up in software by a number of other programmers. I played a bit with a Ruby-based collection of tools that submits the jobs and starts them running. The standard start and end is the S3 cloud.

With Elastic MapReduce, Amazon is essentially offering nicer packaging on top of EC2 for those who are willing to plunge deeper into Amazon Web Services. I could have built my own cluster of machines on EC2 and used any of the Hadoop distros to spin them up, but Elastic MapReduce offers a nice set of shortcuts. Amazon has already built and integrated the infrastructure, and you just push the buttons to choose which version of Hadoop (0.18 or 0.2) you want to use. There's no need to worry about which version of Linux is running underneath.

The infrastructure is quite nice. You can choose to pay a stock price for your machines or just bid for empty machines on the spot market. This is the kind of extra feature that thrills the free-market fans, but I found it confusing. You choose your bid and take your chances. If you bid too little, you could end up waiting a long time, perhaps even forever.

It should be noted that the cloud doesn't respond instantaneously. It took from 5 to 18 minutes to execute tiny jobs that would take microseconds to execute on a fully configured cluster in your own server closet. The overhead wouldn't make a difference for a big job, but it's not the same as having your own cluster waiting patiently for you to push the Start button.

Taking advantage of all of these features means buying into Amazon's storage system. If you're already using S3 for your data, you'll be ready to go. If you're not, you'll have to make some decisions. Some people find that S3 is too expensive for bulk data that's rarely accessed. You're paying for all of the engineering that's been built for people who need a fairly good response time, and that price is built into the retrieval costs.

I think all of Amazon's extra features are good options for two classes of users. If you already have most of the relevant data in Amazon's cloud, Elastic MapReduce makes it easy to spin up jobs to analyze it. The piping is already well in place.

The other group would be those who don't need a cluster most of the time but want to do short, intensive calculations once a week, once a month, or once a quarter. It's not much work to create a full Hadoop cluster using the other tools in this review, but it's kind of silly to request new machines from scratch every now and again. Amazon offers a nice shortcut to uploading a Python script or a JAR file and going straight to computation.

Cloudera CDH, Manager, and EnterpriseCloudera is a startup that has collected Hadoop experts from all of the major companies using Hadoop. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with the names of people who learned Hadoop by building it.

The company is selling training, support, professional services, and some tools for managing your cluster. The Cloudera distribution and basic manager are free for clusters with fewer than 50 machines, while the subscription-based enterprise edition offers many more features for handling standard data formats.

The free version is quite useful for starting up a cluster and monitoring the jobs as they flow through the system. The manager takes a list of IP addresses, logs into all of them with SSH, and installs the major tools.

The automation makes it pretty easy to run the Cloudera distro, but I still had to patch a few glitches to install it on CentOS. One component wanted a certain version of zip, and it ground to a halt until I logged into the machines and installed it myself. At another point, the Web-based graphical user interface wouldn't work until I logged in again and installed a widget library, ExtJS. The open source licenses probably weren't compatible.

The logging in reminded me of a small point. The IBM installer can use a different root password for each machine. Cloudera's installer wants to use either the same root password or the same RSA key. This meant I had to log into all of the machines and change the password because I was using a stock version of CentOS to start up the rack.

The fact that I noticed this small point and remembered it says much about what is for sale here. The tools are open source and the companies are selling ease-of-use. Little delays can multiply when you're not running exactly the same code.

I think Cloudera has done a better job of making its tools work with different Linux distros. It lists Ubuntu, Suse, Red Hat, CentOS, and Debian. Although I had to do a bit of patching with CentOS, it was relatively simple.

The difference between the free and enterprise versions is a bit bigger than I often see. The proprietary version will not only handle more than 50 machines, but it also includes plenty of monitoring, reporting, and data analysis tools.

In other words, the free version is a great way to start up a Hadoop cluster and make sure that everything is running, but you'll have to do some poking around to monitor it. The enterprise version includes more tools that automate the poking around and double-checking.

IBM InfoSphere BigInsights IBM bundles Hadoop into something it calls InfoSphere BigInsights. The word "Hadoop" is on the main page, but the advertising copy clearly suggests that this is a product to help people who want "deep insights" into "big data." It's a tool for data analysis that just happens to use Hadoop for all of the structure.

There are two tiers: basic and enterprise. The basic edition is available completely for free, but you can buy support if you like. The enterprise edition, available through a commercial license, includes a number of extra features like BigSheets, a spreadsheetlike tool for drilling down into the data sitting in the cluster.

The collection includes all of the usual suspects and a few that aren't always mentioned -- such as Lucene. Lucene makes sense because BigInsights includes more than a few mechanisms for taking apart text. There's an entire collection of TextExtractors that will do things like search for addresses and flag certain words. The meat of the text analytics is in the enterprise edition.

IBM's literature says the BigInsights package is for Linux, but I found that it ran smoothly with only Red Hat's Enterprise distribution. The installation script would limp to the finish with a few of the others I tried, but it often reported that it failed to install entire tools like Hive or Pig. Even CentOS wasn't close enough to get much running. I think it may still be possible to get BigInsights running if you're adept with Linux and happy to poke around the log files, but it achieves labor savings only if you're running Red Hat Enterprise.

There are several nice touches in this installation script, by the way. As I plowed along looking for a good distribution, the software was careful to remember all of my inputs, so it wouldn't need to be reconfigured each time. This should be useful in a cloud where people may try to spin up a cluster, then tear it down. The software also includes a number of little features, like the ability to remember a different root password for each node; these can be quite helpful.

The center of the IBM tool is a console that helps you set up some jobs and kick them off. It's completely browser-based -- like the install script -- and you can simply upload your JAR files directly through the Web browser. You can even drill down into the HDFS file system layer and read the results without leaving the browser.

The Web GUI is a big advance over using the command line, but I easily found a number of ways that the console in the basic edition could be improved. As far as I can tell, there's no way to delete the old jobs. The information for each job includes basic details about the start and stop time, but almost everything else is just dumped as raw text. It wouldn't be too hard to parse some of this and do a nicer job displaying the log information.

The monitoring is also rudimentary. You can see that the nodes in your cluster are running and the components have started, but you don't get any cool dials or widgets that show the load or the progress. If you ask for the "details" about a component, you get a popup with some Log4J lines related to that component. A Java programmer won't blink an eye, but others might find it spare and uninviting.

There are a number of better tools in the enterprise edition. The aforementioned BigSheets, a so-called spreadsheet running on Hadoop, will let you play around with the data in the Hadoop cluster just as you would experiment with the data in Excel. There's also a number of tools for connecting your cluster with other databases and data sources throughout the enterprise. The basic edition is good for trying out a pretty standard version of Hadoop, while the enterprise edition adds a slew of features that go far beyond the open source core.

MapR M3 and M5 Whereas Cloudera is run by folks who come from Hadoop strongholds such as Yahoo, MapR's corporate team is filled with people who hail from Google, EMC, Microsoft, and Cisco, companies with plenty of experience with big data sets, even if they're not steeped in Hadoop's traditional way of working with them.

The new talent is also bringing more sophistication to the stack. The MapR distribution of Hadoop includes a better version of the file system with snapshots, mirroring, and direct NFS access if you need it. MapR also offers a more resilient architecture that won't go down if the central controller locks up. MapR calls all of this "high availability" and charges for it.

MapR comes in two flavors: M3 and M5. Is there an M4? Apparently not, but that's marketing for you. The real distinction is between the free community edition (M3) and the proprietary version with all the extra, high-availability features (M5). While some of the other companies are effectively selling tools for monitoring and reporting, MapR is selling a more sophisticated layer under the hood. In other words, whereas the others are wrapping more features around the open source Hadoop, MapR is rebuilding it.

The value of this approach will depend on the seriousness of your job. If your Hadoop data is mission critical or simply has to be ready most of the time, you'll definitely be interested in the extra features for preserving the data and keeping up the cluster. But if you're processing log files and generating reports that can wait a few hours or even days, there's not much need for it. Restarting your cluster when the NameNode fails is kind of a pain, but not if you have the slack in your system to begin again.

The value will also depend on the nature of your calculations. If you do many short calculations, then restarting isn't a big problem. But if your jobs last hours, days, or especially weeks, the ability to store snapshots becomes more and more valuable.

There is a cost for some of these features. I couldn't install the M3 distribution on my cluster of machines in the Rackspace cloud because it requires access to "physical hard drives." In other words, the NFS code from MapR burrows fairly deeply into the file system to generate the performance gains. It can't work its magic with all of the layers of virtualization in some environments. This won't be an issue if you're using real machines with real disks, but it can be a roadblock in some of the new virtual worlds.

I ended up doing my testing with a VMware version that worked with MapR's version of Ubuntu.

I have to say that the direct access to NFS is a nice feature. Although it's always possible to move the data in and out of HDFS with the regular tools, it's much easier to integrate the system with a direct NFS link. I wouldn't be surprised if some errant tool occasionally introduces a bug because the data is not run through HDFS, but I'm guessing the occasional problems will be worth the trade-off.

It's clear that MapR is putting most of its effort into the code under the hood. The Web console for monitoring jobs is perfectly nice, but it lacks some of the gaudier features you'll find in other distributions. I even found myself kicking off jobs by typing "hadoop" into a command line. There's nothing missing that will get in the way of serious work, but the interface isn't as accessible for new users as some of the others.

MapR also has some interesting partnerships. Many have noted that EMC is almost certainly repackaging MapR and selling it as part of the Greenplum collection of big data analytics tools. This suggests that we're already starting to see these stacks disappearing inside of other packages.

Hortonworks Data Platform I wanted to test the Hortonworks distribution, but it wasn't ready when I was writing. The company will be concentrating on selling training and support while avoiding creating proprietary extensions.

"We are an open source company," Eric Baldeschwieler, the CEO, told me. "The only product we have is open source. We won't commit to never selling anything, but you won't see anything in the next year. We're committed to a complete open source, horizontal platform. We want people to be able to download everything they want for free. That differentiates us from everyone else in the market."

Indeed, the company employs a number of people with a deep knowledge of Hadoop gained from years at Yahoo. The company formally separated from Yahoo last year, and now it's looking for partnerships to support their work.

Hortonworks is currently running a private beta. I couldn't join it, but perhaps your company will be able to participate. In the meantime, you can grab Hadoop directly from Apache. It's guaranteed to be pretty close to what Hortonworks will be shipping, at least for the next year.

Choosing a Hadoop There's no easy way to summarize the quickly shifting space. Each of these companies is pointed in a slightly different direction. They may all agree that the Hadoop collection of software is a great way to spread out work over a cluster, but they each have different visions of who would want to do this and, more important, how to accomplish it. The similarities are fewer than you might expect.

The biggest differences may be in how you handle your data. The idea of making your data accessible through NFS may be one of the neatest innovations, but MapR is introducing some risk by breaking from the pack and adding its own proprietary extensions. MapR's claims for great speed and better throughput are tantalizing, but there's also the danger of bugs or mistakes appearing because of incompatibility. Just as in horror movies, bad things can happen when you split up and strike off on your own.

Amazon's system also imposes its own limitations. It's easiest if you've already decided to park your information in S3. If you've decided that this is a good place for it, you won't notice. If you're not so sure, you'll have to adapt.

Much will also depend on how you're using your data. I think Amazon's cloud may be the simplest way to knock off fast jobs that are run occasionally, but it's not the only choice. Both IBM and Cloudera make it relatively easy to set up and run a cluster. After doing it a few times, I found I could knit together a small cluster of Rackspace cloud machines in just a few minutes. It's not simple, but it's not too hard either.

My guess is that most folks will want to use Cloudera, IBM, or MapR for permanent clusters in permanent clouds. Although it's tempting to spin up a rack for a bit of work, it probably makes more sense to leave it up and running just to simplify the process of migrating the data. I suspect that most Hadoop work involves more data juggling than raw computation. Leaving the cluster up and running makes it possible to move the data in and out.

Another big difference is how the companies are adding features for processing different types of data. IBM includes Lucene, a text search engine for building indices. Cloudera offers intelligent log search. I think that these sorts of Hadoop add-ons will only become more common.

These additions will also put more stress on the open source foundation of Hadoop. In many ways, the proliferation of different approaches shows the strength of the open source approach. The commercial vendors can collaborate on the care while competing on which extra features to add to the mix.

Still, there's bound to be some tension between the companies as they add material. I'm hopeful that the spirit of cooperation will continue to be strong enough to keep everyone working together, but there's no reason to assume that it will always be so. When people get good ideas, they'll roll them into their own clusters first and they may or may not contribute them back to the original distribution.

This article, "Enterprise Hadoop: Big data processing made easier," was originally published at InfoWorld.com. Follow the latest developments in business technology news and get a digest of the key stories each day in the InfoWorld Daily newsletter. For the latest business technology news, follow InfoWorld on Twitter.

Read more about business intelligence in InfoWorld's Business Intelligence Channel.