What is Hadoop?
Kuldip Pabla, Ahrono & Associates
Hadoop is a highly scalable, fault-tolerant distributed system for data storage and processing. Its scalability comes from two components: a self-healing, high-bandwidth clustered storage layer, known as HDFS (Hadoop Distributed File System), and a fault-tolerant distributed processing framework, known as MapReduce.
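The division of labor in MapReduce can be illustrated with a toy word count in plain Python. This is only a sketch of the programming model, not Hadoop itself; all names here are illustrative.

```python
# Toy illustration of the MapReduce idea: a map step emits key/value
# pairs, a shuffle groups them by key, and a reduce step aggregates
# each group -- here, counting words. No Hadoop involved.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["Hadoop stores data", "Hadoop processes data where the data resides"]
counts = reduce_phase(shuffle(map_phase(documents)))
print(counts["data"])    # 3
print(counts["hadoop"])  # 2
```

In real Hadoop, the map and reduce functions run in parallel on the nodes that hold the data, and the shuffle happens over the network between them.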
Why Hadoop as part of the IT?
Hadoop processes and analyzes a variety of new and older data to extract meaningful business insight. Traditionally, data moves to the computation node; in Hadoop, data is processed where it resides. The types of questions Hadoop helps answer are:
What types of data do we handle today?
Human-generated data that fits well into relational tables or arrays. Examples are conventional transactions – purchase/sale, inventory/manufacturing, employment status changes, and so on. This is the core data managed by OLTP relational DBMSs everywhere. In the last decade, humans have generated other kinds of data as well: text, documents (text or otherwise), pictures, videos, and slideware. Traditional relational databases are a poor home for this kind of data because:
Example of Hadoop usage
Netflix (NASDAQ: NFLX) is a service offering online flat-rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has over 100,000 titles and 10 million subscribers. The company has 55 million discs and, on average, ships 1.9 million DVDs to customers each day. Netflix also offers Internet video streaming, enabling films to be viewed directly on a PC or TV at home.
Netflix’s movie recommendation algorithm uses Hive (built on Hadoop, HDFS, and MapReduce) for query processing and business intelligence. Netflix collects all of its website streaming logs using Hunu.
They parse 0.6 TB of data on 50 nodes, with the data stored on Amazon S3. All the data is processed for business intelligence using MicroStrategy.
Hadoop challenges
Traditionally, Hadoop was aimed at developers. But the wide adoption and success of Hadoop depend on business users, not developers.
Commercial distributions will have to make it even easier for business analysts to use Hadoop.
[Photo: Servers running Hadoop at Yahoo.com]
Templates for business scripts are a start, but getting away from scripting altogether should be the long-term goal for the business-user segment. This has not happened yet. Nevertheless, Cloudera is trying to win the business-user segment, and if it succeeds it will create an enterprise Hadoop market.
To illustrate, here is a quote from the Yahoo! Hadoop development team:
“The way Yahoo! uses Hadoop is changing. Previously, most Hadoop users at Yahoo! were researchers. Researchers are usually hungry for scalability and features, but they are fairly tolerant of failures. Few scientists even know what "SLA" means, and they are not in the habit of counting the number of nines in your uptime. Today, more and more of Yahoo! production applications have moved to Hadoop. These mission-critical applications control every aspect of Yahoo!'s operation, from personalizing user experience to optimizing ad placement. They run 24/7, processing many terabytes of data per day. They must not fail. So we are looking for software engineers who want to help us make sure Hadoop works for Yahoo! and the numerous Hadoop users outside Yahoo!”
Hadoop Integration with resource management cloud software
One such example is Oracle Grid Engine 6.2 Update 5. Cycle Computing has also announced an integration with Hadoop. It reduces the cost of running Apache Hadoop applications by enabling them to share resources with other data center applications, rather than requiring a dedicated cluster for running Hadoop alone. Here is a relevant customer quote:
“The Grid Engine software has dramatically lowered for us the cost of data intensive, Hadoop centered, computing. With its native understanding of HDFS data locality and direct support for Hadoop job submission, Grid Engine allows us to run Hadoop jobs within exactly the same scheduling and submission environment we use for traditional scalar and parallel loads. Before we were forced to either dedicate specialized clusters or to make use of convoluted, adhoc, integration schemes; solutions that were both expensive to maintain and inefficient to run. Now we have the best of both worlds: high flexibility within a single, consistent and robust, scheduling system.”
Getting Started with Hadoop
Hadoop is an open source implementation of the MapReduce algorithm and a distributed file system. Hadoop is primarily developed in Java, and writing a Java application will give you the most control and presumably the best performance. However, Hadoop can also be used from other environments, including scripting languages, via “streaming”. Streaming applications simply read data from stdin and write their output to stdout.
Installing Hadoop
To install Hadoop, you will need to download Hadoop Common (also referred to as Hadoop Core) from http://hadoop.apache.org/common/. The binaries are available under the Apache License. Once you have downloaded Hadoop Common, follow the installation and configuration instructions.
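For a single-machine setup, the configuration step amounts to editing a couple of small XML files. A minimal sketch of conf/core-site.xml might look like the following; the host and port are illustrative and depend on your setup.

```xml
<!-- conf/core-site.xml: tells Hadoop where the default file system
     (here, a local single-node HDFS instance) lives. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```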
Hadoop With Virtual Machine
If you have no experience with Hadoop, there is an easier way to install and experiment with it. Rather than installing a local copy of Hadoop, install a virtual machine from Yahoo!. The virtual machine comes with Hadoop pre-installed and pre-configured, and is almost ready to use. It is available from Yahoo!'s Hadoop tutorial, which includes well-documented instructions for running the virtual machine and running Hadoop applications. In addition to Hadoop, the virtual machine includes the Eclipse IDE for writing Java-based Hadoop applications.
By default, Hadoop distributions are configured to run on a single machine, and the Yahoo! virtual machine is a good way to get going. However, the power of Hadoop comes from its inherently distributed nature, and deploying distributed computing on a single machine misses its very point. For any serious processing with Hadoop, you will need many more machines. Amazon’s Elastic Compute Cloud (EC2) is perfect for this. An alternative to running Hadoop on EC2 is to use the Cloudera distribution. And of course, you can set up your own Hadoop cluster by following the Apache instructions.

Resources