Archive for March, 2012


Map Reduce Moves Data Processing Goal Posts

March 15, 2012

What is MapReduce?

Originally defined by Google, MapReduce is a programming model for processing large data sets. A Map function processes each input key/value pair to produce a set of intermediate key/value pairs. A Reduce function then merges all intermediate values associated with the same key.

MapReduce was designed as a simple abstraction to make large scale parallel data processing problems easier to deal with, hiding details of data distribution and networking within a framework.

The framework will take care of ensuring the data is broken up, processed and merged back together. The programmer’s typical role is just to provide appropriate implementations of Map and Reduce functions. This allows them to focus on the logic of the analysis rather than the plumbing of the parallelized operation.
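As an illustration, here is a minimal, single-machine sketch of the model in Java, using the classic word count problem. The class and method names, and the tiny in-memory input, are purely illustrative; a real framework would run the map and reduce calls in parallel across a cluster, while this sketch runs them sequentially simply to show the shape of the data at each stage.

import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A minimal, single-machine sketch of the MapReduce programming model,
// using word count as the example problem.
public class MapReduceSketch {

    // Map: one input record in, zero or more intermediate (key, value) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<Map.Entry<String, Integer>>();
        for (String word : line.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
        }
        return pairs;
    }

    // Reduce: all intermediate values for a single key in, one merged result out.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");

        // "Shuffle": group the intermediate pairs by key before reducing.
        Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                if (!grouped.containsKey(pair.getKey())) {
                    grouped.put(pair.getKey(), new ArrayList<Integer>());
                }
                grouped.get(pair.getKey()).add(pair.getValue());
            }
        }

        // Reduce each group and print the final (word, count) pairs.
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getKey(), entry.getValue()));
        }
    }
}

In a real deployment the grouping step in the middle is the framework's job; the programmer supplies only the map and reduce logic.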

Data that can be readily broken down into discrete packets and analysed in parallel lends itself well to the MapReduce approach.

MapReduce frameworks typically conserve network bandwidth by distributing multiple copies of the data across the cluster, so that work can be scheduled close to the data it needs. Fault tolerance is also provided by the framework, which monitors the health of machines in the cluster and re-runs work if machines drop out of the cluster.

Conceptually MapReduce is very simple, but it is extremely powerful: with the right framework, programmers can quickly produce expressive code that the framework can parallelize and apply to very large data sets.

The MapReduce concept was inspired by the map and reduce functions originally found in Lisp, which also appear in many other functional programming languages. Lisp is one of the oldest programming languages in existence, yet it still embodies many extremely advanced concepts, some of which are seeing a resurgence not only in the likes of MapReduce but also in the emergence of JVM-based languages like Clojure.

Scale

MapReduce can be scaled to a very high degree, as the following examples show.

Facebook adopted MapReduce in 2007 in order to mine increasingly large volumes of data. As per their 2008 Note, they were loading up to 250GB per day and running MapReduce jobs over 2,500 cores attached to a 1 Petabyte storage array. As mentioned in this paper, Facebook fostered developer creativity by allowing programmers to work in the language of their choice, but standardized on SQL as a way of expressing queries.

By 2011, Facebook had drastically increased their utilization of MapReduce, with a data warehouse of 30 Petabytes.

As of the 2011 Hadoop Summit, Yahoo were running 42,000 nodes.

In a move that makes MapReduce accessible on demand at manageable cost, Amazon Web Services created Elastic MapReduce, which opened up the vast resources of the AWS infrastructure for data analysis. Amazon do not disclose how many servers they run, but the number likely runs at least into the hundreds of thousands, if not millions.

Google are a significant user of MapReduce. Though they do not publish exact details, analysis of available data suggests a figure of around 900,000 servers, a significant portion of which are used for MapReduce workloads.

Implementations

Hadoop is the pre-eminent MapReduce framework.

Managed by Apache, Hadoop includes HDFS (the Hadoop Distributed File System), a MapReduce programming framework, and a series of utilities and libraries.

Cloudera provide an extremely useful distribution that bundles Hadoop together with associated utilities.

Microsoft’s Dryad was created as their take on Big Data but did not gain wide acceptance, and Microsoft are now utilizing Hadoop. Interestingly, they have announced plans to integrate Hadoop with Excel, opening up new levels of accessibility for non-programmers.

As suggested by GigaOM, there is a lot of start-up activity around Hadoop and Big Data. It is, however, significant that Oracle has created a partnership with Cloudera in launching the Big Data Appliance.

The Oracle Big Data Appliance in its current full-rack form comes with 18 servers and 648TB of storage. Each server has two 6-core, high-spec CPUs, packing a total of 216 cores into the rack. The full rack has 864GB of memory.

Given Oracle also provide mechanisms to load data from Oracle structured databases (Oracle Loader for Hadoop) as well as connectors to technologies like R, Oracle’s Big Data Appliance presents one of the most compelling offerings for companies wanting to manage, mine and monetize large volumes of less structured data.

That said, given the wide array of choice and the ready availability of open source technology, there are plenty of options for companies that wish to create a more bespoke solution.

What Programming Languages Support MapReduce?

One really good way to get straight to the analysis without worrying about lower-level programming is to use Pig, which provides a nice high-level language for expressing data processing. There are, however, various lower-level programming options.

Writing MapReduce code with Java is easy enough.
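By way of illustration, a sketch of the classic word count job written against the Hadoop Java API might look something like the following. The class names are illustrative and the exact API details vary a little between Hadoop versions.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: for each line of input, emit (word, 1) for every word found.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum all the counts emitted for the same word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The framework handles splitting the input, shuffling intermediate
        // pairs and re-running failed tasks; the driver just wires up the job.
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The mapper and reducer contain the actual analysis logic; everything else is boilerplate that tells the framework how to run, distribute and collect the job.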

However, Hadoop also provides Hadoop Streaming, which allows any executable to be used for the Map or Reduce steps of a job.
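The contract is simple: the mapper and reducer executables read records from standard input, one per line, and write tab-separated key/value pairs to standard output. A hypothetical streaming mapper for word count, written here in Java purely to illustrate the protocol (in practice it would more often be a script), could look like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative Hadoop Streaming mapper: reads input lines from stdin and
// writes intermediate "key<TAB>value" pairs to stdout. Any executable that
// follows this convention can be plugged in as a Map or Reduce step.
public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}

Because the interface is just stdin and stdout, the same approach works for scripting languages without native Hadoop bindings.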

Python is more directly supported through the Dumbo project.

Hadoop Streaming can be used to write MapReduce jobs in Ruby.

Scoobi allows developers to use Scala, though it is a relatively new framework.

Hadoop can be accessed from C.

Painful Paradigm Shift?

Like many good ideas, MapReduce is relatively simple. Make large-scale parallel data processing much, much easier. Allow people to gain deep insights into their data quickly and with minimal fuss. The beauty of MapReduce is in some ways tempered by its apparent favouring of scale over raw efficiency. Since MapReduce makes it easy to throw hardware at a problem, it might be argued that MapReduce is inefficient, and that a well-coded piece of SQL running on well-configured hardware could outstrip the MapReduce approach.

However, it is unlikely to be practical, or as cost-effective, to scale a traditional RDBMS cluster to anything like the levels a MapReduce farm can reach. MapReduce farms are meant to be built from large numbers (thousands and then some) of commodity nodes that may be unreliable, whereas enterprise DB clusters are meant to be rock solid and typically run on very robust, expensive hardware. Scaling out that level of hardware to cope with non-tabular data may simply not be practical.

A more considered view is that structured and unstructured data technologies are largely complementary. MapReduce is unlikely to replace the RDBMS, but it offers very exciting, cost-effective possibilities for continually gaining insights into large volumes of data that do not fit the formal tabular structure an RDBMS expects.

Working with MapReduce requires some change in thinking, but adjusting to that change can yield powerful results.
