Map Reduce Moves Data Processing Goal Posts

March 15, 2012

What is MapReduce?

Originally defined by Google, MapReduce is a programming model for processing large data sets. A Map function processes input key/value pairs to produce an intermediate set of key/value pairs. A Reduce function then merges all intermediate values associated with the same intermediate key.
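To make the model concrete, here is a minimal, single-process sketch of the idea in Python, using the classic word-count example; the driver function is purely illustrative, since a real framework distributes the shuffle and merge across machines.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Merge all intermediate values for one key into a single total.
    return word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy single-process driver: group intermediate pairs by key,
    # then reduce each group. A real framework does this shuffle
    # across a cluster and hides the distribution entirely.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in sorted(intermediate.items()))

counts = run_mapreduce(
    [("doc1", "the cat sat"), ("doc2", "the cat ran")],
    map_fn, reduce_fn)
print(counts)  # {'cat': 2, 'ran': 1, 'sat': 1, 'the': 2}
```

In a real job the framework runs many map and reduce tasks in parallel, but the programmer's contract stays the same: supply the two functions.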

MapReduce was designed as a simple abstraction to make large scale parallel data processing problems easier to deal with, hiding details of data distribution and networking within a framework.

The framework will take care of ensuring the data is broken up, processed and merged back together. The programmer’s typical role is just to provide appropriate implementations of Map and Reduce functions. This allows them to focus on the logic of the analysis rather than the plumbing of the parallelized operation.

Data that can be readily broken down into discrete chunks, each of which can be analysed in parallel, lends itself well to MapReduce.

MapReduce frameworks will typically optimize network bandwidth by distributing multiple copies of the data across the cluster. The framework also provides fault tolerance, monitoring the health of machines in the cluster and re-running jobs if machines drop out of the cluster.

Conceptually, MapReduce is very simple, but it is extremely powerful: with the right framework, programmers can quickly produce expressive code that the framework can parallelize and apply to very large data sets.

The MapReduce concept was inspired by the map and reduce (fold) operations originally present in Lisp, which appear in many functional programming languages. Lisp is one of the oldest programming languages in existence, but still embodies many extremely advanced concepts, some of which are seeing a resurgence in the likes of MapReduce but also in the emergence of JVM-based languages like Clojure.
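The lineage is visible in almost any language with functional roots; Python, for example, ships both primitives directly:

```python
from functools import reduce

# The names come straight from the functional world: map applies a
# function to every element, reduce folds a sequence into one value.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
total = reduce(lambda acc, x: acc + x, squares, 0)
print(squares, total)  # [1, 4, 9, 16] 30
```

MapReduce-the-framework scales this same pair of operations across a cluster rather than a single list.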


MapReduce can scale to very large deployments, as the following examples indicate.

Facebook adopted MapReduce in 2007 in order to mine increasingly large volumes of data. As per their 2008 note, they were loading up to 250GB per day and running MapReduce jobs over 2,500 cores attached to a 1 Petabyte storage array. As mentioned in this paper, Facebook fostered developer creativity by allowing programmers to work in the language of their choice, but standardized on SQL as a way of expressing queries.

By 2011, Facebook had drastically increased their utilization of MapReduce, with a data warehouse of 30 Petabytes.

As of the 2011 Hadoop Summit, Yahoo were running Hadoop on 42,000 nodes.

In a move that makes MapReduce capability accessible on demand at manageable cost, Amazon Web Services created Elastic MapReduce, which opened the vast resources of the AWS infrastructure for data analysis. Amazon do not seem to want to disclose how many servers they have, but the number likely runs at least into the hundreds of thousands, if not millions.

Google are a significant user of MapReduce. Though they don’t publish exact details, analysis of available data suggests an approximate figure of 900,000 servers, a significant portion of which are being utilized for MapReduce workloads.


Hadoop is the pre-eminent MapReduce framework.

Managed by Apache, Hadoop includes HDFS (the Hadoop Distributed File System), a programming framework and a series of utilities and libraries.

Cloudera provide an extremely useful distribution that bundles Hadoop and associated utilities.

Microsoft’s Dryad was created as their take on Big Data but did not gain wide acceptance. Microsoft are now utilizing Hadoop. Interestingly, they have announced plans to integrate Hadoop with Excel, opening up new levels of accessibility for non-programmers.

As suggested by GigaOM, there is a lot of start-up activity around Hadoop and Big Data. It is, however, significant that Oracle has created a partnership with Cloudera in launching the Big Data Appliance.

The Oracle Big Data Appliance in its current full-rack form comes with 18 servers and 648TB of storage. Each server has two six-core high-spec CPUs, packing a total of 216 cores into a rack. The full rack has 864GB of memory.

Given Oracle also provide mechanisms to load data from Oracle structured databases (Oracle Loader for Hadoop) as well as connectors to technologies like R, Oracle’s Big Data Appliance presents one of the most compelling offerings for companies wanting to manage, mine and monetize large volumes of less structured data.

That said, given the wide array of choices and the ready availability of open source technology, there are plenty of options for companies who wish to create a more bespoke solution.

What Programming Languages Support MapReduce?

One really good way to get straight to the data analysis without worrying about lower-level programming is to use Pig, which provides a nice high-level language for data analysis. There are, however, various lower-level programming options.

Writing MapReduce code with Java is easy enough.

Hadoop also provides Hadoop Streaming, which allows any executable to be used for Map or Reduce jobs.
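As an illustrative sketch of the streaming contract (the word-count job again), a mapper and a reducer are simply programs that read lines from stdin and write tab-separated key/value lines to stdout, with Hadoop sorting by key between the two stages:

```python
from itertools import groupby

def mapper(lines):
    # Hadoop Streaming feeds raw input lines to the mapper's stdin;
    # the mapper emits "key<TAB>value" lines on stdout.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    # Streaming sorts mapper output by key, so the reducer sees all
    # lines for one key consecutively and can sum them with groupby.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# In a real job each stage runs as its own process over stdin/stdout;
# here the framework's shuffle/sort step between them is simulated locally.
mapped = sorted(mapper(["the cat sat", "the cat ran"]))
print(list(reducer(mapped)))  # ['cat\t2', 'ran\t1', 'sat\t1', 'the\t2']
```

Because the contract is just lines on stdin and stdout, the same pattern works in Ruby, Perl or any other language with an executable interpreter.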

Python is more directly supported through the Dumbo project.

Hadoop Streaming can be used to write MapReduce jobs in Ruby.

Scoobi allows developers to use Scala, though it is a relatively new framework.

Hadoop can be accessed from C.

Painful Paradigm Shift?

Like many good ideas, MapReduce is relatively simple: make large-scale parallel data processing much, much easier, and allow people to gain deep insights into their data quickly and with minimal fuss. The beauty of MapReduce is in some ways tempered by its apparent preference for scale over raw efficiency. Since MapReduce makes it easy to throw hardware at a problem, it might be argued that MapReduce is inefficient and that a well-coded piece of SQL running on well-configured hardware could outstrip the MapReduce approach.

However, it seems unlikely to be practical, or as cost effective, to scale a traditional RDBMS cluster to anything like the levels a MapReduce farm can reach. MapReduce farms are meant to be built from large numbers (thousands and then some) of commodity nodes that may be unreliable. Enterprise DB clusters are meant to be rock solid and typically run on very robust, expensive hardware. Scaling out that level of hardware to cope with non-tabular data may simply not be practical.

A more considered view is that structured and unstructured database technologies are largely complementary. MapReduce is unlikely to replace the RDBMS but offers very exciting possibilities for cost-effective ways to continually gain insights into large volumes of data that may not yield to the formal tabular structure handled by an RDBMS.

Working with MapReduce requires some change in thinking, but adjusting to that change can yield powerful results.


Fire Proofing Architectures

April 14, 2010

Architecture fire proofing means making the right decisions when establishing the fundamental controlling principles of a technology solution. A successful architecture will be cost effective, easy for available developers to work with, will scale well to defined future need, will provide sufficient performance and will be secure. A bad architecture can be very clever, use all the big-name technologies, cost a fortune and fail in every practical way.

General Fire Proofing

There is no magic solution, no technology panacea, no standard that will guarantee success in the terms defined above.

Understand the Problem

Involve Senior/Client Friendly Architects early and get them in front of the client.

Architects need to understand the business problem they’re trying to solve in some considerable detail. If the architecture team prefer each other’s company to that of the business, that is a significant red flag. If an architect’s appreciation of the challenge is too shallow, the resulting architecture may work fine for some requirements but become unwieldy in more complex situations. This may lead to longer development times and higher maintenance costs.

Give architects a clear scope of functionality for this and future projects.

The architecture team also need to understand the business context of the project. A common failing is to try to build in too much flexibility to cope with requirements that nobody can say for sure will be needed. It’s the classic problem of trying to build a car that can transform into a kitchen sink at the press of a button (read 1 million lines of config).

Deal with Not Invented Here – Don’t Build Frameworks

Assemble powerful existing frameworks.

Completed applications deliver value. Building frameworks creates only potential value, or a potential cost/time saving. In practice, realising those benefits is difficult.

Some developers love building frameworks. It’s a pure process that can perhaps be done in relative isolation, without the need to understand vast swathes of client requirements or spend time getting into the mucky reality of the business.

There are now a wide range of high quality frameworks that if adopted allow developers to spend much more time focused on client requirements.

No one framework is perfect for all occasions, but developers can now spend more time assembling and less time building.

Be Introspective

Have a realistic view of how your internal systems can be combined to synthesise new functionality.

Don’t assume systems will easily integrate.

Untested assumptions about the capability and ease of integration of systems can lead to massive problems down the line. It’s too easy to draw two system boxes on a piece of paper and link them. That simple line could turn out to be a huge can of worms.

Don’t Architect for Super Hero Developers – Unless You Have Them

Architectures need to be understandable and usable by the target development team.

A perfectly technically correct architecture is of little use if the development team can’t implement or use it properly. This situation can come about for various reasons, but a common cause is the architectural ivory tower, where architects conceive something that is way beyond what the company’s mainstream development teams are skilled to use. It’s an impedance mismatch. Either get the architects to work at a lower level, get different architects who will, get the developers to step up or get different developers. None of these are easy choices.

Technology Standards are a Double Edged Sword

The company standard may not be the best option for certain types of problems. Be prepared to make a case for why something else is better.

For application development, the choice of stack is no longer as simple as Java or Microsoft. The emergence of technologies like Groovy, Python, Clojure, Scala (to name a few) offer significant, robust alternatives that may solve certain types of problems with far greater aplomb.

Get Specialists Where Needed

Get people who really know your technology of choice, not people who’ve just converted.

Modern, popular procedural languages have many similarities. It should not be difficult for an experienced C# developer to cross train to Java. However, fluency and elegance in use of libraries can take a great deal longer to acquire. Be aware of that in planning and consider injecting expert resource.

When Trying to Break Records, Get Test Pilots

Know when you’re stretching the limits of the technology and get people who know how to fly it to and beyond its limits.

If you’re going to do something that’s outside the normal operating envelope of a technology, you may need high end skills to get it set-up, before handing back.

Don’t Use Technology Because it’s Cool

Just because it’s new and being hyped up by big names does not mean you should use it.

Some architects and developers will be attracted to the new because it’s exciting and shows promise of solving hard problems easily and with greater style. In some cases, the risk of taking the new technology may be worth it because solving the problem in a more traditional way may be massively more labour intensive. However, some technologies can tend to be over-sold and disappoint.

Be prepared to scientifically pilot technologies to know if they’ll do what they claim or come with a lot of limitations.

Recognizing a Failing Architecture

Does everything seem to take forever?

How much work does it take to implement a ‘simple’ piece of functionality? If vast amounts of ‘plumbing’ have to be written for every small piece of business logic, the architecture (or choice of technology, or the developers) may not be sufficiently well oriented to the problem.

Inadequate performance can mean a death sentence for an application. Study performance and see if the application is giving good response times under real world user volumes. Pervasive bad performance can have many causes but it can be a pointer to either a bad implementation (in places at least) or a correct implementation of an architecture that will be difficult to make perform.


Security of the Cloud?

July 23, 2009

Perception is at least nine tenths of the law. In Information Security it’s more like eleven tenths. Information Security in the Cloud is an evolving subject. The main providers are gaining client confidence, and vendors are moving quickly to take advantage, but concerns over security are holding companies back from gaining the Cloud’s benefits.

Companies are right to be circumspect, but before writing off the Cloud as too scary, execs should ask just how secure their data is right now. Can the company do a better job than, say, Amazon? On average, will the company mess up more or fewer times than, say, Microsoft or IBM? Will a global player have more or less experience, based on hosting thousands of companies?

The average company has to decide their strategy with an eye to threat level but also practicality. The likes of Google, Microsoft and Amazon, whatever you may think of them, have to define their threat level as extremely high. As such, the measures they take could go way beyond those of the average company. We also shouldn’t assume large companies do security well. Examples in recent years have shown that very clearly.

When considering security, the Cloud may sound scary. However, the idea of data being somehow further away and therefore less controlled or secure is potentially fallacious.

The IT capabilities of companies will vary greatly. Some will have huge expertise and sufficient staff numbers to dedicate to the security of the organization’s data. Some people will make it their life’s endeavour to keep one step ahead of known threats. In an ideal world the security team will be well managed, well funded, motivated, hold stakeholder positions on all the right committees and be empowered to get critical things done and veto risky changes. Of course this may not always be the case.

Small companies may well do security better (subjective definition) than large companies.

What of companies with IT departments that run into the hundreds or thousands of people? Is it ever really possible to maintain standards? When a lot of people, systems and suppliers are involved, things can go wrong. Vulnerabilities can arise. You may think you’re secure, but how do you know? Or when will you know you’re not? The stealthy, focussed hacker may not attract attention for some time.

Diligent companies will of course employ a range of methods to protect and check themselves: secure design of code and infrastructure, IDS, AV, firewalls, external red teams and so on. But then budget and practicality can overshadow the ideal. Security can get brushed aside without the implications being fully understood. And the ideal is never perfect. If you think you’re totally secure, you’re probably over-confident.

Going the full distance with an armoury of diverse measures can be extremely expensive and time consuming, especially in a large infrastructure. Compromises get made. Risks get assessed, and working execs have to balance perception of risk against spend. Effectively: can I justify the platinum-plated solution if I don’t think the risk is that high? Not an easy decision for any working CIO. And if they’re not fully informed or don’t have a security background themselves, it may be difficult to make risk assessments. Even if security sits with a Head of Risk, the same problem applies. Unless you’re well informed, you don’t know what you don’t know. This is one to be paranoid about, and a reason to make very careful leadership appointments.

Even with a vast range of measures, human error can be the weak link, especially in companies that are changing quickly or may be in some form of internal crisis. Indeed, rapid growth can lead to holes in the eco-system.

Give the Cloud a realistic security evaluation. The main providers have significant talent pools and massive resources to manage risk. They can very possibly build stronger defences and, when an issue does arise, respond more quickly: their reputation is at stake. There have of course been some famous issues, but it’s all about perspective. In context, has a cloud provider had an unreasonable number of issues when measured across everything they host?

The financial and flexibility benefits of the Cloud mean companies who want to get ahead should periodically re-evaluate the criteria they use to decide where to host, and their view on security needs. Without critical analysis of what’s needed and what’s possible, Cloud benefits may remain elusive.

Fundamentally, security responsibility cannot be abdicated to a Cloud provider or indeed an internal team. The buck should still stop with the CIO or Head of Risk. So the question is not “can I make security a Cloud company’s problem?”; it’s “can I do security better if I have services in the Cloud?”