Introduction to Map Reduce with Hadoop (I)
I thought of compiling some of the basic concepts related to Map Reduce and Hadoop, along with a trivial sample to get started with Hadoop. I would also like to discuss the new MapReduce API of Hadoop, and the samples will be based on the new API.

Why Map Reduce?
Nowadays we are surrounded by huge amounts of data, and each of us keeps consuming and generating data every second. Facebook, YouTube, Twitter, LinkedIn, Google searches and almost everything else we do on the internet deals with huge volumes of data. The main challenge is to analyze these volumes of data and make decisions based on that analysis. Google was the first to come up with an abstraction called Map Reduce to address the challenges of parallel processing of high volumes of data.
Fundamentals of Map Reduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets. It was originally developed by Google (MapReduce: Simplified Data Processing on Large Clusters) and built on well-known principles in parallel and distributed processing. Since then, Map Reduce has been widely adopted for analyzing large data sets through its open source flavor, Hadoop.
The original motivation behind Map Reduce arose from Google's need to process large amounts of raw data, such as crawled documents and web request logs, and to build inverted indexes or graph representations. Although these computations are not complex, owing to the high volume of data they have to be distributed across multiple machines (normally hundreds or thousands), which requires support for parallelization, fault tolerance, data distribution and load balancing.
Map Reduce provides a new abstraction for this type of high volume data processing. It allows users to express the simple computations they are trying to perform while hiding the messy details of parallelization, fault tolerance, data distribution and load balancing in a library.
In most computations over high data volumes, two main phases appear again and again in data processing components. The original authors of Map Reduce spotted this commonality and abstracted these phases of the Map Reduce model into 'mappers' and 'reducers' (the original idea was inspired by functional programming languages such as Lisp).
When processing large data sets, a mapping function is applied to each logical record in the input data to create intermediate key-value pairs. A second phase, called 'reduce', is then applied to all the data that shares the same key, to derive the combined result appropriately.
Mapper
The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs. The standard representation of this is as follows:
map: (k1, v1) → [(k2, v2)]
So, if we take the example of processing a large set of text files to obtain word frequencies (word count), the input to the map function will be the file_name and the file_content, denoted by k1 and v1. Within the map function, the user may emit any arbitrary key-value pairs, as denoted by the list [(k2, v2)].
eg: Mapper for Word Count sample
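Since the embedded code sample may not load here, a minimal sketch of such a mapper written against the new Hadoop API (org.apache.hadoop.mapreduce.Mapper) could look like the following; the class name WordCountMapper and the use of StringTokenizer are illustrative choices only.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each input record (key = byte offset, value = line of text),
// emit an intermediate (word, 1) pair for every word in the line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}

Here k1/v1 correspond to the byte offset and the line of text handed in by the input format, and k2/v2 correspond to the word and the count of 1 that is emitted.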
Reducer
The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs. It is important to keep in mind that, between the map and reduce phases, there is an implicit distributed 'group by' operation on the intermediate keys, so the intermediate data arrive at each reducer in order, sorted by key.
Because of this intermediate 'group by' operation, the input to the reduce function is a key together with a list of values: the key k2 is the one emitted from the mapper, and the list [v2] contains all the values that share that key. This is represented as follows:
reduce: (k2, [v2]) → [(k3, v3)]
eg: Reducer for Word Count sample
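A matching reducer sketch with the new API (org.apache.hadoop.mapreduce.Reducer) could look like the following; again, the class name WordCountReducer is only illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each word (k2), sum up the list of 1s emitted by the mappers ([v2])
// and write out (word, total count) as the final (k3, v3) pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}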
WordCount with Map Reduce
So, as we have covered all the fundamentals related to Map Reduce, it is time to start writing our simple map reduce program to count the word frequencies of a given data set.
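To run the mapper and reducer sketched above as a job, a driver class along the following lines would be needed. The class name WordCount is illustrative, and depending on the Hadoop release you may have to use new Job(conf, "word count") instead of Job.getInstance.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the word count job: wires the mapper and reducer together,
// declares the output key/value types and submits the job to the cluster.
public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line, e.g.
        // hadoop jar wordcount.jar WordCount /input/path /output/path
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}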
images courtesy of: http://dme.rwth-aachen.de/de/research/projects/mapreduce
http://technorati.com/technology/article/massively-parallel-processing-of-big-data1/
http://www.rabidgremlin.com/data20/