
Sunday, 4 April 2021

Map Reduce

Traditional Approach :

In this approach, an enterprise has a single computer to store and process big data. Data is stored in an RDBMS such as Oracle Database, MS SQL Server, or DB2, and sophisticated software is written to interact with the database, process the required data, and present it to users for analysis purposes.

Limitation :

This approach works well when the volume of data is small enough to be accommodated by standard database servers, or up to the limit of the processor that is handling the data. But when it comes to huge amounts of data, processing them through a traditional database server becomes a tedious task.

Google’s Solution :

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

The above diagram shows various pieces of commodity hardware, which could be single-CPU machines or servers with higher capacity.

Map Reduce :

The MapReduce algorithm contains two important tasks, namely Map and Reduce:

Map (Splits, Mapping)

Reduce (Shuffling, Reducing)

The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The Reduce task is always performed after the Map task.
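
Conceptually, the two tasks follow the standard MapReduce contract, where the output types of Map must match the input types of Reduce (k and v below are just placeholder names, not part of any particular API):

map:    (k1, v1)        ->  list(k2, v2)
reduce: (k2, list(v2))  ->  list(k3, v3)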

The different phases in MapReduce tasks are :

Input Phase :

Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.

Map Phase :

Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
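
As a concrete illustration, a minimal word-count mapper written against Hadoop's Java MapReduce API might look like the sketch below. The class name WordCountMapper is chosen here only for illustration; the input types (a LongWritable byte offset and a Text line) are what the record reader of Hadoop's default TextInputFormat hands to the mapper.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in the incoming line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line within the file (ignored here)
        // value = one line of text from the input split
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // intermediate key-value pair
        }
    }
}

For the line "good Hadoop is", this mapper emits (good, 1), (Hadoop, 1), (is, 1).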

Intermediate Keys :

The key-value pairs generated by the mapper are known as intermediate keys.

Combiner :

A combiner is a type of local Reducer that groups similar data from the Map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
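
For example, if one mapper's output happened to contain (Hadoop, 1) twice, a combiner could merge those two pairs into (Hadoop, 2) locally, so that less data has to be shuffled across the network. In the word-count case the reducer shown below can itself serve as the combiner, because summing partial sums gives the same total; it is wired in with job.setCombinerClass(...) in the driver sketch after the Output Phase section.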

Shuffle and Sort :

The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.

Reducer Phase:

The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each of them. Here, the data can be aggregated, filtered, and combined in a number of ways, which may require a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
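
Continuing the word-count sketch (the class name is again only illustrative), a matching reducer sums the values that arrive grouped under each key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word and emits (word, total).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        total.set(sum);
        context.write(key, total);   // e.g. (Hadoop, 3)
    }
}

For the key Hadoop with the grouped values [1, 1, 1], this reducer emits (Hadoop, 3).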

Output Phase :

In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
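
A small driver class ties the pieces together. This is a sketch that assumes the WordCountMapper and WordCountReducer classes above and a standard Hadoop 2.x/3.x classpath; TextOutputFormat's record writer is what produces the "key<TAB>value" lines in the output file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wires the mapper, combiner, reducer and the input/output formats together.
public class WordCountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class); // optional local reducer
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setInputFormatClass(TextInputFormat.class);   // record reader: (offset, line)
        job.setOutputFormatClass(TextOutputFormat.class); // record writer: "key<TAB>value"

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would typically be packaged into a jar and launched with something like "hadoop jar wordcount.jar WordCountDriver /input /output" (the jar name and paths are placeholders).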

MapReduce Example:
Suppose you have the following input data for a MapReduce word-count program.

Input data:
Welcome to Hadoop 
Class Hadoop is
good Hadoop is 
bad

Final output of the MapReduce :
bad 1
Class 1
good 1
Hadoop 3
is 2
to 1
Welcome   1

The data goes through the following phases of MapReduce:

Input Splits :
An input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.
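
For a tiny input like the one above it is convenient to picture each line as its own split, but in practice splits are much larger: with Hadoop's default TextInputFormat a split normally corresponds to one HDFS block (128 MB by default in Hadoop 2.x and later), and each split is processed by a single map task.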

Mapping :
This is the very first phase in the execution of the MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In our example, the job of the mapping phase is to count the number of occurrences of each word from the input splits (described above) and prepare a list in the form of <word, frequency>.

Shuffling :
This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records from the Mapping phase output. In our example, the same words are clubbed together along with their respective frequency.

Reducing :
In this phase, output values from the Shuffling phase are aggregated. This phase combines values from the Shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
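
Putting the phases together for the sample input (treating each input line as one split purely for illustration), the intermediate data looks roughly like this:

Mapping (one list of key-value pairs per split):
(Welcome, 1) (to, 1) (Hadoop, 1)
(Class, 1) (Hadoop, 1) (is, 1)
(good, 1) (Hadoop, 1) (is, 1)
(bad, 1)

Shuffling (values grouped by key):
(bad, [1]) (Class, [1]) (good, [1]) (Hadoop, [1, 1, 1]) (is, [1, 1]) (to, [1]) (Welcome, [1])

Reducing (each list of values summed):
bad 1, Class 1, good 1, Hadoop 3, is 2, to 1, Welcome 1

This matches the final output shown above.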



Working :
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
1. Job Tracker or Resource Manager: acts as the master (responsible for the complete execution of a submitted job).
2. Multiple Task Trackers or Node Managers: act as slaves, each of them performing part of the job. For every job submitted for execution in the system, there is one Job Tracker, which resides on the NameNode, and there are multiple Task Trackers, which reside on the DataNodes.
  • A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster.
  • It is the responsibility of the Job Tracker to coordinate the activity by scheduling tasks to run on different data nodes.
  • Execution of individual tasks is then looked after by the Task Tracker, which resides on every data node executing part of the job.
  • The Task Tracker's responsibility is to send progress reports to the Job Tracker.
  • In addition, the Task Tracker periodically sends a 'heartbeat' signal to the Job Tracker to notify it of the current state of the system.
  • Thus the Job Tracker keeps track of the overall progress of each job. In the event of a task failure, the Job Tracker can reschedule it on a different Task Tracker.

