Traditional Approach :
In this approach, an enterprise uses a single computer to store and process big data. The data is stored in an RDBMS such as Oracle Database, MySQL Server, or DB2, and sophisticated software is written to interact with the database, process the required data, and present it to the users for analysis.
Limitation :
This approach works well when the volume of data is small enough to be accommodated by a standard database server, or up to the limit of the processor handling the data. When it comes to huge amounts of data, however, processing it through a single traditional database server becomes a tedious, slow task.
Google’s Solution :
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts, assigns those parts to many computers connected over the network, and collects their results to form the final result dataset.
The above diagram shows various commodity hardware machines, which could be single-CPU machines or servers with higher capacity.
MapReduce :
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
• Map (Splits, Mapping)
• Reduce (Shuffling, Reducing)
The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.
The Reduce task is always performed after the Map task.
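As a small illustration, consider the classic word-count example: suppose the input consists of the two lines "Deer Bear River" and "Car Car River". The Map task emits one tuple per word:

    (Deer, 1), (Bear, 1), (River, 1), (Car, 1), (Car, 1), (River, 1)

and the Reduce task combines these tuples into the smaller set:

    (Bear, 1), (Car, 2), (Deer, 1), (River, 2)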
The different phases in MapReduce tasks are :
Input Phase :
Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.
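For example, with Hadoop's default TextInputFormat, the Record Reader produces one pair per line of the file, using the byte offset of the line as the key and the line's text as the value, e.g. (0, "Deer Bear River").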
Map Phase :
Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.
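A minimal Mapper sketch for the word-count example, assuming Hadoop's org.apache.hadoop.mapreduce API (class and field names here are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (word, 1) for every word in each input line.
    public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }

For the input line "Car Car River", this mapper emits (Car, 1), (Car, 1), (River, 1).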
Intermediate Keys :
The key-value pairs generated by the mapper are known as intermediate keys.
Combiner :
A combiner is a type of local Reducer that groups similar data from the Map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.
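In the word-count example, the Reducer logic (summing per-word counts) can double as a combiner, so it is typically enabled with one extra line in the job driver; IntSumReducer here refers to the Reducer class sketched further below:

    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation of each mapper's output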
Shuffle and Sort :
The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.
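Continuing the word-count illustration, after Shuffle and Sort the Reducer sees each key together with the list of values grouped under it, for example (Bear, [1]), (Car, [1, 1]), (Deer, [1]), (River, [1, 1]).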
Reducer Phase:
The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it can involve a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.
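A matching Reducer sketch for the word-count example, again assuming Hadoop's org.apache.hadoop.mapreduce API:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sums the counts grouped under each word and emits (word, total).
    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            result.set(sum);
            context.write(word, result);   // final key-value pair
        }
    }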
Output Phase :
In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
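For instance, with Hadoop's default TextOutputFormat, the record writer emits each final key-value pair as one tab-separated line of text, so the pair (Car, 2) becomes a line containing "Car", a tab character, and "2".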
MapReduce Example 1:
- A job is divided into multiple tasks, which are then run on multiple data nodes in a cluster (a driver sketch that submits such a job is given after this list).
- It is the responsibility of the job tracker to coordinate the activity by scheduling tasks to run on different data nodes.
- Execution of the individual tasks is then looked after by the task tracker, which resides on every data node executing part of the job.
- The task tracker's responsibility is to send progress reports to the job tracker.
- In addition, the task tracker periodically sends a 'heartbeat' signal to the job tracker to notify it of the current state of the system.
- Thus, the job tracker keeps track of the overall progress of each job. In the event of a task failure, the job tracker can reschedule it on a different task tracker.
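A driver sketch that wires the earlier pieces together and submits the job to the cluster; the input and output paths come from the command line, and the class names assume the TokenizerMapper and IntSumReducer sketches above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional
            job.setReducerClass(IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }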