Sunday 2 May 2021

HADOOP ARCHITECTURE

Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop framework mainly includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide file system and OS-level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.

There are two primary components at the core of Apache Hadoop: 

  • Hadoop Distributed File System (HDFS)
  • MapReduce parallel processing framework

These are both open source projects, inspired by technologies created inside Google.


Features of HDFS:
  • Distributed storage and processing.
  • Optimized for high throughput rather than low latency.
  • Efficient at reading large files but poor at seek requests for many small ones.
  • A command-line interface to interact with HDFS is provided.
  • The built-in servers of the NameNode and DataNodes help end users check the cluster's status at regular intervals.
  • File system data is accessed as streams (see the sketch after this list).
  • File permissions and authentication are provided.
  • Replication is used to handle disk failures. Each block comprising a file is stored on several nodes inside the cluster, and the HDFS NameNode continuously monitors the reports sent by every DataNode to ensure that no block has gone below the desired replication factor due to failures. If this happens, it schedules the addition of another copy within the cluster.
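Because HDFS favours streaming access, a typical client simply opens a file and reads it sequentially. The following is a minimal sketch using the org.apache.hadoop.fs.FileSystem API; the NameNode address hdfs://namenode:8020 and the file path /data/large-file.txt are hypothetical placeholders.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStreamRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        // Open the file as a stream and read it sequentially; this is the
        // high-throughput access pattern HDFS is optimized for.
        try (FSDataInputStream in = fs.open(new Path("/data/large-file.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}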

Features of MapReduce:
  • Scalability
  • Flexibility
  • Security and authentication
  • Cost-effective solution
  • Fast
  • Simple programming model
  • Parallel programming
  • Availability and resilience

Hadoop Architecture:

Apache Hadoop is a flexible, reliable, and scalable distributed big data framework, designed to store data and run native computation on commodity hardware.

Hadoop architecture is based on a master-slave design, which determines how data is processed in the Hadoop system. Large files are stored across many servers, and operations on them are performed by mapping the data and then reducing the results. Every server is considered a node, and map tasks are computed on the nodes that hold the data.
  • NameNode: Controls the operations on the data.
  • DataNode: Writes the data to its local storage. Storing the entire data set in one single location is not recommended.
  • TaskTracker: Receives and performs the tasks allocated to the slave nodes.
  • Map: Reads the data in sequence and splits every line into separate fields for processing.
  • Reduce: Groups the outputs obtained from the map tasks and merges them.
The three main components of Hadoop, which play an important role in the Hadoop architecture, are:
1. Hadoop Distributed File System (HDFS). 
2. Yet Another Resource Negotiator (YARN).
3. Hadoop MapReduce

1. Hadoop Distributed File System (HDFS):
An HDFS file is divided into blocks, and each block is replicated within the Hadoop cluster. The default size of an HDFS block is 64 MB (128 MB in newer versions), and it can be increased up to 256 MB depending on the requirements.



Block:
Data in a Hadoop cluster is broken down into smaller units (called blocks) and distributed throughout the cluster. Each block is duplicated twice (for a total of three copies), with the two replicas stored on two nodes in a rack somewhere else in the cluster.

Since the data has a default replication factor of three, it is highly available and fault tolerant. If a copy is lost (because of a machine failure, for example), HDFS will automatically re-replicate it elsewhere in the cluster, ensuring that the threefold replication factor is maintained.


HDFS blocks are huge compared to disk blocks, and the main reason is to reduce the cost of seeks. By making a block large enough, the time consumed to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the time consumed to transfer a large file made of multiple blocks operates at the disk transfer rate. For example, if the seek time is around 10 ms and the transfer rate is 100 MB/s, then a block size of around 100 MB keeps the seek time to roughly 1% of the transfer time.
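Both the block size and the replication factor can also be chosen per file at creation time. The sketch below is a minimal illustration using the FileSystem.create overload that accepts an explicit replication factor and block size; the output path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCreateWithBlockSize {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        long blockSize = 128L * 1024 * 1024; // 128 MB blocks
        short replication = 3;               // threefold replication, the default
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/data/output.bin"), true, bufferSize, replication, blockSize)) {
            out.writeUTF("example payload");
        }
    }
}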

Rack Awareness:
In a large Hadoop cluster, in order to reduce network traffic while reading/writing HDFS files, the NameNode chooses a DataNode on the same rack or a nearby rack to serve the read/write request.
The NameNode obtains rack information by maintaining the rack IDs of each DataNode. This concept of choosing closer DataNodes based on rack information is called Rack Awareness in Hadoop.
The main purposes of Rack Awareness are:
  • Increasing the availability of data blocks
  • Better cluster performance
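Rack awareness is not automatic; the cluster administrator typically supplies a topology script that maps node addresses to rack IDs. The sketch below shows the standard net.topology.script.file.name property being set programmatically; the script path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;

public class RackAwarenessConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // The NameNode invokes this script with DataNode addresses and expects
        // rack IDs such as /dc1/rack1 on standard output.
        conf.set("net.topology.script.file.name", "/etc/hadoop/topology.sh");
    }
}

In practice this property is usually set in core-site.xml rather than in code, but the effect is the same: with a topology script in place, the NameNode can prefer replicas on the local or a nearby rack.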

HDFS architecture can vary, depending on the Hadoop version and features needed. Each cluster is typically composed of a single NameNode, an optional SecondaryNameNode (for data recovery in the event of failure), and an arbitrary number of DataNodes.

NameNode:
  • The NameNode runs on commodity hardware with the GNU/Linux operating system and the NameNode software.
  • The NameNode is also known as the Master node.
  • The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details. This metadata is kept in memory on the master for fast retrieval.
  • The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware as it is the centerpiece of HDFS.
Tasks:
  • Manages the file system namespace.
  • Regulates clients' access to files.
  • Executes file system operations such as naming, closing, and opening files/directories.
  • All DataNodes send a Heartbeat and a block report to the NameNode in the Hadoop cluster. The Heartbeat ensures that the DataNodes are alive; a block report contains a list of all blocks on a DataNode (a client-side view of this mapping is sketched after this list).
  • The NameNode is also responsible for maintaining the Replication Factor of all the blocks.
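The block-to-DataNode mapping that the NameNode maintains can be inspected from a client, as in the minimal sketch below; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/large-file.txt"));

        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}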

Job Tracker:
  • The JobTracker is an essential daemon for MapReduce execution in MRv1.
  • The JobTracker receives requests for MapReduce execution from the client.
  • The JobTracker talks to the NameNode to determine the location of the data.
  • The JobTracker finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the available slots to execute a task on a given node.
  • The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
  • The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
  • When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and the existing MapReduce jobs will be halted.
TaskTracker:
  • The TaskTracker runs on DataNodes, typically on all of them.
  • The TaskTracker is replaced by the NodeManager in MRv2.
  • Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
  • TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.
  • The TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.
  • TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker will assign the task executed by that TaskTracker to another node (a minimal MRv1 client configuration is sketched after this list).
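As a concrete point of reference, MRv1 clients locate the JobTracker through the mapred.job.tracker property. The sketch below is a minimal illustration; the host and port are hypothetical placeholders.

import org.apache.hadoop.conf.Configuration;

public class Mrv1ClientConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "jobtracker-host:8021"); // hypothetical address
        // With this setting, job submissions go to the JobTracker, which
        // schedules Mapper and Reducer tasks onto TaskTrackers near the data.
    }
}
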
2. Yet Another Resource Negotiator (YARN):
YARN is considered the brain of the Hadoop ecosystem. The operations performed by YARN are activity processing, resource allocation, and job scheduling. The two important components of this resource negotiator are the NodeManager and the ResourceManager. The ResourceManager passes incoming requests to the appropriate NodeManager; each DataNode runs a NodeManager, which is responsible for the execution of tasks.
 
Using the YARN framework, a user can run a number of different Hadoop applications without piling up workloads, as YARN utilizes the resources dynamically. YARN provides agility, new programming models, scalability, and improved utilization of clusters.
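A minimal sketch of a client configured to run a MapReduce job on YARN is shown below; the ResourceManager hostname is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnJobConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");          // run on YARN, not locally
        conf.set("yarn.resourcemanager.hostname", "rm-host");  // hypothetical ResourceManager

        Job job = Job.getInstance(conf, "example-job");
        // ... set mapper, reducer, input and output paths here ...
        // The ResourceManager allocates containers for the job, and a
        // NodeManager on each node launches and monitors its tasks.
    }
}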

3. Hadoop MapReduce:
Hadoop MapReduce plays a vital role in the Hadoop distributed computing platform; it is a Java-based programming model. Map and Reduce stages form a directed acyclic graph of processing that applies to a huge range of business use cases.

In the Map phase, a section of data is transformed into key-value pairs; those pairs are then sorted by key, and the Reduce phase merges the values for each key into a single output.
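The classic word-count example makes these two phases concrete: the Map phase emits (word, 1) pairs from each line, and the Reduce phase merges the values for each key into a single count. This is a minimal sketch, not tied to any particular cluster.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Map phase: transform a line of input into key-value pairs.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Reduce phase: merge all values for a key into one output per key.
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}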



