BigData: Basic Building Blocks of Hadoop

What is Hadoop

Hadoop is an open source framework from Apache and is used to store process and analyze data which are very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more.

Hadoop HA

The high availability feature in Hadoop ensures the availability of the Hadoop cluster without any downtime, even in unfavorable conditions like NameNode failure, DataNode failure, machine crash, etc. It means if the machine crashes, data will be accessible from another path.

History of Hadoop

Year Event

2003 Google released the paper, Google File System (GFS).

2004 Google released a white paper on Map Reduce.

2006

Hadoop introduced.
Hadoop 0.1.0 released.
Yahoo deploys 300 machines and within this year reaches 600 machines.

2007

Yahoo runs 2 clusters of 1000 machines.
Hadoop includes HBase.

2008

YARN JIRA opened
Hadoop becomes the fastest system to sort 1 terabyte of data on a 900 node cluster within 209 seconds.
Yahoo clusters loaded with 10 terabytes per day.
Cloudera was founded as a Hadoop distributor.

2009

Yahoo runs 17 clusters of 24,000 machines.
Hadoop becomes capable enough to sort a petabyte.
MapReduce and HDFS become separate subproject.

2010

Hadoop added the support for Kerberos.
Hadoop operates 4,000 nodes with 40 petabytes.
Apache Hive and Pig released.

2011

Apache Zookeeper released.
Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.

2012 Apache Hadoop 1.0 version released.

2013 Apache Hadoop 2.2 version released.

2014 Apache Hadoop 2.6 version released.

2015 Apache Hadoop 2.7 version released.

2017 Apache Hadoop 3.0 version released.

2018 Apache Hadoop 3.1 version released.

Modules of Hadoop

HDFS: Hadoop Distributed File System. Google published its paper GFS and on the basis of that HDFS was developed. It states that the files will be broken into blocks and stored in nodes over the distributed architecture.

Yarn: Yet another Resource Negotiator is used for job scheduling and manage the cluster.

Map Reduce: This is a framework which helps Java programs to do the parallel computation on data using key value pair. The Map task takes input data and converts it into a data set which can be computed in Key value pair. The output of Map task is consumed by reduce task and then the out of reducer gives the desired result.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes.

The master node includes

NameNode
Secondary Namenode
Job Tracker / Resource Manager

whereas the slave node includes

DataNode
Task Tracker / Node Manager

Note:

Namenode can communicate with Datanode vise-versa
Job Tracker can communicate with Task Tracker vise-versa

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains a master/slave architecture. This architecture consist of a single NameNode performs the role of master, and multiple DataNodes performs the role of a slave.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java language is used to develop HDFS. So any machine that supports Java language can easily run the NameNode and DataNode software.

NameNode

It is a single master server exist in the HDFS cluster.

As it is a single node, it may become the reason of single point failure.

It manages the file system namespace by executing an operation like the opening, renaming and closing the files.

It simplifies the architecture of the system.

DataNode

The HDFS cluster contains multiple DataNodes.

Each DataNode contains multiple data blocks.

These data blocks are used to store data.

It is the responsibility of DataNode to read and write requests from the file system's clients.

It performs block creation, deletion, and replication upon instruction from the NameNode.

Secondary Namenode

MapReduce Layer

The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is rescheduled.

Job Tracker

The role of Job Tracker is to accept the MapReduce jobs from client and process the data by using NameNode.

In response, NameNode provides metadata to Job Tracker.

Task Tracker

It works as a slave node for Job Tracker.

It receives task and code from Job Tracker and applies that code on the file. This process can also be called as a Mapper.

Different layers used in Hadoop Architecture :

The four distinct layers are

Application Programming interface
Processing Framework Layer
Cluster Resource management
Distributed Storage Layer

Advantages of Hadoop

Fast: In HDFS the data distributed over the cluster and are mapped which helps in faster retrieval .It is able to process terabytes of data in minutes and Peta bytes in hours.

Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.

Cost Effective: Hadoop is open source and uses commodity hardware to store data so it really cost effective as compared to traditional relational database management system.

Resilient to failure: HDFS has the property with which it can replicate data over the network, so if one node is down or some other network failure happens, then Hadoop takes the other copy of data and use it. Normally, data are replicated thrice but the replication factor is configurable.

BigData

Pages

Friday, 9 April 2021

Basic Building Blocks of Hadoop - HDFS & MapReduce

No comments:

Post a Comment

Friends-of-friends-Map Reduce program

Report Abuse

Labels