1) What is HDFS (Hadoop Distributed File System)?
Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure, and it provides high-throughput access to applications by reading and writing data in parallel.
Components of HDFS:
- NameNode – Also known as the master node. The NameNode stores metadata, i.e. the number of blocks, their locations, replicas and other details. This metadata is kept in memory on the master for faster retrieval. The NameNode maintains and manages the slave nodes and assigns tasks to them. It should be deployed on reliable hardware as it is the centerpiece of HDFS.
- DataNode – Also known as the slave node. In HDFS, the DataNode is responsible for storing the actual data and performs read and write operations as per the clients' requests.
Tasks of the NameNode:
- Manages the file system namespace.
- Regulates clients' access to files.
- Executes file system operations such as naming, opening and closing files and directories.
Tasks of the DataNode:
- Performs block replica creation, deletion, and replication according to the instructions of the NameNode.
- Manages the data storage of the system.
3) Why is block size set to 128 MB in Hadoop HDFS?
A block is a continuous location on the hard drive where data is stored. In general, a FileSystem stores data as a collection of blocks. HDFS stores each file as blocks and distributes them across the Hadoop cluster. In HDFS, the default size of a data block is 128 MB, which we can configure as per our requirement. The block size is set to 128 MB:
- To reduce disk seeks (I/O). The larger the block size, the fewer blocks per file and the fewer disk seeks, and each block can still be transferred within a reasonable time, and in parallel.
- HDFS holds huge data sets, i.e. terabytes and petabytes of data. If we took a 4 KB block size, as in a Linux file system, we would have far too many blocks and therefore too much metadata. Managing this huge number of blocks and metadata would create huge overhead and traffic, which is something we don't want. So the block size is set to 128 MB. On the other hand, the block size can't be so large that the system waits a very long time for the last unit of data processing to finish its work.
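For reference, the block size can be set cluster-wide in hdfs-site.xml. A minimal sketch (the property name dfs.blocksize is standard; 134217728 bytes = 128 MB):
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
  <description>Default block size in bytes (128 MB)</description>
</property>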
4) How is data or a file written into HDFS?
When a client wants to write a file to HDFS, it communicates with the NameNode for metadata. The NameNode responds with the number of blocks, the replication factor, and the DataNodes to use. Then, on the basis of this information from the NameNode, the client splits the file into multiple blocks and starts sending them to the first DataNode. The client sends block A to DataNode 1 along with the details of the other two DataNodes.
When DataNode 1 receives block A from the client, it copies the same block to DataNode 2 on the same rack. As both DataNodes are in the same rack, the block is transferred via the rack switch. Now DataNode 2 copies the same block to DataNode 3. As these two DataNodes are in different racks, the block is transferred via an out-of-rack switch.
After the DataNodes receive the block from the client, each DataNode sends a write confirmation to the NameNode, and a write confirmation is sent back to the client. The same process repeats for each block of the file. Data transfer happens in parallel for faster writing of blocks.
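As a minimal illustration of the client side of this flow, the sketch below writes a small file through the HDFS Java API; the path and contents are hypothetical, and the block placement and pipelining described above happen transparently inside the client library:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // client handle to HDFS
    // Create the file; the NameNode supplies block and DataNode details behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/sample_hdfs/test.txt"))) {
      out.writeBytes("hello hdfs\n");              // data is streamed through the DataNode pipeline
    }
    fs.close();
  }
}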
5) Can multiple clients write into an HDFS file concurrently?
Multiple clients cannot write into an HDFS file at the same time. Apache Hadoop HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. Now suppose some other client wants to write into that file: it asks the NameNode for the write operation, and the NameNode first checks whether it has already granted the lease for writing into that file to someone else. If someone else already holds the lease, the NameNode rejects the write request of the other client.
6) How is data or a file read in HDFS?
To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the names of the files, the list of blocks, their locations, the replication factor and other details. Now the client communicates with the DataNodes where the blocks are present. The client starts reading data in parallel from the DataNodes, on the basis of the information received from the NameNode.
Once the client or application receives all the blocks of the file, it combines these blocks to form the file. To improve read performance, the locations of the replicas of each block are ordered by their distance from the client, and HDFS selects the replica which is closest to the client. This reduces read latency and bandwidth consumption: the client first tries a replica on the same node, then another node in the same rack, and finally a DataNode in another rack.
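A matching read-side sketch with the HDFS Java API; the path is hypothetical, and block locations and nearest-replica selection are handled inside the client:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Open the file; the NameNode returns block locations and the client reads from the closest replicas.
    try (FSDataInputStream in = fs.open(new Path("/sample_hdfs/test.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream the contents to stdout
    }
    fs.close();
  }
}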
7) Why does HDFS store data on commodity hardware despite the higher chance of failures?
HDFS stores data on commodity hardware because HDFS is highly fault-tolerant. HDFS provides fault tolerance by replicating the data blocks and distributing the replicas among different DataNodes across the cluster. By default, the replication factor is 3, which is configurable. Replication of data solves the problem of data loss in unfavorable conditions such as a node crash or hardware failure. So, when any machine in the cluster goes down, the client can easily access its data from another machine that contains the same copy of the data blocks.
8) How is indexing done in HDFS?
Hadoop has its own way of indexing. Once the Hadoop framework stores the data as per the block size, HDFS keeps storing the last part of each piece of data, which indicates where the next part of the data will be. In fact, this is the basis of HDFS.
9) What is a Heartbeat in HDFS?
A heartbeat is the signal that the NameNode receives from the DataNodes to show that they are functioning (alive). The NameNode and DataNodes communicate using heartbeats. If, after a certain time, the NameNode does not receive any heartbeat from a DataNode, that node is considered dead, and the NameNode then schedules the creation of new replicas of that node's blocks on other DataNodes.
Heartbeats from a DataNode also carry information about its total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by
using dfs.heartbeat.interval in hdfs-site.xml.
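For instance, the corresponding hdfs-site.xml entry looks roughly like this (the property name and the 3-second default are standard; the description text is ours):
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
  <description>DataNode heartbeat interval in seconds</description>
</property>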
10) How do you copy a file into HDFS with a block size different from the existing block size configuration?
One can copy a file into HDFS with a different block size by using -Ddfs.blocksize=block_size, where block_size is in bytes.
Let us explain it with an example. Suppose you want to copy a file called test.txt of size, say, 128 MB into HDFS, and for this file you want the block size to be 32 MB (33554432 bytes) in place of the default (128 MB). You would issue the following command:
hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs
Now, you can check the HDFS block size associated with this file by:
hadoop fs -stat %o /sample_hdfs/test.txt
Else, you can also use the NameNode web UI to browse the HDFS directory.
11) Why does HDFS perform replication, although it results in data redundancy?
In HDFS, replication provides fault tolerance. Data replication is one of the most important and unique features of HDFS. Replication of data solves the problem of data loss in unfavorable conditions such as a node crash or hardware failure. By default, HDFS creates 3 replicas of each block across the cluster, and we can change this as per the need. So, if any node goes down, we can recover the data on that node from another node.
Replication does lead to the consumption of a lot of space, but the user can always add more nodes to the cluster if required. It is very rare to have free-space issues in a practical cluster, as the very first reason to deploy HDFS was to store huge data sets. Also, one can change the replication factor to save HDFS space, or use a different codec provided by Hadoop to compress the data.
12) What is the default replication factor and how will you change it?
The default replication factor is 3. One can change it in the following three ways:
By adding this property to hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>5</value>
  <description>Block Replication</description>
</property>
You can also change the replication factor on a per-file basis using the command:
hadoop fs -setrep -w 3 /file_location
You can also change the replication factor for all the files in a directory by using:
hadoop fs -setrep -w 3 -R /directory_location
13) Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file occupies at least one block and block metadata is held in memory by the NameNode.
Reading through small files normally causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each small file, all of which is an inefficient data access pattern.
Hadoop Archives (HAR) deal with the small-files issue. A HAR packs a number of small files into a larger file, so one can access the original files transparently and in parallel (without expanding the archive) and efficiently.
Hadoop Archives are special-format archives that map to a file system directory. A Hadoop Archive always has a *.har extension. In particular, Hadoop MapReduce can use Hadoop Archives as input.
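For illustration, a HAR is created with the standard hadoop archive tool; the paths below are hypothetical:
hadoop archive -archiveName files.har -p /user/dataflair/input /user/dataflair/archives
hadoop fs -ls har:///user/dataflair/archives/files.har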
14) What do you mean by the High Availability of a NameNode in Hadoop HDFS?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the NameNode fails, all clients, including MapReduce jobs, are unable to read, write or list files. In such an event, the whole Hadoop system would be out of service until a new NameNode is brought online.
Hadoop 2.0 overcomes this single point of failure by providing support for multiple NameNodes. The high-availability feature adds an extra NameNode (an active/standby NameNode pair) to the Hadoop architecture, configured for automatic failover. If the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node and the cluster continues to work.
The initial implementation of HDFS NameNode high availability provided for a single active NameNode and a single standby NameNode. However, some deployments require a higher degree of fault tolerance. This is enabled by Hadoop 3.0, which allows the user to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two NameNodes rather than one.
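A skeletal hdfs-site.xml sketch for such a setup; the nameservice ID "mycluster" and the NameNode IDs are hypothetical, while the property names are the standard HDFS HA ones:
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2,nn3</value>
</property>
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>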
15) What is Fault Tolerance in HDFS?
Fault tolerance in HDFS is the working strength of the system in unfavorable conditions (like a node crash or hardware failure). HDFS handles faults through the process of replica creation. When a client stores a file in HDFS, the file is divided into blocks, and the blocks of data are distributed across different machines in the HDFS cluster. HDFS creates a replica of each block on other machines in the cluster; by default, it creates 3 copies of each block. If any machine in the cluster goes down or fails due to unfavorable conditions, the user can still easily access that data from the other machines in the cluster on which a replica of the block is present.
16) What is Rack Awareness?
Rack Awareness improves network traffic while reading/writing files: the NameNode chooses DataNodes that are on the same rack or a nearby rack. The NameNode obtains rack information by maintaining the rack ID of each DataNode, and this concept of choosing DataNodes based on rack information is Rack Awareness. In HDFS, the NameNode makes sure that all the replicas are not stored on a single rack; it follows the Rack Awareness algorithm to reduce latency as well as to improve fault tolerance.
Since the default replication factor is 3, according to the Rack Awareness algorithm the first replica of a block is stored on the local rack, the next replica is stored on another DataNode within the same rack, and the third replica is stored on a different rack.
In Hadoop, we need Rack Awareness because it improves:
- High availability and reliability of data.
- The performance of the cluster.
- Network bandwidth usage.
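Rack information is usually supplied through a topology script configured in core-site.xml; the script path below is hypothetical, the property name is standard:
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
  <description>Script that maps DataNode hostnames/IPs to rack IDs</description>
</property>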
17) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, the NameNode is a single point of failure (SPOF). If the NameNode fails, all clients are unable to read/write files, and the whole Hadoop system is out of service until a new NameNode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high-availability feature adds an extra NameNode to the Hadoop architecture and provides automatic failover: if the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node and the cluster continues to work.
The initial implementation of NameNode high availability provided for a single active/standby NameNode pair. However, some deployments require a higher degree of fault tolerance. Hadoop 3.0 enables this by allowing the user to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two NameNodes rather than one.
18) Explain Erasure Coding in Hadoop?
By default, HDFS replicates each block three times. Replication in HDFS is a very simple and robust form of redundancy to shield against DataNode failure, but it is expensive: the 3x replication scheme has a 200% overhead in storage space and other resources.
Therefore, Hadoop 3.0 introduced Erasure Coding, a new feature that can be used in place of replication. It provides the same level of fault tolerance with far less storage, at roughly a 50% storage overhead.
Erasure Coding borrows the idea from RAID (Redundant Array of Inexpensive Disks), which implements it through striping: logically sequential data (such as a file) is divided into smaller units (such as a bit, byte or block), and these units are stored on different disks.
Encoding – For each stripe of data cells, parity cells are calculated and stored, and errors are recovered through the parity. Erasure coding extends a message with redundant data for fault tolerance. The EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output. The data cells and parity cells together are called an erasure coding group.
There are two algorithms available for Erasure Coding:
- XOR algorithm
- Reed-Solomon algorithm
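In Hadoop 3, erasure coding policies are applied per directory with the hdfs ec tool; for example (the path is hypothetical, RS-6-3-1024k is one of the built-in Reed-Solomon policies):
hdfs ec -listPolicies
hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
hdfs ec -getPolicy -path /data/cold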
19) What is Disk Balancer in Hadoop?
HDFS provides a command-line tool called diskbalancer, which distributes data evenly across all the disks of a DataNode. This tool operates against a given DataNode and moves blocks from one disk to another.
The disk balancer works by creating a plan (a set of statements) and executing that plan on the DataNode. The plan describes how much data should move between disks and is composed of multiple steps; a move step has a source disk, a destination disk and the number of bytes to move. The plan executes against an operational DataNode.
By default, the disk balancer is not enabled; to enable it, dfs.disk.balancer.enabled must be set to true in hdfs-site.xml.
When we write a new block to HDFS, the DataNode uses a volume-choosing policy to pick the disk for the block (each data directory is a volume in HDFS terminology). The two such policies are:
- Round-robin: distributes new blocks evenly across the available disks.
- Available space: writes data to the disk that has the most free space (by percentage).
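Typical disk balancer usage, sketched with a hypothetical DataNode hostname:
hdfs diskbalancer -plan datanode1.example.com          # writes a <hostname>.plan.json file
hdfs diskbalancer -execute datanode1.example.com.plan.json
hdfs diskbalancer -query datanode1.example.com         # check the status of the running plan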
20) How would you check whether your NameNode is working or not?
There are several ways to check the status of the NameNode. Most commonly, one uses the jps command to check that all the HDFS daemons are running.
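For example (output varies by cluster; 9870 is the usual NameNode web UI port on Hadoop 3, 50070 on Hadoop 2):
jps                      # NameNode should appear in the list of running JVMs
hdfs dfsadmin -report    # prints the cluster status as seen by the NameNode
# or open the NameNode web UI, typically http://<namenode-host>:9870/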
21) Is the NameNode machine the same as a DataNode machine in terms of hardware?
No. Unlike the DataNodes, the NameNode is a highly available server that manages the file system namespace and maintains the metadata information. Metadata information includes the number of blocks, their locations, replicas and other details. The NameNode also executes file system operations such as naming, opening and closing files/directories.
Therefore, the NameNode requires more RAM, for storing the metadata of millions of files, whereas a DataNode is responsible for storing the actual data in HDFS and performs read and write operations as per the clients' requests. A DataNode therefore needs a higher disk capacity for storing huge data sets.
22) What are file permissions in HDFS and how does HDFS check permissions for files or directories?
The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories. For each file or directory, we can manage permissions for 3 distinct user classes: the owner, the group, and others. There are 3 different permissions for each user class: read (r), write (w), and execute (x).
- For files, the r permission is required to read the file, and the w permission is required to write to the file.
- For directories, the r permission is required to list the contents of the directory, the w permission is required to create or delete files within it, and the x permission is required to access a child of the directory.
HDFS checks permissions for a file or directory as follows:
- If the username matches the owner of the file or directory, Hadoop tests the owner's permissions.
- If the group matches the directory's group, then Hadoop tests the user's group permissions.
- Hadoop tests the "other" permissions when the owner and the group names don't match.
- If none of the permission checks succeed, the client's request is denied.
23) If the number of DataNodes increases, do we need to upgrade the NameNode?
The NameNode stores metadata, i.e. the number of blocks, their locations and replicas. This metadata is kept in memory on the master for faster retrieval of data. The NameNode maintains and manages the slave nodes, assigns tasks to them, regulates clients' access to files, and executes file system operations such as naming, opening and closing files/directories.
During Hadoop installation, the NameNode hardware is sized based on the size of the cluster. Mostly we don't need to upgrade the NameNode when DataNodes are added, because it does not store the actual data, only the metadata, so such a requirement rarely arises.
These Big Data Hadoop interview questions are ones that are asked frequently, and by going through these HDFS interview questions you will be able to answer many other related questions in your interview.
MAP REDUCE
1) What is Hadoop MapReduce?
MapReduce is the data processing layer of Hadoop. It is the framework for writing applications that process the vast amounts of data stored in HDFS. It processes a huge amount of data in parallel by dividing the job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce.
- Map – The first phase of processing, in which we specify all the complex logic/business rules/costly code. The map takes a set of data and converts it into another set of data, where individual elements are broken into tuples (key-value pairs).
- Reduce – The second phase of processing, in which we specify light-weight processing like aggregation/summation. Reduce takes the output from the map as input, combines the tuples (key-value pairs) based on the key, and then modifies the value of the key accordingly.
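The classic word-count program illustrates the two phases. Below is a minimal sketch using the standard org.apache.hadoop.mapreduce API; class names and paths are illustrative:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: break each line into words and emit (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}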
2) Why Hadoop MapReduce?
When we store huge amounts of data in HDFS, the first question that arises is: how do we process this data?
Transferring all this data to a central node for processing is not going to work; we would have to wait forever for the data to transfer over the network. Google faced this same problem with its distributed Google File System (GFS) and solved it using the MapReduce data processing model.
Challenges before MapReduce:
- Costly – Keeping all the data (terabytes) on one server or in a database cluster is very expensive, and also hard to manage.
- Time-consuming – Using a single machine, we cannot analyze the data (terabytes), as it would take a lot of time.
How MapReduce overcomes these challenges:
- Cost-efficient – It distributes the data over multiple low-configuration machines.
- Time-efficient – If we want to analyze the data, we can write the analysis code in the Map function and the aggregation code in the Reduce function and execute them. This MapReduce code goes to every machine which has a part of our data and executes on that specific part. Hence, instead of moving terabytes of data, we just move kilobytes of code, and this kind of movement is time-efficient.
3) What is a key-value pair in MapReduce?
Hadoop MapReduce implements a data model which represents data as key-value pairs. Both the input and the output of the MapReduce framework must be in the form of key-value pairs.
In Hadoop, if the schema is static we can work directly on columns instead of key-value pairs; if the schema is not static, we work on keys and values. Keys and values are not intrinsic properties of the data: the user analyzing the data chooses the key-value pair. A key-value pair in Hadoop MapReduce is generated in the following way:
- InputSplit – The logical representation of data. An InputSplit represents the data that an individual Mapper will process.
- RecordReader – It communicates with the InputSplit (created by the InputFormat) and converts the split into records. Records are in the form of key-value pairs suitable for reading by the mapper. By default, TextInputFormat is used, and its RecordReader converts the data into key-value pairs where:
- Key – the byte offset of the beginning of the line within the file, so it is unique when combined with the file name;
- Value – the contents of the line, excluding line terminators.
For example, if the file content is "on the top of the crumpetty Tree", then:
Key – 0
Value – on the top of the crumpetty Tree
4) Why MapReduce uses the key-value pair to process the data?
MapReduce works on unstructured and semi-structured data as well as structured data. One can read structured data, like the data stored in an RDBMS, by columns, but handling unstructured data is feasible using key-value pairs, and the very core idea of MapReduce works on the basis of these pairs: the framework maps the data into a collection of key-value pairs in the mapper and then reduces over all the pairs with the same key. As Google stated in their research publication, in most computations the map operation is applied to each logical "record" in the input to compute a set of intermediate key-value pairs, and then a reduce operation is applied to all the values that share the same key, in order to combine the derived data appropriately.
In conclusion, we can say that key-value pairs are the best solution for working on data problems with MapReduce.
5) How many Mappers run for a MapReduce job in Hadoop?
A Mapper task processes each input record (from the RecordReader) and generates key-value pairs. The number of mappers depends on two factors:
- The amount of data we want to process, along with the block size; it depends on the number of InputSplits. For example, with a block size of 128 MB and 10 TB of input data, we will have roughly 82,000 maps. Ultimately, the InputFormat determines the number of maps.
- The configuration of the slave, i.e. the number of cores and the amount of RAM available on the slave. The right number of maps per node is usually between 10 and 100. The Hadoop framework should give 1 to 1.5 cores of the processor to each mapper; thus, for a 15-core processor, about 10 mappers can run.
In a MapReduce job, one can control the number of mappers by changing the block size: changing the block size increases or decreases the number of InputSplits. One can also suggest more map tasks manually using JobConf's conf.setNumMapTasks(int num).
Mappers = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB,
then Mappers = (1000 * 1000) / 100 = 10,000
6) How many Reducers run for a MapReduce job in Hadoop?
The Reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reduce function on each of them to generate the output. The output of the reducer is the final output, which it stores in HDFS. Usually, in the reducer, we do aggregation or summation-style computation.
The user sets the number of reducers for the job with Job.setNumReduceTasks(int) (a driver sketch follows the list below). The right number of reducers is given by the formula:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers can launch immediately and start transferring map outputs as the maps finish.
With 1.75, the faster nodes finish their first round of reduces and then launch a second wave of reduces.
Increasing the number of reducers:
- increases framework overhead,
- improves load balancing,
- lowers the cost of failures.
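In the job driver, the reducer count is set with Job.setNumReduceTasks. A hedged sketch, where the node and container counts are made-up figures plugged into the 0.95 rule:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static void main(String[] args) throws Exception {
    // Assumed cluster figures: 10 worker nodes, 8 containers per node.
    int nodes = 10, containersPerNode = 8;
    int reducers = (int) (0.95 * nodes * containersPerNode);  // = 76
    Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
    job.setNumReduceTasks(reducers);  // mapper, reducer and I/O paths would be set as usual
    System.out.println("Reducers requested: " + reducers);
  }
}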
7) What is the difference between Reducer and Combiner in Hadoop MapReduce?
The Combiner is a mini-reducer that performs a local reduce task. The Combiner runs on the map output and its output is passed to the reducer as input; a combiner is usually used for network optimization. The Reducer takes the set of intermediate key-value pairs produced by the mappers as input and runs a reduce function on each of them to generate the output, and the output of the reducer is the final output.
- Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
- Combiners can operate only on a subset of keys and values, i.e. combiners can only be applied to functions that are commutative and associative.
- Combiner functions take input from a single mapper, while reducers take data from multiple mappers as a result of partitioning.
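Wiring a combiner into the word-count driver shown earlier takes one line; the reducer class can double as the combiner because summation is commutative and associative:
// In the WordCount driver above, reuse the reducer as a local combiner:
job.setCombinerClass(IntSumReducer.class);   // runs on each mapper's output before the shuffle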
8) What happens if the number of reducers is 0 in Hadoop?
If we set the number of reducers to 0, then no reducer executes and no aggregation takes place. In such a case, we prefer a "map-only job" in Hadoop: the map does all the work on its InputSplit, the reducer does nothing, and the map output is the final output.
Between the map and reduce phases there is a shuffle and sort phase, responsible for sorting the keys in ascending order and grouping values by key. This phase is very expensive, so if the reduce phase is not required we should avoid it. Avoiding the reduce phase also eliminates the sort and shuffle phase, which saves network congestion: in shuffling, the output of the mappers travels to the reducers, and when the data size is huge, a lot of data has to travel to the reducers.
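A map-only job is requested from the driver; for example, in the word-count driver sketched earlier:
job.setNumReduceTasks(0);   // no shuffle, sort or reduce; mapper output is written straight to HDFS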
9) What do you mean by shuffling and sorting in MapReduce?
Shuffling and sorting take place after the completion of the map tasks; the shuffle and sort phases in Hadoop occur simultaneously.
Shuffling – The process of transferring data from the mappers to the reducers, i.e. the process by which the system sorts the key-value output of the map tasks and transfers it to the reducers. The shuffle phase is necessary for the reducers; otherwise, they would not have any input. Since shuffling can start even before the map phase has finished, it saves some time and completes the job in less time.
Sorting – The mappers generate intermediate key-value pairs, and before the reducer starts, the MapReduce framework sorts these key-value pairs by key. Sorting helps the reducer easily distinguish when a new reduce task should start, which saves time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers
(setNumReduceTasks(0)).
10) What is the fundamental difference between a MapReduce InputSplit and HDFS
block?
By definition:
- Block – A block is a continuous location on the hard drive where HDFS stores data. In general, a FileSystem stores data as a collection of blocks; in the same way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
- InputSplit – An InputSplit represents the data that an individual Mapper will process. The split is further divided into records, and each record (a key-value pair) is processed by the map function.
Data representation:
- Block – The physical representation of data.
- InputSplit – The logical representation of data, which is what a MapReduce program (or other processing technique) works with during data processing. Importantly, an InputSplit does not contain the input data; it is just a reference to the data.
Size:
- Block – The default size of an HDFS block is 128 MB, which can be configured as per our requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop file system.
- InputSplit – The split size is approximately equal to the block size, by default.
Example:
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks, and a block is the smallest unit of data that can be stored or retrieved from disk; the default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Say we have a file of 130 MB; HDFS will break this file into 2 blocks.
Now, if we want to run a MapReduce operation directly on the blocks, it will not process correctly, because the 2nd block is incomplete. The InputSplit solves this problem: an InputSplit groups blocks together logically as a single unit, since the InputSplit includes the location of the next block and the byte offset of the data needed to complete the record.
From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it has just the information about block addresses or locations. Thus, during MapReduce execution, Hadoop scans through the blocks and creates InputSplits, and the split acts as a broker between the block and the mapper.
11) What is a Speculative Execution in Hadoop MapReduce?
MapReduce breaks jobs into tasks, and these tasks run in parallel rather than sequentially, which reduces overall execution time. This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job. There are various reasons for a task to be slow, such as hardware degradation, and it may be difficult to detect the cause, since the task still completes successfully, only taking more time than expected.
Apache Hadoop doesn't try to diagnose and fix slow-running tasks. Instead, it tries to detect them and runs backup tasks for them. This is called speculative execution in Hadoop, and the backup tasks are called speculative tasks. The Hadoop framework first launches all the tasks for the job, then launches speculative tasks for those tasks that have been running for some time (about a minute) and have not made, on average, as much progress as the other tasks of the job. If the original task completes before the speculative task, the speculative task is killed; on the other hand, the original task is killed if the speculative task finishes first.
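Speculative execution is controlled through the standard MapReduce properties, which both default to true; a mapred-site.xml sketch that keeps it on for maps and turns it off for reduces:
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>false</value>
</property>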
12) How to submit extra files(jars, static files) for MapReduce job during runtime?
The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, etc.
First of all, an application which wants to use the distributed cache to distribute a file should make sure that the file is available at a URL; the URL can be either hdfs:// or http://. If the file is present at such a URL, the user marks it as a cache file to distribute. The framework copies the cache file to all the nodes before starting the tasks on those nodes. The files are copied only once per job, and applications should not modify them.
By default, the size of the distributed cache is 10 GB; we can adjust it using local.cache.size.
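With the newer API, a cache file is registered on the Job and read back inside the tasks; a minimal sketch with a hypothetical HDFS path:
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-demo");
    // Distribute a read-only lookup file to every task node before the tasks start.
    job.addCacheFile(new URI("hdfs:///user/dataflair/lookup.txt#lookup"));
    // Inside a Mapper/Reducer, context.getCacheFiles() returns these URIs,
    // and the file is available locally via the symlink name "lookup".
  }
}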
JobTracker and TaskTracker in MapReduce
JobTracker and TaskTracker are the two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster and NodeManager daemons.
JobTracker –
1. The JobTracker process runs on a separate node, not usually on a DataNode.
2. The JobTracker is an essential daemon for MapReduce execution in MRv1. It is replaced by the ResourceManager/ApplicationMaster in MRv2.
3. The JobTracker receives requests for MapReduce execution from the client.
4. The JobTracker talks to the NameNode to determine the location of the data.
5. The JobTracker finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the available slots to execute a task on a given node.
6. The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
7. The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
8. When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and existing MapReduce jobs will be halted.
TaskTracker –
1. TaskTracker runs on DataNode. Mostly on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling the
progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive,
JobTracker will assign the task executed by the TaskTracker to another node.
Hadoop Yarn
Apache YARN ("Yet Another Resource Negotiator") is the resource management layer of Hadoop, introduced in Hadoop 2.x. YARN allows different data processing engines, such as graph processing, interactive processing and stream processing as well as batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, YARN also does job scheduling. YARN extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (a highly reliable and popular storage system) and an economical cluster.
Apache YARN is also the data operating system for Hadoop 2.x. This architecture of Hadoop 2.x provides a general-purpose data processing platform which is not limited to MapReduce. It enables Hadoop to run purpose-built data processing systems other than MapReduce and allows several different frameworks to run on the same hardware where Hadoop is deployed.
Hadoop Yarn Architecture
In this section of the Hadoop YARN tutorial, we will discuss the complete architecture of YARN. The Apache YARN framework consists of a master daemon known as the ResourceManager, a slave daemon called the NodeManager (one per slave node), and the ApplicationMaster (one per application).
Resource Manager (RM)
It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and memory) among all the applications and arbitrates cluster resources between competing applications.
The ResourceManager has two main components:
- Scheduler
- Application Manager
a) Scheduler
The Scheduler is responsible for allocating resources to the running applications. It is a pure scheduler: it performs no monitoring or tracking of applications, and it offers no guarantees about restarting failed tasks, whether they fail due to application failure or hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager (NM)
It is the slave daemon of YARN. The NM is responsible for monitoring the containers' resource usage, reporting it to the ResourceManager, and managing the user processes on that machine. The YARN NodeManager also tracks the health of the node on which it is running. The design allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup. Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.
Application Master (AM)
One ApplicationMaster runs per application. It negotiates resources from the ResourceManager and works with the NodeManagers; it manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to start the application's individual tasks.
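The yarn command-line tool exposes this architecture; for instance (output depends on the cluster):
yarn node -list                               # NodeManagers registered with the ResourceManager
yarn application -list                        # running applications and their ApplicationMasters
yarn logs -applicationId <application_id>     # aggregated logs for a finished application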
When a client wants to write a file to HDFS, it communicates to namenode for metadata. The Namenode responds with details of a number of blocks, replication factor. Then, on basis of
information from NameNode, client split files into multiple blocks. After that client starts sending them to first DataNode. The client sends block A to Datanode 1 with other twoDatanodes details.
When Datanode 1 receives block A sent from the client, Datanode 1 copy same block to
Datanode 2 of the same rack. As both the Datanodes are in the same rack so block transfer via
rack switch. Now Datanode 2 copies the same block to Datanode 3. As both the Datanodes are in
different racks so block transfer via an out-of-rack switch.
After the Datanode receives the blocks from the client. Then Datanode sends write confirmation
to Namenode. Now Datanode sends write confirmation to the client. The Same process will
repeat for each block of the file. Data transfer happen in parallel for faster write of blocks.
5) Can multiple clients write into an HDFS file concurrently?
Multiple clients cannot write into an HDFS file at same time. Apache Hadoop HDFS follows
single writer multiple reader models. The client which opens a file for writing, the NameNode
grant a lease. Now suppose, some other client wants to write into that file. It asks NameNode for
the write operation. NameNode first checks whether it has granted the lease for writing into that
file to someone else or not. When someone already acquires the lease, then, it will reject the
write request of the other client.
6) How data or file is read in HDFS?
To read from HDFS, the first client communicates to namenode for metadata. A client comes out
of namenode with the name of files and its location. The Namenode responds with details of the
number of blocks, replication factor. Now client communicates with Datanode where the blocks
are present. Clients start reading data parallel from the Datanode. It read on the basis of
information received from the namenodes.
Once client or application receives all the blocks of the file, it will combine these blocks to form
a file. For read performance improvement, the location of each block ordered by their distance
from the client. HDFS selects the replica which is closest to the client. This reduces the read
latency and bandwidth consumption. It first read the block in the same node. Then another node
in the same rack, and then finally another Datanode in another rack.
7) Why HDFS stores data using commodity hardware despite the higher chance of
failures?
HDFS stores data using commodity hardware because HDFS is highly fault-tolerant. HDFS
provides fault tolerance by replicating the data blocks. And then distribute it among different
DataNodes across the cluster. By default, replication factor is 3 which is configurable.
Replication of data solves the problem of data loss in unfavorable conditions. And unfavorable
conditions are crashing of the node, hardware failure and so on. So, when any machine in the
cluster goes down, then the client can easily access their data from another machine. And this
machine contains the same copy of data blocks.
8) How is indexing done in HDFS?
Hadoop has a unique way of indexing. Once Hadoop framework store the data as per the block
size. HDFS will keep on storing the last part of the data which will say where the next part of the
data will be. In fact, this is the base of HDFS.
9) What is a Heartbeat in HDFS?
Heartbeat is the signals that NameNode receives from the DataNodes to show that it is
functioning (alive). NameNode and DataNode do communicate using Heartbeat. If after the
certain time of heartbeat NameNode do not receive any response from DataNode, then that Node
is dead. The NameNode then schedules the creation of new replicas of those blocks on other
DataNodes.
Heartbeats from a Datanode also carry information about total storage capacity. Also, carry the
fraction of storage in use, and the number of data transfers currently in progress.
The default heartbeat interval is 3 seconds. One can change it by
using dfs.heartbeat.interval in hdfs-site.xml.
10) How to copy a file into HDFS with a different block size to that of existing block size
configuration?
One can copy a file into HDFS with a different block size by using:
–Ddfs.blocksize=block_size, where block_size is in bytes.
So, let us explain it with an example:
Suppose, you want to copy a file called test.txt of size, say of 128 MB, into the hdfs. And for this
file, you want the block size to be 32MB (33554432 Bytes) in place of the default (128 MB). So,
you would issue the following command:
Hadoop fs –Ddfs.blocksize=33554432 –copyFromlocal/home/dataflair/test.txt/sample_hdfs
Now, you can check the HDFS block size associated with this file by:
hadoop fs –stat %o/sample_hdfs/test.txt
Else, you can also use the NameNode web UI for seeing the HDFS directory.
11) Why HDFS performs replication, although it results in data redundancy?
In HDFS, Replication provides the fault tolerance. Data replication is one of the most important
and unique features of HDFS. Replication of data solves the problem of data loss in unfavorable
conditions. Unfavorable conditions are crashing of the node, hardware failure and so on. HDFS
by default creates 3 replicas of each block across the cluster in Hadoop. And we can change it as
per the need. So, if any node goes down, we can recover data on that node from the other node.
In HDFS, Replication will lead to the consumption of a lot of space. But the user can always add
more nodes to the cluster if required. It is very rare to have free space issues in the practical
cluster. As the very first reason to deploy HDFS was to store huge data sets. Also, one can
change the replication factor to save HDFS space. Or one can also use different codec provided
by the Hadoop to compress the data.
12) What is the default replication factor and how will you change it?
The default replication factor is 3. One can change this in following three ways:
By adding this property to hdfs-site.xml:
1. <property>
2. <name>dfs.replication</name>
3. <value>5</value>
4. <description>Block Replication</description>
5. </property>
You can also change the replication factor on per-file basis using the command:hadoop fs –
setrep –w 3 / file_location
You can also change replication factor for all the files in a directory by using:hadoop fs –
setrep –w 3 –R / directoey_location
13) Explain Hadoop Archives?
Apache Hadoop HDFS stores and processes large (terabytes) data sets. However, storing a large
number of small files in HDFS is inefficient, since each file is stored in a block, and block
metadata is held in memory by the namenode.
Reading through small files normally causes lots of seeks and lots of hopping from datanode to
datanode to retrieve each small file, all of which is inefficient data access pattern.
Hadoop Archive (HAR) basically deals with small files issue. HAR pack a number of small
files into a large file, so, one can access the original files in parallel transparently (without
expanding the files) and efficiently.
Hadoop Archives are special format archives. It maps to a file system directory. Hadoop Archive
always has a *.har extension. In particular, Hadoop MapReduce uses Hadoop Archives as an
Input.
14) What do you mean by the High Availability of a NameNode in Hadoop HDFS?
In Hadoop 1.0, NameNode is a single point of Failure (SPOF), if namenode fails, all clients
including MapReduce jobs would be unable to read, write file or list files. In such event, whole
Hadoop system would be out of service until new namenode is brought online.
Hadoop 2.0 overcomes this single point of failure by providing support for multiple
NameNode. High availability feature provides an extra NameNode (active standby NameNode)
to Hadoop architecture which is configured for automatic failover. If active NameNode fails,
then Standby Namenode takes all the responsibility of active node and cluster continues to work.
The initial implementation of HDFS namenode high availability provided for single active
namenode and single standby namenode. However, some deployment requires high degree faulttolerance, this is enabled by new version 3.0, which allows the user to run multiple standby
namenode. For instance configuring three namenode and five journal nodes, the cluster is able to
tolerate the failure of two nodes rather than one.
15) What is Fault Tolerance in HDFS?
Fault-tolerance in HDFS is working strength of a system in unfavorable conditions ( like the
crashing of the node, hardware failure and so on). HDFS control faults by the process of replica
creation. When client stores a file in HDFS, then the file is divided into blocks and blocks of data
are distributed across different machines present in HDFS cluster. And, It creates a replica of
each block on other machines present in the cluster. HDFS, by default, creates 3 copies of a
block on other machines present in the cluster. If any machine in the cluster goes down or fails
due to unfavorable conditions, then also, the user can easily access that data from other machines
in the cluster in which replica of the block is present.
16) What is Rack Awareness?
Rack Awareness improves the network traffic while reading/writing file. In which NameNode
chooses the DataNode which is closer to the same rack or nearby rack. NameNode achieves rack
information by maintaining the rack IDs of each DataNode. This concept that chooses Datanodes
based on the rack information. In HDFS, NameNode makes sure that all the replicas are not
stored on the same rack or single rack. It follows Rack Awareness Algorithm to reduce latency
as well as fault tolerance.
Default replication factor is 3, according to Rack Awareness Algorithm. Therefore, the first
replica of the block will store on a local rack. The next replica will store on another datanode
within the same rack. And the third replica stored on the different rack.
In Hadoop, we need Rack Awareness because it improves:
Data high availability and reliability.
The performance of the cluster.
Network bandwidth.
17) Explain the Single point of Failure in Hadoop?
In Hadoop 1.0, NameNode is a single point of Failure (SPOF). If namenode fails, all clients
would unable to read/write files. In such event, whole Hadoop system would be out of service
until new namenode is up.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNode. High
availability feature provides an extra NameNode to Hadoop architecture. This feature
provides automatic failover. If active NameNode fails, then Standby-Namenode takes all the
responsibility of active node. And cluster continues to work.
The initial implementation of Namenode high availability provided for single active/standby
namenode. However, some deployment requires high degree fault-tolerance. So new version 3.0
enable this feature by allowing the user to run multiple standby namenode. For instance
configuring three namenode and five journal nodes. So, the cluster is able to tolerate the failure
of two nodes rather than one.
18) Explain Erasure Coding in Hadoop?
In Hadoop, by default HDFS replicates each block three times for several purposes. Replication
in HDFS is very simple and robust form of redundancy to shield against the failure of datanode.
But replication is very expensive. Thus, 3 x replication scheme has 200% overhead in storage
space and other resources.
Thus, Hadoop 2.x introduced Erasure Coding a new feature to use in the place of Replication.
It also provides the same level of fault tolerance with less space store and 50% storage overhead.
Erasure Coding uses Redundant Array of Inexpensive Disk (RAID). RAID implements EC
through striping. In which it divide logical sequential data (such as a file) into the smaller unit
(such as bit, byte or block). Then, stores data on different disk.
Encoding- In this process, RAID calculates and sort Parity cells for each strip of data cells. And
recover error through the parity. Erasure coding extends a message with redundant data for fault
tolerance. EC codec operates on uniformly sized data cells. In Erasure Coding, codec takes a
number of data cells as input and produces parity cells as the output. Data cells and parity cells
together are called an erasure coding group.
There are two algorithms available for Erasure Coding:
XOR Algorithm
Reed-Solomon Algorithm
19) What is Disk Balancer in Hadoop?
HDFS provides a command line tool called Diskbalancer. It distributes data evenly on all disks
of a datanode. This tool operates against a given datanode and moves blocks from one disk to
another.
Disk balancer works by creating a plan (set of statements) and executing that plan on the
datanode. Thus, the plan describes how much data should move between two disks. A plan
composes multiple steps. Move step has source disk, destination disk and the number of bytes to
move. And the plan will execute against an operational datanode.
By default, disk balancer is not enabled; Hence, to enable disk
balancer dfs.disk.balancer.enabled must be set true in hdfs-site.xml.
When we write new block in hdfs, then, datanode uses volume choosing the policy to choose the
disk for the block. Each directory is the volume in hdfs terminology. Thus, two such policies are:
Round-robin: It distributes the new blocks evenly across the available disks.
Available space: It writes data to the disk that has maximum free space (by percentage).
20) How would you check whether your NameNode is working or not?
There are several ways to check the status of the NameNode. Mostly, one uses the jps command
to check the status of all daemons running in the HDFS.
21) Is Namenode machine same as DataNode machine as in terms of hardware?
Unlike the DataNodes, a NameNode is a highly available server. That manages the File System
Namespace and maintains the metadata information. Metadata information is a number of
blocks, their location, replicas and other details. It also executes file system execution such as
naming, closing, opening files/directories.
Therefore, NameNode requires higher RAM for storing the metadata for millions of files.
Whereas, DataNode is responsible for storing actual data in HDFS. It performsread and write
operation as per request of the clients. Therefore, Datanode needs to have a higher disk capacity
for storing huge data sets.
22) What are file permissions in HDFS and how HDFS check permissions for files or
directory?
For files and directories, Hadoop distributed file system (HDFS) implements a permissions
model. For each file or directory, thus, we can manage permissions for a set of 3 distinct user
classes:
The owner, group, and others.
The 3 different permissions for each user class: Read (r), write (w), and execute(x).
For files, the r permission is to read the file, and the w permission is to write to the file.
For directories, the r permission is to list the contents of the directory. The wpermission
is to create or delete the directory.
X permission is to access a child of the directory.
HDFS check permissions for files or directory:
We can also check the owner‟s permissions if the username matches the owner of the
directory.
If the group matches the directory‟s group, then Hadoop tests the user‟s group
permissions.
Hadoop tests the “other” permission when the owner and the group names don‟t match.
If none of the permissions checks succeed, the client‟s request is denied.
23) If DataNode increases, then do we need to upgrade NameNode?
Namenode stores meta-data i.e. number of blocks, their location, replicas. This meta-data is
available in memory in the master for faster retrieval of data. NameNode maintains and manages
the slave nodes, and assigns tasks to them. It regulates client‟s access to files.
It also executes file system execution such as naming, closing, opening files/directories.
During Hadoop installation, framework determines NameNode based on the size of the cluster.
Mostly we don‟t need to upgrade the NameNode because it does not store the actual data. But it
stores the metadata, so such requirement rarely arise.
These Big Data Hadoop interview questions are the selected ones which are asked frequently and
by going through these HDFS interview questions you will be able to answer many other related
answers in your interview.
MAP REDUCE
1) What is Hadoop MapReduce?
MapReduce is the data processing layer of Hadoop. It is the framework for writing applications
that process the vast amount of data stored in the HDFS. It processes a huge amount of data in
parallel by dividing the job into a set of independent tasks (sub-job). In Hadoop, MapReduce
works by breaking the processing into phases: Map and Reduce.
Map- It is the first phase of processing. In which we specify all the complex
logic/business rules/costly code. The map takes a set of data and converts it into another set
of data. It also breaks individual elements into tuples (key-value pairs).
Reduce- It is the second phase of processing. In which we specify light-weight
processing like aggregation/summation. Reduce takes the output from the map as input. After
that, it combines tuples (key-value) based on the key. And then, modifies the value of the key
accordingly.
2) Why Hadoop MapReduce?
When we store huge amount of data in HDFS, the first question arises is, how to process this
data?
Transferring all this data to a central node for processing is not going to work. And we will have
to wait forever for the data to transfer over the network. Google faced this same problem with
its Distributed Goggle File System (GFS). It solved this problem using a MapReduce data
processing model.
Challenges before MapReduce:
Costly – All the data (terabytes) in one server or as database cluster which is very
expensive. And also hard to manage.
Time-consuming – By using single machine we cannot analyze the data (terabytes) as it
will take a lot of time.
MapReduce overcome these challenges:
Cost-efficient – It distribute the data over multiple low config machines.
Time-efficient – If we want to analyze the data. We can write the analysis code in Map
function. And the integration code in Reduce function and execute it. Thus, this MapReduce
code will go to every machine which has a part of our data and executes on that specific part.
Hence instead of moving terabytes of data, we just move kilobytes of code. So this type of
movement is time-efficient.
3) What is the key- value pair in MapReduce?
Hadoop MapReduce implements a data model, which represents data as key-value pairs. Both
input and output to MapReduce Framework should be in Key-value pairs only.
In Hadoop, if the schema is static we can directly work on the column instead of key-value. But,
the schema is not static we will work on keys and values. Keys and values are not the intrinsic
properties of the data. But the user analyzing the data chooses a key-value pair. A Key-value pair
in Hadoop MapReduce generate in following way:
InputSplit- It is the logical representation of data. InputSplit represents the data which
individual Mapper will process.
RecordReader- It communicates with the InputSplit (created by the InputFormat) and converts the split into records. Records are in the form of key-value pairs that are suitable for reading by the mapper. By default, TextInputFormat is used, and its RecordReader converts the data into key-value pairs.
Key- It is the byte offset of the beginning of the line within the file, so it will be unique when combined with the file name.
Value- It is the contents of the line, excluding line terminators.
For example, if the file content is "on the top of the crumpetty Tree", then:
Key- 0
Value- on the top of the crumpetty Tree
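A tiny illustrative Mapper (the class name OffsetEchoMapper is hypothetical) that simply echoes the (byte offset, line) pairs handed to it by the default TextInputFormat, matching the Key/Value example above:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Echoes the key-value pairs produced by the default RecordReader.
public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key   = byte offset of the line within the file, e.g. 0
        // value = the line contents, e.g. "on the top of the crumpetty Tree"
        context.write(key, value);
    }
}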
4) Why MapReduce uses the key-value pair to process the data?
MapReduce works on unstructured and semi-structured data apart from structured data. One can read structured data, such as the data stored in an RDBMS, by columns, but handling unstructured data is feasible using key-value pairs. The very core idea of MapReduce is based on these pairs: the framework maps the data into a collection of key-value pairs in the mapper and then reduces over all the pairs that share the same key. As Google state in their research publication, in most of these computations a Map operation is applied to each logical "record" in the input to compute a set of intermediate key-value pairs, and a Reduce operation is then applied to all the values that share the same key in order to combine the derived data properly.
In conclusion, we can say that key-value pairs are the best solution to work on data problems in MapReduce.
5) How many Mappers run for a MapReduce job in Hadoop?
Mapper task processes each input record (from RecordReader) and generates a key-value pair.
The number of mappers depends on 2 factors:
The amount of data we want to process, along with the block size: the number of mappers depends on the number of InputSplits. If the block size is 128 MB and we expect 10 TB of input data, we will have roughly 82,000 maps. Ultimately, the InputFormat determines the number of maps.
The configuration of the slave, i.e. the number of cores and the RAM available on the slave. The right number of maps per node is typically between 10 and 100. The Hadoop framework should give 1 to 1.5 cores of the processor to each mapper; thus, for a 15-core processor, 10 mappers can run.
In a MapReduce job, one can control the number of mappers by changing the block size, because changing the block size increases or decreases the number of InputSplits.
Using JobConf's conf.setNumMapTasks(int num), one can also request more map tasks manually (this is only a hint to the framework).
Mapper = (total data size) / (input split size)
If data size = 1 TB and input split size = 100 MB, then
Mapper = (1,000 * 1,000) / 100 = 10,000
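A small back-of-the-envelope sketch of this formula, using the same rounded figures (1 TB taken as 1,000,000 MB):

public class MapperCountEstimate {
    public static void main(String[] args) {
        long totalDataSizeMb = 1_000_000L; // ~1 TB of input data
        long inputSplitSizeMb = 100L;      // input split size
        long mappers = totalDataSizeMb / inputSplitSizeMb;
        System.out.println("Estimated mappers: " + mappers); // prints 10000
    }
}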
6) How many Reducers run for a MapReduce job in Hadoop?
Reducer takes the set of intermediate key-value pairs produced by the mapper as input, then runs a reduce function on each of them to generate the output. The output of the reducer is the final output, which it stores in HDFS. Usually, in the reducer, we do aggregation or summation-style computation.
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. A good rule of thumb for the number of reducers is:
0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all the reducers can launch immediately and start transferring map outputs as the maps finish.
With 1.75, the faster nodes finish their first round of reduces and then launch a second wave of reduces.
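As a sketch, the 0.95 rule of thumb above could be applied in a job driver like this (the node and container counts are illustrative assumptions, not values from the original text):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        int nodes = 10;                 // assumed cluster size
        int maxContainersPerNode = 8;   // assumed container capacity per node
        int reducers = (int) (0.95 * nodes * maxContainersPerNode); // = 76

        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(reducers); // set the number of reduce tasks for the job
    }
}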
By increasing the number of reducers:
Framework overhead increases
Load balancing improves
The cost of failures is lowered
7) What is the difference between Reducer and Combiner in Hadoop MapReduce?
The Combiner is a mini-reducer that performs a local reduce task. It runs on the map output and its output becomes the input to the reducer. A combiner is usually used for network optimization. The Reducer takes the set of intermediate key-value pairs produced by the mapper as input, then runs a reduce function on each of them to generate the output; the output of the reducer is the final output.
Unlike a reducer, the combiner has a limitation: its input and output key and value types must match the output types of the mapper.
Combiners can operate only on a subset of keys and values, i.e. combiners can only be used for functions that are commutative and associative.
Combiner functions take input from a single mapper, while reducers can take data from multiple mappers as a result of partitioning.
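For example, in the word-count style sketch shown earlier, the Reducer class itself can double as the Combiner because summation is commutative and associative (a hedged sketch, reusing the illustrative WordCountMapper and WordCountReducer class names introduced above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerWiring {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setMapperClass(WordCountMapper.class);     // illustrative mapper from the earlier sketch
        job.setCombinerClass(WordCountReducer.class);  // local, per-mapper aggregation
        job.setReducerClass(WordCountReducer.class);   // final, cluster-wide aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job;
    }
}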
8) What happens if the number of reducers is 0 in Hadoop?
If we set the number of reducers to 0, then no reducer will execute and no aggregation will take place. In such a case, we prefer a "map-only job" in Hadoop. In a map-only job, the map does all the work on its InputSplit and the reducer does nothing; the map output is the final output.
Between the map and reduce phases there is a partition, sort, and shuffle phase, which is responsible for sorting the keys in ascending order and then grouping values with the same key. This phase is very expensive, so if the reduce phase is not required we should avoid it. Avoiding the reduce phase eliminates the sort and shuffle phase as well, which also saves network congestion: during shuffling the output of the mapper travels to the reducer, and when the data size is huge, a large amount of data travels to the reducer.
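A minimal sketch of configuring such a map-only job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-demo");
        job.setNumReduceTasks(0); // no reducers: shuffle/sort are skipped and mapper output is the final output
        // ... set the mapper and input/output paths as usual, then job.waitForCompletion(true);
    }
}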
9) What do you mean by shuffling and sorting in MapReduce?
Shuffling and sorting take place after the completion of the map tasks. The shuffle and sort phases in Hadoop occur simultaneously.
Shuffling- It is the process of transferring data from the mapper to the reducer, i.e. the process by which the system sorts the key-value output of the map tasks and transfers it to the reducer. The shuffle phase is necessary for the reducers; otherwise, they would not have any input. Since shuffling can start even before the map phase has finished, it saves some time and completes the job in less time.
Sorting- The mappers generate the intermediate key-value pairs. Before the reducer starts, the MapReduce framework sorts these key-value pairs by key. Sorting helps the reducer to easily distinguish when a new reduce task should start, which saves time for the reducer.
Shuffling and sorting are not performed at all if you specify zero reducers
(setNumReduceTasks(0)).
10) What is the fundamental difference between a MapReduce InputSplit and HDFS
block?
By definition
Block – A block is a continuous location on the hard drive where HDFS stores data. In general, a FileSystem stores data as a collection of blocks; in the same way, HDFS stores each file as blocks and distributes them across the Hadoop cluster.
InputSplit- An InputSplit represents the data which an individual Mapper will process. The split is further divided into records, and each record (a key-value pair) is processed by the map.
Data representation
Block- It is the physical representation of data.
InputSplit- It is the logical representation of data. During data processing, MapReduce programs and other processing techniques work with InputSplits. The important point is that an InputSplit does not contain the input data itself; it is just a reference to the data.
Size
Block- The default size of the HDFS block is 128 MB, which we can configure as per our requirement. All blocks of a file are the same size except the last block, which can be the same size or smaller. In Hadoop, files are split into 128 MB blocks and then stored in the Hadoop file system.
InputSplit- Split size is approximately equal to block size, by default.
Example
Consider an example where we need to store a file in HDFS. HDFS stores files as blocks, and a block is the smallest unit of data that can be stored on or retrieved from disk. The default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. If we have a file of 130 MB, HDFS will break this file into 2 blocks.
Now, if we run a MapReduce operation directly on the blocks, it will not process correctly, because the 2nd block is incomplete on its own (a record may be cut at the block boundary). InputSplit solves this problem: an InputSplit forms a logical grouping of blocks as a single split, since it includes the location of the next block and the byte offset of the data needed to complete the record.
From this, we can conclude that an InputSplit is only a logical chunk of data, i.e. it holds just the information about block addresses or locations. During MapReduce execution, Hadoop scans through the blocks and creates InputSplits, and each split acts as a broker between the blocks and the mapper.
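As a sketch, the (logical) split size can also be tuned independently of the physical block size through FileInputFormat, which is what ultimately drives the number of mappers; the 128 MB and 256 MB bounds below are illustrative values, not recommendations from the original text:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB lower bound
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB upper bound
    }
}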
11) What is a Speculative Execution in Hadoop MapReduce?
MapReduce breaks jobs into tasks, and these tasks run in parallel rather than sequentially, which reduces the overall execution time. This model of execution is sensitive to slow tasks, as they slow down the overall execution of a job. There are various reasons for a task to slow down, such as hardware degradation, but the causes may be difficult to detect, since the task still completes successfully, just later than expected.
Apache Hadoop doesn't try to diagnose and fix slow-running tasks. Instead, it tries to detect them and runs backup tasks for them. This is called speculative execution in Hadoop, and these backup tasks are called speculative tasks. First, the Hadoop framework launches all the tasks for the job. Then it launches speculative tasks for those tasks that have been running for some time (about a minute) and have not made much progress, on average, compared with the other tasks of the job. If the original task completes before the speculative task, the framework kills the speculative task; on the other hand, it kills the original task if the speculative task finishes first.
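Speculative execution is enabled by default in MRv2 and can be toggled per job through standard configuration properties; a minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.map.speculative", true);     // allow backup tasks for slow maps
        conf.setBoolean("mapreduce.reduce.speculative", false); // disable backups for reduces
        Job job = Job.getInstance(conf, "speculative-demo");
    }
}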
12) How to submit extra files(jars, static files) for MapReduce job during runtime?
The MapReduce framework provides the Distributed Cache to cache files needed by applications. It can cache read-only text files, archives, jar files, etc.
First of all, an application that wants to use the distributed cache to distribute a file should make sure that the file is available at a URL, which can be either hdfs:// or http://. If the file is present at such a URL, the user marks it as a cache file to distribute. The framework copies the cache file to all the nodes before any tasks start on those nodes. The files are copied only once per job, and applications should not modify them.
By default, the size of the distributed cache is 10 GB; we can adjust it using the local.cache.size property.
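A minimal sketch of using the distributed cache through the MRv2 Job API (the HDFS paths below are hypothetical):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        job.addCacheFile(new URI("hdfs:///apps/lookup/countries.txt")); // read-only side data (hypothetical path)
        job.addFileToClassPath(new Path("/apps/libs/extra-lib.jar"));   // extra jar for tasks (hypothetical path)
        // Inside a Mapper/Reducer, the cached URIs are visible via context.getCacheFiles().
    }
}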
JobTracker and TaskTracker in MapReduce
JobTracker and TaskTracker are 2 essential process involved in MapReduce execution in MRv1
(or Hadoop version 1). Both processes are now deprecated in MRv2 (or Hadoop version 2) and
replaced by Resource Manager, Application Master and Node Manager Daemons.
JobTracker –
1. JobTracker process runs on a separate node and not usually on a DataNode.
2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is replaced by
ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives the requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks based on the data locality
(proximity of the data) and the available slots to execute a task on a given node.
6. JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.
8. When the JobTracker is down, HDFS will still be functional, but MapReduce execution cannot be started and the existing MapReduce jobs will be halted.
TaskTracker –
1. TaskTracker runs on DataNodes, usually on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling the
progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive,
JobTracker will assign the task executed by the TaskTracker to another node.
Hadoop Yarn
Apache Yarn – "Yet Another Resource Negotiator" – is the resource management layer of Hadoop. Yarn was introduced in Hadoop 2.x. Yarn allows different data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System). Apart from resource management, Yarn also does job scheduling. Yarn extends the power of Hadoop to other evolving technologies, so they can take advantage of HDFS (the most reliable and popular storage system on the planet) and an economical cluster.
Apache Yarn is also described as the data operating system for Hadoop 2.x. This architecture of Hadoop 2.x provides a general-purpose data processing platform which is not limited to MapReduce. It enables Hadoop to run purpose-built data processing systems other than MapReduce, allowing several different frameworks to run on the same hardware where Hadoop is deployed.
Hadoop Yarn Architecture
In this section of Hadoop Yarn tutorial, we will discuss the complete architecture of Yarn.
The Apache Yarn framework consists of a master daemon known as the "Resource Manager", a slave daemon called the Node Manager (one per slave node), and an Application Master (one per application).
Resource Manager (RM)
It is the master daemon of Yarn. RM manages the global assignments of resources (CPU and
memory) among all the applications. It arbitrates system resources between competing
applications.
Resource Manager has two Main components
Scheduler
Application manager
a) Scheduler
The Scheduler is responsible for allocating resources to the running applications. It is a pure scheduler, meaning it performs no monitoring or tracking of the applications and offers no guarantees about restarting failed tasks, whether the failure is due to an application failure or a hardware failure.
b) Application Manager
It manages running Application Masters in the cluster, i.e., it is responsible for starting
application masters and for monitoring and restarting them on different nodes in case of failures.
Node Manager (NM)
It is the slave daemon of Yarn. The NM is responsible for containers: it monitors their resource usage and reports it to the ResourceManager, and it manages the user processes on the machine. Yarn NodeManager also tracks the health of the node on which it is running. The design also allows plugging long-running auxiliary services into the NM; these are application-specific services, specified as part of the configuration and loaded by the NM during startup. Shuffle is a typical auxiliary service loaded by the NMs for MapReduce applications on YARN.
Application Master (AM)
One Application Master runs per application. It negotiates resources from the Resource Manager, works with the Node Manager, and manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to start the application's individual tasks.
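As a small illustration of the Resource Manager's cluster-wide view described above, a YarnClient can ask the RM for the applications it is currently tracking (a sketch, assuming a reachable cluster configuration is on the classpath):

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();
        List<ApplicationReport> apps = yarnClient.getApplications(); // ask the RM for known applications
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}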