
Sunday 19 March 2023

Hadoop - Important viva Questions

 1) What is Apache Hadoop?
  • Hadoop emerged as a solution to the “Big Data” problems. It is a part of the Apache project sponsored by the Apache Software Foundation (ASF). It is an open-source software framework for distributed storage and distributed processing of large data sets. Open source means it is freely available and we can even change its source code as per our requirements. Apache Hadoop makes it possible to run applications on systems with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure. Apache Hadoop provides:
  • Storage layer – HDFS
  • Batch processing engine – MapReduce
  • Resource Management Layer – YARN
2) Why do we need Hadoop?
  • Hadoop came into existence to deal with Big Data challenges. The challenges with Big Data are:
  • Storage – Since the data is very large, storing such a huge amount of data is very difficult.
  • Security – Since the data is huge in size, keeping it secure is another challenge.
  • Analytics – In Big Data, most of the time we are unaware of the kind of data we are dealing with. So analyzing that data is even more difficult.
  • Data Quality – In the case of Big Data, data is very messy, inconsistent and incomplete.
  • Discovery – Using a powerful algorithm to find patterns and insights is very difficult.
  • Hadoop is an open-source software framework that supports the storage and processing of large data sets. Apache Hadoop is the best solution for storing and processing Big Data because:
  • Apache Hadoop stores huge files as they are (raw), without specifying any schema.
  • High scalability – We can add any number of nodes, hence enhancing performance dramatically.
  • Reliable – It stores data reliably on the cluster despite machine failure.
  • High availability – In Hadoop data is highly available despite hardware failure. If a machine or hardware crashes, then we can access data from another path.
  • Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive.
3) What are the core components of Hadoop?
  • Hadoop is an open-source software framework for distributed storage and processing of large datasets. Apache Hadoop core components are HDFS, MapReduce, and YARN.
  • HDFS – Hadoop Distributed File System (HDFS) is the primary storage system of Hadoop. HDFS stores very large files running on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files. HDFS stores data reliably even in the case of hardware failure. It provides high-throughput access to applications through parallel access to data.
  • MapReduce – MapReduce is the data processing layer of Hadoop. Users write applications in it that process large structured and unstructured data stored in HDFS. MapReduce processes a huge amount of data in parallel by dividing the submitted job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce works by breaking the processing into two phases: Map and Reduce. The Map is the first phase of processing, where we specify all the complex logic code. Reduce is the second phase of processing, where we specify light-weight processing like aggregation/summation (see the sketch after this list).
  • YARN – YARN is the resource management layer of Hadoop. It provides resource management and allows multiple data processing engines, for example real-time streaming, data science, and batch processing.
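  • To make the Map and Reduce phases concrete, here is a minimal word-count style sketch against the Hadoop MapReduce Java API; the class names and the word-count logic are illustrative choices for this post, not something prescribed by Hadoop itself.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountSketch {

      // Map phase: the per-record "complex" logic, here splitting each line into words.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
          }
        }
      }

      // Reduce phase: light-weight aggregation, summing the counts emitted for each word.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }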
4) What are the Features of Hadoop?
  • The various Features of Hadoop are:
  • Open Source – Apache Hadoop is an open-source software framework. Open source means it is freely available and we can even change its source code as per our requirements.
  • Distributed processing – HDFS stores data in a distributed manner across the cluster, and MapReduce processes the data in parallel on the cluster of nodes.
  • Fault Tolerance – Apache Hadoop is highly fault-tolerant. By default, each block creates 3 replicas across the cluster, and we can change this as per requirement. So if any node goes down, we can recover the data on that node from another node. The framework recovers from node or task failures automatically.
  • Reliability – It stores data reliably on the cluster despite machine failure.
  • High Availability – Data is highly available and accessible despite hardware failure. In Hadoop, when a machine or hardware crashes, then we can access data from another path.
  • Scalability – Hadoop is highly scalable, as one can add new nodes to the cluster.
  • Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive. We do not need any specialized machine for it.
  • Easy to use – The client does not need to deal with distributed computing; the framework takes care of all the things. So it is easy to use.
5) Compare Hadoop and RDBMS?
  • Apache Hadoop is the future of the database because it stores and processes a large amount of data, which is not possible with a traditional database. The differences between Hadoop and an RDBMS are as follows:
  • Architecture – A traditional RDBMS has ACID properties, whereas Hadoop is a distributed computing framework with two main components: a distributed file system (HDFS) and MapReduce.
  • Data acceptance – An RDBMS accepts only structured data, while Hadoop can accept both structured and unstructured data. This is a great feature of Hadoop, as we can store everything in our database and there will be no data loss.
  • Scalability – An RDBMS is a traditional database that provides vertical scalability, so if the data to be stored grows, we have to upgrade the configuration of the particular system. Hadoop provides horizontal scalability, so we just have to add one or more nodes to the cluster when more data needs to be handled.
  • OLTP (Real-time data processing) and OLAP – A traditional RDBMS supports OLTP (real-time data processing). OLTP is not supported in Apache Hadoop; Apache Hadoop supports large-scale batch processing workloads (OLAP).
  • Cost – An RDBMS is licensed software, so we have to pay for it, whereas Hadoop is an open-source framework, so we do not need to pay for the software.
6) What are the modes in which Hadoop run?
  • Apache Hadoop runs in three modes:
  • Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required in the configuration files.
  • Pseudo-Distributed Mode – Just like Standalone mode, Hadoop also runs on a single node in Pseudo-distributed mode. The difference is that each daemon runs in a separate Java process in this mode. In Pseudo-distributed mode, we need to configure all four configuration files described below. In this case, all daemons run on one node and thus both the Master and Slave node are the same.
  • Fully-Distributed Mode – In this mode, all daemons execute on separate nodes forming a multi-node cluster. Thus, it allows separate nodes for Master and Slave.
7) What are the features of Standalone (local) mode?
  • By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. Local mode uses the local file system for input and output operations. One can also use it for debugging purposes. It does not support the use of HDFS. Standalone mode is suitable only for running programs during development and testing. Further, in this mode, no custom configuration is required in the configuration files. The configuration files are:
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml
8) What are the features of Pseudo mode?
  • Just like Standalone mode, Hadoop can also run on a single node in this mode. The difference is that each Hadoop daemon runs in a separate Java process in this mode. In Pseudo-distributed mode, we need to configure all four of the files mentioned above. In this case, all daemons run on one node and thus both the Master and Slave node are the same. Pseudo-distributed mode is suitable for both development and testing environments. In this mode, all the daemons run on the same machine.
9) What are the features of Fully-Distributed mode?
  • In this mode, all daemons execute on separate nodes forming a multi-node cluster, so we allow separate nodes for Master and Slave. We use this mode in the production environment, where n machines form a cluster and the Hadoop daemons run on that cluster of machines. There is one host on which the NameNode runs and other hosts on which DataNodes run. A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode. The ResourceManager manages all these NodeManagers; it receives the processing requests and then passes the parts of each request to the corresponding NodeManagers.
10) What are configuration files in Hadoop?
  • core-site.xml – It contains configuration settings for the Hadoop core, such as I/O settings that are common to HDFS and MapReduce. It specifies the hostname and port; the most commonly used port is 9000.
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
      </property>
    </configuration>
  • hdfs-site.xml – This file contains the configuration settings for HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS.
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>
  • mapred-site.xml – In this file, we specify the framework name for MapReduce by setting the mapreduce.framework.name property.
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
      </property>
    </configuration>
  • yarn-site.xml – This file provides the configuration settings for the NodeManager and the ResourceManager.
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
      </property>
    </configuration>
11) What are the limitations of Hadoop?
  • Various limitations of Hadoop are:
  • Issue with small files – Hadoop is not suited for small files. Small files are a major problem in HDFS. A small file is significantly smaller than the HDFS block size (default 128 MB). If we store a large number of such small files, HDFS cannot handle them well, as HDFS is designed to work with a small number of large files for storing data sets rather than a larger number of small files. A huge number of small files overloads the NameNode, since the NameNode stores the namespace of HDFS. HAR files, sequence files, and HBase overcome the small-files issue.
  • Processing Speed – With its parallel and distributed algorithm, MapReduce processes large data sets. MapReduce performs the Map and Reduce tasks, and it requires a lot of time to perform these tasks, thereby increasing latency. Because data is distributed and processed over the cluster in MapReduce, this increases the time and reduces processing speed.
  • Supports only Batch Processing – Hadoop supports only batch processing. It does not process streamed data, and hence overall performance is slower. The MapReduce framework does not leverage the memory of the cluster to the maximum.
  • Iterative Processing – Hadoop is not efficient for iterative processing, as Hadoop does not support cyclic data flow, that is, a chain of stages in which the input to the next stage is the output from the previous stage.
  • Vulnerable by nature – Hadoop is written entirely in Java, one of the most widely used languages. Java has been heavily exploited by cyber-criminals, which has implicated it in numerous security breaches.
  • Security – Managing a complex application such as Hadoop can be challenging. Hadoop is missing encryption at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.
12) Compare Hadoop 2 and Hadoop 3?
  • In Hadoop 2, the minimum supported version of Java is Java 7, while in Hadoop 3 it is Java 8.
  • Hadoop 2 handles fault tolerance by replication (which wastes space), while Hadoop 3 handles it by erasure coding.
  • For data balancing, Hadoop 2 uses the HDFS balancer, while Hadoop 3 uses the intra-datanode balancer.
  • In Hadoop 2, some default ports are in the Linux ephemeral port range, so at startup they may fail to bind. In Hadoop 3 these ports have been moved out of the ephemeral range.
  • In Hadoop 2, HDFS has 200% overhead in storage space, while Hadoop 3 has 50% overhead in storage space.
  • Hadoop 2 has features to overcome the SPOF (single point of failure), so whenever the NameNode fails, it recovers. Hadoop 3 recovers from the SPOF automatically, with no need for manual intervention.
13) Explain Data Locality in Hadoop?
  • A major drawback of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this drawback, data locality came into the picture. It refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to the computation. Data locality increases the overall throughput of the system. In Hadoop, HDFS stores datasets. Datasets are divided into blocks and stored across the DataNodes in the Hadoop cluster. When a user runs a MapReduce job, the NameNode sends the MapReduce code to the DataNodes on which data related to the MapReduce job is available. Data locality has three categories:
  • Data local – In this category data is on the same node as the mapper working on the data. In such case, the proximity of the data is closer to the computation. This is the most preferred scenario.
  • Intra-rack – In this scenario, the mapper runs on a different node but on the same rack, as it is not always possible to execute the mapper on the same DataNode due to constraints.
  • Inter-rack – In this scenario, the mapper runs on a different rack, as it is not possible to execute the mapper on a node in the same rack due to resource constraints.
14) What is Safemode in Hadoop?
  • Safemode in Apache Hadoop is a maintenance state of the NameNode, during which the NameNode does not allow any modifications to the file system. During Safemode, the HDFS cluster is read-only and does not replicate or delete blocks. At the startup of the NameNode:
  • It loads the file system namespace from the last saved FsImage and the edits log file into its main memory.
  • It merges the edits log file with the FsImage, which results in a new file system namespace.
  • It then receives block reports containing information about block locations from all the DataNodes. In Safemode, the NameNode collects these block reports from the DataNodes. The NameNode enters Safemode automatically during startup and leaves Safemode after the DataNodes have reported that most blocks are available. Use the commands:
  • hadoop dfsadmin -safemode get – to know the status of Safemode
  • bin/hadoop dfsadmin -safemode enter – to enter Safemode
  • hadoop dfsadmin -safemode leave – to come out of Safemode
  • The NameNode front page shows whether Safemode is on or off.
15) What is the problem with small files in Hadoop?
  • Hadoop is not suited for small data. Hadoop HDFS lacks the ability to support the random reading of small files. A small file in HDFS is one smaller than the HDFS block size (default 128 MB). If we store a huge number of such small files, HDFS cannot handle them well: HDFS works with a small number of large files for storing large datasets and is not suitable for a large number of small files. A large number of small files overloads the NameNode, since it stores the namespace of HDFS.
  • Solution:
  • HAR (Hadoop Archive) files deal with the small-file issue. HAR introduces a layer on top of HDFS that provides an interface for file access. Using the Hadoop archive command, we can create HAR files; this runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS; since each HAR file access requires reading two index files as well as the data file, it is slower. Sequence files also deal with the small-file problem: we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB, we can write a program to put them into a single sequence file and then process them in a streaming fashion (a sketch of such a program follows below).
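  • For illustration, below is a minimal sketch of such a program using the Hadoop Java API. The input directory and output path are command-line placeholders; it simply appends each small file to one sequence file with the filename as the key and the raw bytes as the value.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory full of small files
        Path output   = new Path(args[1]);   // single packed sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
          for (FileStatus status : fs.listStatus(inputDir)) {
            byte[] contents = new byte[(int) status.getLen()];
            try (FSDataInputStream in = fs.open(status.getPath())) {
              in.readFully(contents);       // read the whole small file
            }
            // filename as key, file contents as value
            writer.append(new Text(status.getPath().getName()),
                          new BytesWritable(contents));
          }
        }
      }
    }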
16) What is a “Distributed Cache” in Apache Hadoop?
  • In Hadoop, data chunks are processed independently in parallel among the DataNodes, using a program written by the user. If we want to access some files from all the DataNodes, then we put those files into the distributed cache. The MapReduce framework provides the Distributed Cache to cache files needed by applications.
  • It can cache read-only text files, archives, jar files, etc. Once we have cached a file for our job, Hadoop makes it available on each DataNode where map/reduce tasks are running, and we can then access the file from all the DataNodes in our map and reduce tasks. An application which needs to use the distributed cache should make sure that the files are available on URLs; the URLs can be either hdfs:// or http://. If the file is present on the mentioned URL, the user marks it as a cache file for the distributed cache, and the framework copies the cache file to all the nodes before starting the tasks on those nodes. By default, the size of the distributed cache is 10 GB; we can adjust it using local.cache.size (see the sketch below).
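  • As an illustrative sketch (the lookup-file path, the "#lookup" symlink name, the tab-separated file format, and the class names are assumptions for this example), a job can register a read-only file with the distributed cache in the driver and read it locally in the mapper's setup():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CacheExample {

      public static class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
          // The cached file was copied to every node; the "#lookup" fragment below
          // exposes it in the task's working directory under the symlink name "lookup".
          try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
            String line;
            while ((line = reader.readLine()) != null) {
              String[] parts = line.split("\t", 2);
              if (parts.length == 2) lookup.put(parts[0], parts[1]);
            }
          }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          // Enrich every input record from the cached lookup table.
          context.write(value, new Text(lookup.getOrDefault(value.toString(), "UNKNOWN")));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "distributed-cache-example");
        job.setJarByClass(CacheExample.class);
        job.setMapperClass(LookupMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Register the file with the distributed cache before submitting the job.
        job.addCacheFile(new URI("hdfs:///user/data/lookup.txt#lookup"));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }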
17) How is security achieved in Hadoop?
  • Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server (a small client-side sketch follows the list below):
  • Authentication – The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.
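  • For illustration, a minimal client-side sketch of logging in via Kerberos with the Hadoop Java API is shown below; the principal name and keytab path are placeholders, and the cluster-side Kerberos setup (KDC, service principals, core-site.xml settings) is assumed to be in place.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberosLogin {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Tell the Hadoop client to use Kerberos instead of simple authentication.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Authenticate with the KDC using a keytab; this obtains the TGT described above.
        UserGroupInformation.loginUserFromKeytab(
            "hdfsuser@EXAMPLE.COM", "/etc/security/keytabs/hdfsuser.keytab");

        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
      }
    }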
18) Why does one remove or add nodes in a Hadoop cluster frequently?
  • One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster. Another striking feature of Hadoop is the ease with which it scales with the rapid growth in data volume. Hence, for the above reasons, administrators add and remove DataNodes in a Hadoop cluster frequently.
19) What is throughput in Hadoop?
  • The amount of work done per unit time is throughput. HDFS provides good throughput for the reasons below:
  • HDFS follows the Write Once, Read Many model. This simplifies data coherency issues, as data written once cannot be modified, and thus provides high-throughput data access.
  • Hadoop works on the data locality principle. This principle states that computation is moved to the data instead of data to the computation. This reduces network congestion and therefore enhances the overall system throughput.
20) How to restart NameNode or all the daemons in Hadoop?
  • We can restart the NameNode by the following methods:
  • Stop the NameNode individually using the /sbin/hadoop-daemon.sh stop namenode command, then start it using /sbin/hadoop-daemon.sh start namenode.
  • Use /sbin/stop-all.sh and then /sbin/start-all.sh; this stops all the daemons first and then starts all of them again. The sbin directory inside the Hadoop directory stores these script files.
21) What does jps command do in Hadoop?
  • The jps command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons that are running on the machine: NameNode, DataNode, ResourceManager, NodeManager, etc.
22) What are the main hdfs-site.xml properties?
  • hdfs-site.xml – This file contains the configuration settings for HDFS daemons. hdfs-site.xml also specifies the default block replication and permission checking on HDFS. The three main hdfs-site.xml properties are (a sketch of reading them programmatically follows this list):
  • 1. dfs.name.dir gives the location where the NameNode stores the metadata (FsImage and edit logs), and also specifies where DFS should be located – on disk or on a remote directory.
  • 2. dfs.data.dir gives the location of DataNodes where it stores the data.
  • 3. fs.checkpoint.dir is the directory on the file system where the Secondary NameNode stores temporary images of the edit logs. These EditLogs and the FsImage are then merged for backup.
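  • As a small illustration, the snippet below reads these three properties with the Hadoop Configuration API; the path to hdfs-site.xml is a placeholder, and the property names are the ones listed above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class PrintHdfsDirs {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Load the cluster's hdfs-site.xml explicitly (placeholder path).
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

        // The three properties discussed above.
        System.out.println("NameNode metadata dir: " + conf.get("dfs.name.dir"));
        System.out.println("DataNode data dir:     " + conf.get("dfs.data.dir"));
        System.out.println("Checkpoint dir:        " + conf.get("fs.checkpoint.dir"));
      }
    }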

