
Friday, 3 February 2023

Exp 3: Installation of HADOOP in PSEUDO-DISTRIBUTED MODE.

Title:
Installation of HADOOP in PSEUDO-DISTRIBUTED MODE.

Objective:
Pseudo-distributed mode is a mode of operation of Hadoop that runs on a single node (a node is your machine), with each Hadoop daemon running in its own JVM.

Requirements:
Software Requirements:

  • Oracle VirtualBox
  • Ubuntu Desktop OS (64-bit)
  • Hadoop-3.3.4
  • OpenJDK 8
  • SSH

Hardware Requirements:

  • Minimum RAM required: 4GB (Suggested: 8GB)
  • Minimum Free Disk Space: 25GB
  • Minimum Processor: i3 or above

Analysis:

Pseudo-distributed mode stands between standalone mode and the fully distributed mode used on a production-level cluster. It is used to simulate an actual cluster: the master and slave daemons (NameNode and DataNode, ResourceManager and NodeManager) all run on one machine, each as a separate JVM process, giving you a fully fledged test environment. HDFS is used for storage, taking some portion of your disk space, and YARN runs to manage resources on this Hadoop installation.

Installation Procedure In UBUNTU:

1. Open a terminal

2. sudo apt update

Install OpenSSH on Ubuntu

3. sudo apt install openssh-server openssh-client -y

Enable Passwordless SSH for Hadoop User

4. ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

/* use the cat command to append the public key to authorized_keys in the ~/.ssh directory */

5. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

/* set the permissions on the file with the chmod command */

6. chmod 0600 ~/.ssh/authorized_keys

/* the user is now able to SSH to localhost without entering a password every time */

7. ssh localhost

8. logout

Java Installation

9. sudo apt install openjdk-8-jdk

(JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64)
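You can verify the installation and locate the JDK path (the readlink trick below is just one way to find it):

java -version
readlink -f $(which java)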

Installation of HADOOP
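The steps below assume the Hadoop 3.3.4 tarball has already been downloaded to your home directory. If not, it can be fetched from the Apache archive (URL valid at the time of writing):

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz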

/* file extraction */

10. tar -zxvf hadoop-3.3.4.tar.gz

/* creation of the hadoop home directory */
11. sudo mkdir /usr/lib/hadoop3

/* change ownership of the directory to your user (replace username with your login name) */
12. sudo chown username /usr/lib/hadoop3
 

/* Move extracted file to hadoop home directory */
13. sudo mv hadoop-3.3.4/* /usr/lib/hadoop3
 

14. cd /usr/lib/hadoop3
cd /home/username (change directory back to your Ubuntu home)
pwd (print the present working directory)
Setting of paths
 

15. sudo gedit ~/.bashrc
(set HADOOP_PREFIX and JAVA_HOME at the end of the file)
export HADOOP_PREFIX=/usr/lib/hadoop3
export PATH=$PATH:$HADOOP_PREFIX/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
 

/* start a fresh bash so the profile is re-read */
16. exec bash

/* reload the bash profile */
17. source ~/.bashrc
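To confirm that the new paths took effect, hadoop should now resolve from any directory:

hadoop version
(the first line of output should report Hadoop 3.3.4)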
 

18. cd /usr/lib/hadoop3/etc/hadoop
 

19. ls (find hadoop-env.sh)
 

20. sudo gedit hadoop-env.sh (optional)
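If you do edit hadoop-env.sh, the one setting usually worth pinning explicitly is JAVA_HOME, since the Hadoop startup scripts do not always inherit it from ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64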
 

Edit the XML files in hadoop
21. cd /usr/lib/hadoop3/etc/hadoop
 

Configure all of the following files using gedit:
sudo gedit core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
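fs.defaultFS is the NameNode endpoint that every HDFS client will connect to. After saving the file, the value can be sanity-checked with hdfs getconf:

hdfs getconf -confKey fs.defaultFS
(should print hdfs://localhost:9000)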

sudo gedit hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
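Optionally, you can also pin where the NameNode and DataNode keep their data; without these properties Hadoop defaults to directories under /tmp, which may be cleared on reboot. The paths below are only an illustration; pick any directory your user can write to:

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/username/hadoopdata/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/username/hadoopdata/datanode</value>
  </property>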

 

sudo gedit mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

sudo gedit yarn-site.xml

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
 

RUNNING of HADOOP SERVICES
/* format the HDFS filesystem; run this only once, as reformatting wipes the NameNode metadata */
22. hdfs namenode -format

/* starting of the DFS services (run from /usr/lib/hadoop3) - it starts the NameNode, DataNode and Secondary NameNode */
23. sbin/start-dfs.sh

/* starting of the YARN services - it starts the ResourceManager and NodeManager */
24. sbin/start-yarn.sh
25. jps
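If all the daemons came up, jps should report something like the following (the process IDs will differ on your machine):

2945 NameNode
3085 DataNode
3310 SecondaryNameNode
3580 ResourceManager
3721 NodeManager
4050 Jps

You can also browse the NameNode web UI at http://localhost:9870 and the YARN ResourceManager UI at http://localhost:8088.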

Limitations:
Pseudo-distributed mode is only a partial distribution: it runs on a single node (a node is your machine), where the HDFS and YARN services run in individual JVMs but reside on the same system, so it provides no real fault tolerance or horizontal scaling.
 

Conclusion:
The Hadoop daemons run on a local machine, thus simulating a cluster on a small scale. Different Hadoop daemons run in different JVM instances, but on a single machine, and HDFS is used instead of the local FS. For a pseudo-distributed setup, you need to configure at least the following four files, along with JAVA_HOME: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.

Thursday, 2 February 2023

Exp 2: Running a MapReduce WordCount Example in STANDALONE MODE.

Title:
Running a MapReduce WordCount Example in STANDALONE MODE.
 
Objective:
Standalone mode is the default mode of operation of Hadoop; it runs on a single node (a node is your machine) and can be used to run the word count program.
 
Requirements:
Software Requirements:
Oracle VirtualBox
Ubuntu Desktop OS (64-bit)
Hadoop-3.3.4
OpenJDK 8
 
Hardware Requirements:
Minimum RAM required: 4GB (Suggested: 8GB)
Minimum Free Disk Space: 25GB
Minimum Processor: i3 or above

Analysis:
By default, Hadoop is configured to run in non-distributed or standalone mode, as a single Java process. There are no daemons running and everything runs in a single JVM instance; HDFS is not used. We need to create an input directory and provide a sample text file as input, then count the number of occurrences of each word using a MapReduce program in standalone mode.
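A quick way to confirm that no Hadoop daemons are running (as expected in standalone mode) is jps; only the Jps process itself should appear (the PID will differ):

jps
4821 Jps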
 
Flow Chart of MapReduce
The following diagram summarizes the flow of the MapReduce algorithm.
 
Algorithm:
1. The input data is divided into a number of chunks, depending on the amount of data and the processing capacity of each unit.
2. Next, the chunks are passed to the mapper functions. Note that all chunks are processed simultaneously, which is what gives the data processing its parallelism.
3. After that, shuffling happens, which aggregates records with similar keys together.
4. Finally, the reducers combine them all to produce a consolidated output as per the job's logic.
5. This algorithm embraces scalability: depending on the size of the input data, we can keep increasing the number of parallel processing units (see the pipeline sketch after this list).
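As a rough analogy (not Hadoop itself, just an illustration of the same map-shuffle-reduce flow), a word count can be written as a Unix pipeline: tr splits lines into words (map), sort brings identical words together (shuffle), and uniq -c counts each group (reduce):

tr -s ' ' '\n' < input.txt | sort | uniq -c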
 
 
Example:
(figure omitted: MapReduce word count flow, captioned "What is MapReduce in Hadoop? Big Data Architecture")
 

Installation steps:

bda@bda-VirtualBox:~$ ls
Desktop    Downloads         hadoop-3.3.4         Music  output1   Public   Templates
Documents  examples.desktop  hadoop-3.3.4.tar.gz  out2   Pictures  sam.txt  Videos

bda@bda-VirtualBox:~$ cd /usr/lib/hadoop3/

bda@bda-VirtualBox:/usr/lib/hadoop3$ ls
bin  etc  include  lib  libexec  LICENSE-binary  licenses-binary  LICENSE.txt  NOTICE-binary  NOTICE.txt  README.txt  sbin  share

bda@bda-VirtualBox:/usr/lib/hadoop3$ cd bin/

bda@bda-VirtualBox:/usr/lib/hadoop3/bin$ ls
container-executor  hadoop  hadoop.cmd  hdfs  hdfs.cmd  mapred  mapred.cmd  oom-listener  test-container-executor  yarn  yarn.cmd

bda@bda-VirtualBox:/usr/lib/hadoop3/bin$ ./hadoop
                               Or
bda@bda-VirtualBox:/usr/lib/hadoop3$ bin/hadoop
bda@bda-VirtualBox:/usr/lib/hadoop3$ cd

bda@bda-VirtualBox:~$ ls
 
bda@bda-VirtualBox:~$ mkdir mapin
 
bda@bda-VirtualBox:~$ cd mapin
 
bda@bda-VirtualBox:~/mapin$ cat > input.txt
welcome to hadoop
class hadoop is
good hadoop is
bad
 
bda@bda-VirtualBox:~/mapin$ ls
input.txt
bda@bda-VirtualBox:~/mapin$ cat input.txt
welcome to hadoop
class hadoop is
good hadoop is
bad
 
bda@bda-VirtualBox:~$ cd /usr/lib/hadoop3/
bda@bda-VirtualBox:/usr/lib/hadoop3$ /usr/lib/hadoop3/bin/hadoop jar /usr/lib/hadoop3/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep ~/mapin/input.txt ~/mapout 'hadoop[.]*'
 
bda@bda-VirtualBox:/usr/lib/hadoop3$ cd
bda@bda-VirtualBox:~$ ls
Desktop    Downloads         hadoop-3.3.4         mapin   Music     Public   Templates
Documents  examples.desktop  hadoop-3.3.4.tar.gz  mapout  Pictures  sam.txt  Videos
 
bda@bda-VirtualBox:~$ cd mapout/
 
bda@bda-VirtualBox:~/mapout$ ls
part-r-00000  _SUCCESS
 
bda@bda-VirtualBox:~/mapout$ cat part-r-00000
3    hadoop

Output:
The grep example writes its result to ~/mapout/part-r-00000; for the sample input it reports 3 occurrences of hadoop, as shown in the transcript above.
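The examples jar used above also contains the classic wordcount driver, so the experiment in the title can be run as a true word count. Note that the output directory must not already exist; ~/wcout is just an illustrative name:

bda@bda-VirtualBox:/usr/lib/hadoop3$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount ~/mapin ~/wcout

For the sample input.txt above, ~/wcout/part-r-00000 should then contain:

bad	1
class	1
good	1
hadoop	3
is	2
to	1
welcome	1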

Limitations:
Standalone mode is the default mode of operation of Hadoop and it runs on a single node (a node is your machine). HDFS and YARN do not run in standalone mode; the local filesystem is used instead.

Conclusion:

Standalone mode is the default operation of the Hadoop ecosystem, where the Hadoop services run in a single JVM. As this experiment shows, a basic Java installation and extraction of the Hadoop files are sufficient to run the Hadoop services and the MapReduce word count program.

