
Friday, 24 March 2023

Hadoop Streaming

  • Enables you to create and run MapReduce jobs in any language, Java or non-Java (R, Python, Ruby, Perl, PHP, C++)
  • A utility that comes with the Hadoop distribution; it allows you to create and run MapReduce jobs with any script or executable as the mapper/reducer
  • Uses Unix standard streams as the interface between Hadoop and the MapReduce program
  • Leaves the developer free to choose the language
  • The mapper and reducer can be, for example, Python scripts that read input from STDIN and write output to STDOUT; the streaming utility creates the MapReduce job, submits it to the cluster, and monitors its progress until completion (see the sketch after this list)
  • When a script is specified as the mapper, each mapper task launches the script as a separate process
  • The mapper task converts its input (key, value) pairs into lines and feeds them to the script's STDIN
  • The mapper task collects the line-oriented output from the script's STDOUT and converts each line into a (key, value) pair
  • When a script is specified as the reducer, each reducer task launches the script as a separate process
  • The reducer task converts its input (key, value) pairs into lines and feeds them to the script's STDIN
  • The reducer task gathers the line-oriented output from the script's STDOUT and converts each line into a (key, value) pair
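
A minimal sketch of this STDIN/STDOUT contract (illustrative, not part of the original examples below): by default, Streaming treats everything up to the first tab on each line as the key and the rest as the value. An identity mapper that makes the convention explicit:

#!/usr/bin/env python3
# Illustrative identity mapper: reads line-oriented records from STDIN
# and re-emits them as tab-separated key/value lines on STDOUT.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    # Everything before the first tab is the key; if a line contains
    # no tab, the whole line becomes the key and the value stays empty,
    # mirroring Streaming's default behaviour.
    key, sep, value = line.partition("\t")
    print(f"{key}\t{value}")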

Program on Hadoop Streaming using Shell Script:
Step 1: Create an input.txt file in a folder named stream
hduser@ubuntu:~/stream$ cat input.txt
ram wants
to eat cb
 
Step 2: Create a file mycmd.sh using the following script
hduser@ubuntu:~/stream$ cat mycmd.sh
#!/bin/bash
# split each line into words (on runs of spaces/tabs), strip non-alphanumeric characters, lowercase
sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
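
The script can be smoke-tested locally before touching the cluster (a hypothetical check, assuming the script has been made executable):

hduser@ubuntu:~/stream$ chmod +x mycmd.sh
hduser@ubuntu:~/stream$ ./mycmd.sh < input.txt
ram
wants
to
eat
cb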


Step 3: Create a directory in HDFS and then copy the input.txt file into it using the following commands
hduser@ubuntu:~/stream$ hadoop fs -mkdir /dfs
hduser@ubuntu:~/stream$ hadoop fs -put input.txt /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/input.txt
 
Step 4: Run the Hadoop Streaming jar as follows
hduser@ubuntu:~/stream$ hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /dfs/input.txt -output /dfs/out -mapper ./mycmd.sh -reducer 'uniq -c' -file mycmd.sh
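
The job mirrors this local pipeline, where sort stands in for Hadoop's shuffle phase (a hypothetical sanity check, not part of the original post):

hduser@ubuntu:~/stream$ ./mycmd.sh < input.txt | sort | uniq -c
      1 cb
      1 eat
      1 ram
      1 to
      1 wants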
Step 5: Inspect the output of the streaming job using the following commands
hduser@ubuntu:~/stream$ hadoop fs -ls -R /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/out/part-00000
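
For the two-line sample input each word occurs once, so part-00000 should contain, roughly:

      1 cb
      1 eat
      1 ram
      1 to
      1 wants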

Program on Hadoop Streaming using Python Code:

Step 1: Create input.txt, mapper.py, and reducer.py files in a folder hdstream (local file system)
echo -e "hello world\nhello hadoop\nhadoop streaming example" > input.txt
                                                  Or
cat > input.txt
hello world
hello hadoop
hadoop streaming example

Program for mapper.py :
#!/usr/bin/env python3   # shebang line: tells the operating system to execute this script using Python 3

import sys   # sys provides access to sys.stdin (standard input) and sys.stdout (standard output)

for line in sys.stdin:            # read input line by line
    line = line.strip()           # remove extra spaces and the trailing newline
    words = line.split()          # split the line into words
    for word in words:            # loop over each word; if the line has 3 words, this runs 3 times
        print(f"{word}\t1")       # emit the key-value pair: the word, a tab, then the count 1

Why \t (tab)? Hadoop expects key-value pairs separated by a tab; it is the default separator in Hadoop Streaming.
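
A quick local check of the mapper (hypothetical; python3 is invoked directly, so the script need not be executable yet):

hdoop@cse:~/hdstream$ python3 mapper.py < input.txt
hello	1
world	1
hello	1
hadoop	1
hadoop	1
streaming	1
example	1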

Program for reducer.py :
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key before it reaches the reducer,
# so all lines for the same word arrive consecutively.
for line in sys.stdin:
    line = line.strip()                    # remove extra spaces and newline characters
    word, count = line.split("\t", 1)      # split key and value at the first tab
    count = int(count)                     # convert count from string to integer

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:                           # flush the last word (and avoid a NameError on empty input)
    print(f"{current_word}\t{current_count}")

hdoop@cse:~/hdstream$ ls
input.txt  mapper.py  reducer.py

Step 2: Make the mapper.py and reducer.py files executable
hdoop@cse:~/hdstream$ chmod +x mapper.py 
hdoop@cse:~/hdstream$ chmod +x reducer.py 
hdoop@cse:~/hdstream$ ls
input.txt  mapper.py  reducer.py


Step 3: Create a folder rkhdstream in HDFS and then copy input.txt from the local file system into it
hdoop@cse:~/hdstream$ hadoop fs -mkdir /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -put input.txt /rkhdstream

Step 4: Run the Hadoop Streaming command as shown
hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /rkhdstream/input.txt -output /rkhdstream/outputrk -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
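
If the task nodes fail to resolve the scripts via the shebang line, an equivalent invocation names the interpreter explicitly; -files (which replaces the -file option deprecated in Hadoop 3.x) is a generic option and must come before the streaming options. A hypothetical variant:

hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -input /rkhdstream/input.txt -output /rkhdstream/outputrk -mapper 'python3 mapper.py' -reducer 'python3 reducer.py'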

Step 5: Check the result
hdoop@cse:~/hdstream$ hadoop fs -ls -R /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/input.txt
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/outputrk/part-*
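
The counts should match the local pipeline test from earlier:

example	1
hadoop	2
hello	2
streaming	1
world	1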

