- Enables creating and running MapReduce jobs in any language, Java or non-Java (R, Python, Ruby, Perl, PHP, C++)
- A utility that ships with the Hadoop distribution lets you create and run an MR job with any script as the mapper/reducer
- Uses Unix standard streams as the interface between Hadoop and the MR program
- Lets the developer choose the language
- Both the mapper and the reducer can be scripts (e.g., Python) that read input from STDIN and emit output to STDOUT; the utility creates the M/R job, submits it to the cluster, and monitors its progress until completion
- When a script is specified for the mapper, each mapper task launches the script as a separate process
- The mapper task converts its input (k,v) pairs from the source file into lines and feeds them to the script's STDIN
- The mapper task collects the line-oriented output from the script's STDOUT and converts each line back into a (k,v) pair
- When a script is specified for the reducer, each reducer task launches the script as a separate process
- The reducer task converts its input (k,v) pairs into lines and feeds them to the script's STDIN
- The reducer task gathers the line-oriented output from the script's STDOUT and converts each line back into a (k,v) pair
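Because the interface is just lines on standard streams, even ordinary Unix commands can act as mapper and reducer. A minimal sketch in the spirit of the Hadoop Streaming documentation (the jar path and the HDFS directories myInputDirs/myOutputDir are placeholders): /bin/cat passes each input line through unchanged, and /usr/bin/wc counts the lines, words, and bytes it receives.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc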
Program on Hadoop Streaming using Shell Script:
Step 1: Create an input.txt file in a folder named stream
hduser@ubuntu:~/stream$ cat input.txt
ram wants
to eat cb
ram wants
to eat cb
Step 2: Create a file mycmd.sh containing the following script
hduser@ubuntu:~/stream$ cat mycmd.sh
#!/bin/bash
# split on whitespace, strip non-alphanumeric characters, and lowercase
sed -r 's/[ \t]+/\n/g' | sed "s/[^a-zA-Z0-9]//g" | tr "A-Z" "a-z"
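The script can be sanity-checked locally before submitting the job; sort | uniq -c approximates the cluster's sort-then-reduce step (a local run, not part of the job itself). For the input.txt above it should print something like:
hduser@ubuntu:~/stream$ cat input.txt | bash mycmd.sh | sort | uniq -c
      2 cb
      2 eat
      2 ram
      2 to
      2 wants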
Step 3: Create a directory under HDFS and copy the input.txt file into it using the following commands
hduser@ubuntu:~/stream$ hadoop fs -mkdir /dfs
hduser@ubuntu:~/stream$ hadoop fs -put input.txt /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/input.txt
Step 4: Run the Hadoop Streaming jar as follows. The -file option ships mycmd.sh to the cluster nodes, and Hadoop sorts the mapper output before it reaches the reducer, which is what lets uniq -c count repeated words.
hduser@ubuntu:~/stream$ hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /dfs/input.txt -output /dfs/out -mapper ./mycmd.sh -reducer 'uniq -c' -file mycmd.sh
Step 5: View the output of the streaming job using the following commands
hduser@ubuntu:~/stream$ hadoop fs -lsr /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/out/part-00000
Program on Hadoop Streaming using Python Code:
Step 1: Create input.txt, mapper.py, and reducer.py in a folder hdstream (local file system)
echo -e "hello world\nhello hadoop\nhadoop streaming example" > input.txt
Or
cat > input.txt
hello world
hello hadoop
hadoop streaming example
Program for mapper.py :
#!/usr/bin/env python3      # shebang line: tells the operating system to run this script with Python 3
import sys                  # sys provides sys.stdin (standard input) and sys.stdout (standard output)

for line in sys.stdin:      # read input line by line
    line = line.strip()     # remove extra spaces and the trailing newline
    words = line.split()    # split the line into words
    for word in words:      # loop over each word; if the line has 3 words, the loop runs 3 times
        print(f"{word}\t1") # emit the key-value pair: the word, a tab, and the count 1

Why \t (tab)? Hadoop expects key-value pairs separated by a tab; it is the default separator in Hadoop Streaming.
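The mapper can be tried on its own before the job is submitted (a quick local check; the echoed line stands in for what Hadoop would feed on STDIN):
hdoop@cse:~/hdstream$ echo "hello hadoop hello" | python3 mapper.py
hello	1
hadoop	1
hello	1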
Program for reducer.py :
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key before it reaches the reducer,
# so all occurrences of the same word arrive on consecutive lines
for line in sys.stdin:
    line = line.strip()               # remove extra spaces and newline characters
    word, count = line.split("\t", 1) # split the line into key and value at the first tab
    count = int(count)                # convert count from string to integer
    if current_word == word:
        current_count += count        # same word as the previous line: accumulate its count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")  # a new word starts: emit the finished one
        current_word = word
        current_count = count

if current_word:                      # emit the last word (the loop exits before printing it)
    print(f"{current_word}\t{current_count}")
hdoop@cse:~/hdstream$ ls
input.txt mapper.py reducer.py
Step 2: Make mapper.py and reducer.py executable
hdoop@cse:~/hdstream$ chmod +x mapper.py
hdoop@cse:~/hdstream$ chmod +x reducer.py
hdoop@cse:~/hdstream$ ls
input.txt mapper.py reducer.py
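With both scripts executable, the full chain can be simulated locally; sort stands in for Hadoop's shuffle/sort phase (a sanity check only, the real sorting is done by the framework). For the input.txt above it should print:
hdoop@cse:~/hdstream$ cat input.txt | ./mapper.py | sort | ./reducer.py
example	1
hadoop	2
hello	2
streaming	1
world	1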
Step 3: Create a folder rkhdstream in HDFS and copy input.txt from the local file system into it
hdoop@cse:~/hdstream$ hadoop fs -mkdir /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -put input.txt /rkhdstream
Step 4: Run the Hadoop Streaming command as shown
hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /rkhdstream/input.txt -output /rkhdstream/outputrk -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
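If the worker nodes do not honor the scripts' shebang lines, the interpreter can be named explicitly. A variant of the same command (the output directory outputrk2 is hypothetical, since an output path must not already exist) that also uses -files, the newer replacement for the deprecated -file option:
hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -input /rkhdstream/input.txt -output /rkhdstream/outputrk2 -mapper "python3 mapper.py" -reducer "python3 reducer.py"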
Step 5: Check the result
hdoop@cse:~/hdstream$ hadoop fs -lsr /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/input.txt
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/outputrk/part-*
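For the three input lines above, the part file should contain counts along these lines, matching the local simulation in Step 2:
example	1
hadoop	2
hello	2
streaming	1
world	1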