Friday 24 March 2023

Hadoop Streaming

  • Lets you write and run MapReduce jobs in any language, Java or non-Java (R, Python, Ruby, Perl, PHP, C++)
  • Is a utility that ships with the Hadoop distribution; it creates and runs MapReduce jobs with any script or executable as the mapper and/or reducer
  • Uses Unix streams as the interface between Hadoop and the MapReduce program
  • Lets the developer choose the language
  • Both the mapper and the reducer are scripts (for example, Python scripts) that read input from STDIN and write output to STDOUT
  • The utility creates the MapReduce job, submits it to the cluster, and monitors its progress until it completes
  • When a script is specified for the mapper, each mapper task launches the script as a separate process
  • The mapper task converts its input (k,v) pairs from the source file into lines and feeds them to the script's STDIN
  • The mapper task collects the line-oriented output from the script's STDOUT and converts each line into a (k,v) pair (by default, the text up to the first tab is the key and the rest is the value)
  • When a script is specified for the reducer, each reducer task launches the script as a separate process
  • The reducer task converts its input (k,v) pairs into lines and feeds them to the script's STDIN
  • The reducer task gathers the line-oriented output from the script's STDOUT and converts each line into a (k,v) pair
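
The STDIN/STDOUT contract described above can be tried outside Hadoop with a small shell pipeline. This is only a minimal sketch, not part of the original program: mapper and reducer are hypothetical functions, and sort stands in for the shuffle that Hadoop performs between the map and reduce phases. It assumes the default convention that the text before the first tab on a line is the key and the rest is the value.

```shell
#!/bin/bash
# Hypothetical mapper: reads lines on STDIN and writes one "word<TAB>1"
# line per whitespace-separated word to STDOUT.
mapper() {
  awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }'
}

# Hypothetical reducer: expects lines sorted by key ("key<TAB>count")
# and writes one "key<TAB>total" line per distinct key.
reducer() {
  awk -F'\t' '
    $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print key "\t" total }
  '
}

# Local stand-in for the job: map, shuffle (sort by key), reduce.
printf 'ram wants\nto eat ram\n' | mapper | LC_ALL=C sort | reducer
```

For the two sample lines this prints eat 1, ram 2, to 1 and wants 1, one tab-separated pair per line. On a cluster, Hadoop itself feeds the lines and sorts by key; the scripts only ever see STDIN and STDOUT.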
Program on Hadoop Streaming:
Step 1: Create an input.txt file in a folder named stream
hduser@ubuntu:~/stream$ cat input.txt
ram wants
to eat cb
 
Step 2: Create a file mycmd.sh containing the following script, and make it executable (chmod +x mycmd.sh) so that it can be launched as ./mycmd.sh
hduser@ubuntu:~/stream$ cat mycmd.sh
#!/bin/bash
sed -r 's/[ \t]+/\n/g' | sed "s/[^a-zA-Z0-9]//g" | tr "A-Z" "a-z"
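
Before submitting anything to the cluster, a pipeline of this shape can be checked locally. The sketch below wraps it in a hypothetical function split_words and feeds it two sample lines. It assumes GNU sed and assumes the first expression matches spaces as well as tabs ([ \t]+), since that is what breaks the space-separated sample input into one word per line.

```shell
#!/bin/bash
# Same three stages as the mapper script, wrapped in a function
# (split_words is a hypothetical name used only for this local check).
# Stage 1: turn runs of spaces/tabs into newlines (one word per line).
# Stage 2: strip every character that is not a letter or digit.
# Stage 3: lowercase everything.
split_words() {
  sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
}

printf 'Ram wants,\nto eat CB\n' | split_words
```

This prints ram, wants, to, eat and cb, each on its own line, which is the form the reducer needs in order to count words.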


Step 3: Create a directory in HDFS and copy the input.txt file into it using the following commands
hduser@ubuntu:~/stream$ hadoop fs -mkdir /dfs
hduser@ubuntu:~/stream$ hadoop fs -put input.txt /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/input.txt
 
Step 4: Run the job with the Hadoop Streaming jar as follows
hduser@ubuntu:~/stream$ hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -D mapred.reduce.tasks=1 -input /dfs/input.txt -output /dfs/out -mapper ./mycmd.sh -reducer 'uniq -c' -file mycmd.sh
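
The data flow of this job can be approximated on a single machine, which helps when debugging the mapper and reducer commands. The sketch below assumes the job runs with at least one reducer and that the reducer command is uniq -c; mapper and simulate_job are hypothetical names, sort stands in for Hadoop's shuffle, and GNU sed is assumed.

```shell
#!/bin/bash
# Mapper stage: one lowercase alphanumeric word per line.
mapper() {
  sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
}

# Whole job: map, sort by key (what the shuffle does), then count
# adjacent duplicates, which is all that uniq -c does.
simulate_job() {
  mapper | LC_ALL=C sort | uniq -c
}

printf 'ram wants\nto eat ram\n' | simulate_job
```

Each output line is a count followed by a word (here ram is counted twice). uniq -c only works as a reducer because the lines arrive sorted, so identical words are adjacent.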
Step 5: View the output of the streaming job for the given input.txt file using the following commands
hduser@ubuntu:~/stream$ hadoop fs -ls -R /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/out/part-00000
