- Hadoop Streaming lets you create and run MapReduce jobs in any language, Java or non-Java (R, Python, Ruby, Perl, PHP, C++).
- It is a utility that comes with the Hadoop distribution and lets you create and run an MR job with any script or executable as the mapper and/or reducer.
- It uses Unix standard streams as the interface between Hadoop and the MR program.
- The developer is free to choose the language.
- For example, both the mapper and the reducer can be Python scripts that read input from STDIN and write output to STDOUT.
- The utility creates the M/R job, submits the job to the cluster, and monitors the progress of the job until it completes.
- When a script is specified for the mapper, each mapper task launches the script as a separate process.
- The mapper task converts its input (k,v) pairs from the source file into lines and feeds them to the script's STDIN.
- The mapper task collects the line-oriented output from the script's STDOUT and converts each line into a (k,v) pair.
- When a script is specified for the reducer, each reducer task launches the script as a separate process.
- The reducer task converts its input (k,v) pairs into lines and feeds the lines to the script's STDIN.
- The reducer task gathers the line-oriented output from the script's STDOUT and converts each line into a (k,v) pair.
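The STDIN/STDOUT contract described above can be tried out locally without a cluster. The following is only a minimal sketch: the names parse_line, mapper, reducer and run_locally are made up for illustration, the tab separator is the default Hadoop Streaming convention, and the sort inside run_locally stands in for the shuffle that the framework performs between map and reduce.

```python
from itertools import groupby


def parse_line(line):
    """Split one streaming line into (key, value) at the first tab.

    Assumption: the default tab separator; a line without a tab is
    treated as a key with an empty value.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a mapper script would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (parse_line(line) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def run_locally(lines):
    """Stand-in for the framework: map, sort (the shuffle), then reduce."""
    mapped = sorted(mapper(lines))
    return list(reducer(mapped))


# Sample lines taken from the input.txt used later in this post.
sample = ["ram wants", "to eat cb", "ram wants", "to eat cb"]
for out_line in run_locally(sample):
    print(out_line)
```

In a real streaming script the mapper and reducer would live in separate files, each looping over sys.stdin and printing to sys.stdout; they are kept together here only so the flow can be run in one go.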
Program on Hadoop Streaming:
Step 1: Create an input.txt file in a folder named stream
hduser@ubuntu:~/stream$ cat input.txt
ram wants
to eat cb
ram wants
to eat cb
Step 2: Create a file mycmd.sh containing the following script
hduser@ubuntu:~/stream$ cat mycmd.sh
#!/bin/bash
sed -r 's/[ \t]+/\n/g' | sed "s/[^a-zA-Z0-9]//g" | tr "A-Z" "a-z"
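Since the mapper can be written in any language, the same word-splitting step could also be a Python script. The sketch below is an illustration rather than a drop-in replacement: tokenize and main are hypothetical names, it assumes words are separated by spaces or tabs, and unlike the shell pipeline it skips tokens that become empty after cleaning instead of printing blank lines.

```python
import re
import sys

NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")


def tokenize(line):
    """Return the cleaned, lowercased words of one input line."""
    words = []
    for token in re.split(r"[ \t]+", line.strip()):
        cleaned = NON_ALNUM.sub("", token).lower()
        if cleaned:  # skip tokens that were only punctuation or empty
            words.append(cleaned)
    return words


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Streaming entry point: write one word per output line."""
    for line in stdin:
        for word in tokenize(line):
            stdout.write(word + "\n")


# Demonstration on a single line; a real mapper script would call main().
print(tokenize("Ram wants, to eat CB!"))
```

To use something like this as the mapper, the script would end with a call to main() and be passed to -mapper in place of mycmd.sh.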
Step 3: Create a directory in HDFS and then copy the input.txt file into it using the following commands
hduser@ubuntu:~/stream$ hadoop fs -mkdir /dfs
hduser@ubuntu:~/stream$ hadoop fs -put input.txt /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/input.txt
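Step 4: Run the Hadoop streaming jar as follows
hduser@ubuntu:~/stream$ hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -D mapred.reduce.tasks=0 -input /dfs/input.txt -output /dfs/out -mapper ./mycmd.sh -reducer 'uniq -c' -file mycmd.sh
Note: with -D mapred.reduce.tasks=0 the job is map-only, so the reducer is not run and the output is the mapper's word list; set the value to 1 to have uniq -c count the words.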
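The reducer given in the command counts runs of identical adjacent lines, which yields word counts only because the framework sorts the mapper output before the reduce phase. A rough Python equivalent of that counting step is sketched below; uniq_count is a hypothetical name, and the sample list assumes each line reaching the reducer is a single cleaned word.

```python
from itertools import groupby


def uniq_count(lines):
    """Yield (count, line) for each run of identical consecutive lines."""
    for line, run in groupby(lines):
        yield sum(1 for _ in run), line


# Hypothetical reducer input: the mapper's words after the framework's sort.
sorted_words = ["cb", "cb", "eat", "eat", "ram", "ram", "to", "to", "wants", "wants"]
for count, word in uniq_count(sorted_words):
    print(count, word)
```

On unsorted input the same word would be reported once per run instead of once overall, which is why the sort between map and reduce matters.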
Step 5: View the output of the Hadoop streaming job for the given input.txt file using the following commands
hduser@ubuntu:~/stream$ hadoop fs -ls -R /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/out/part-00000