
Friday, 24 March 2023

Hadoop Streaming

  • Enables you to create and run MapReduce jobs in any language, Java or non-Java (R, Python, Ruby, Perl, PHP, C++)
  • A utility that comes with the Hadoop distribution; it allows you to create and run MapReduce jobs with any script or executable as the mapper/reducer
  • Uses Unix standard streams as the interface between Hadoop and the MapReduce program
  • Leaves the developer free to choose the language
  • The mapper and reducer can be, for example, Python scripts that read input from STDIN and write output to STDOUT; the streaming utility creates the MapReduce job, submits it to the cluster, and monitors its progress until completion (see the sketch after this list)
  • When a script is specified as the mapper, each mapper task launches the script as a separate process
  • The mapper task converts its input (key, value) pairs into lines and feeds them to the script's STDIN
  • The mapper task collects the line-oriented output from the script's STDOUT and converts each line into a (key, value) pair
  • When a script is specified as the reducer, each reducer task launches the script as a separate process
  • The reducer task converts its input (key, value) pairs into lines and feeds them to the script's STDIN
  • The reducer task gathers the line-oriented output from the script's STDOUT and converts each line into a (key, value) pair
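
A minimal sketch of this STDIN/STDOUT contract (illustrative, not part of the original examples below): by default, Streaming treats everything up to the first tab on each line as the key and the rest as the value. An identity mapper that makes the convention explicit:

#!/usr/bin/env python3
# Illustrative identity mapper: reads line-oriented records from STDIN
# and re-emits them as tab-separated key/value lines on STDOUT.
import sys

for line in sys.stdin:
    line = line.rstrip("\n")
    # Everything before the first tab is the key; if a line contains
    # no tab, the whole line becomes the key and the value stays empty,
    # mirroring Streaming's default behaviour.
    key, sep, value = line.partition("\t")
    print(f"{key}\t{value}")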

Program on Hadoop Streaming using Shell Script:
Step 1: Create an input.txt file in a folder named stream
hduser@ubuntu:~/stream$ cat input.txt
ram wants
to eat cb
 
Step 2: Create a file mycmd.sh using the following script
hduser@ubuntu:~/stream$ cat mycmd.sh
#!/bin/bash
# split each line into words (on runs of spaces/tabs), strip non-alphanumeric characters, lowercase
sed -r 's/[ \t]+/\n/g' | sed 's/[^a-zA-Z0-9]//g' | tr 'A-Z' 'a-z'
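
The script can be smoke-tested locally before touching the cluster (a hypothetical check, assuming the script has been made executable):

hduser@ubuntu:~/stream$ chmod +x mycmd.sh
hduser@ubuntu:~/stream$ ./mycmd.sh < input.txt
ram
wants
to
eat
cb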


Step 3: Create a directory in HDFS and then copy the input.txt file into it using the following commands
hduser@ubuntu:~/stream$ hadoop fs -mkdir /dfs
hduser@ubuntu:~/stream$ hadoop fs -put input.txt /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/input.txt
 
Step 4: Run the Hadoop Streaming jar as follows
hduser@ubuntu:~/stream$ hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /dfs/input.txt -output /dfs/out -mapper ./mycmd.sh -reducer 'uniq -c' -file mycmd.sh
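
The job mirrors this local pipeline, where sort stands in for Hadoop's shuffle phase (a hypothetical sanity check, not part of the original post):

hduser@ubuntu:~/stream$ ./mycmd.sh < input.txt | sort | uniq -c
      1 cb
      1 eat
      1 ram
      1 to
      1 wants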
Step 5: Inspect the output of the streaming job using the following commands
hduser@ubuntu:~/stream$ hadoop fs -ls -R /dfs
hduser@ubuntu:~/stream$ hadoop fs -cat /dfs/out/part-00000
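
For the two-line sample input each word occurs once, so part-00000 should contain, roughly:

      1 cb
      1 eat
      1 ram
      1 to
      1 wants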

Program on Hadoop Streaming using Python Code:

Step 1: Create input.txt, mapper.py, and reducer.py files in a folder hdstream (local file system)
echo -e "hello world\nhello hadoop\nhadoop streaming example" > input.txt
                                                  Or
cat > input.txt
hello world
hello hadoop
hadoop streaming example

Program for mapper.py :
#!/usr/bin/env python3   # shebang line: tells the operating system to execute this script using Python 3

import sys   # sys provides access to sys.stdin (standard input) and sys.stdout (standard output)

for line in sys.stdin:            # read input line by line
    line = line.strip()           # remove extra spaces and the trailing newline
    words = line.split()          # split the line into words
    for word in words:            # loop over each word; if the line has 3 words, this runs 3 times
        print(f"{word}\t1")       # emit the key-value pair: the word, a tab, then the count 1

Why \t (tab)? Hadoop expects key-value pairs separated by a tab; it is the default separator in Hadoop Streaming.
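
A quick local check of the mapper (hypothetical; python3 is invoked directly, so the script need not be executable yet):

hdoop@cse:~/hdstream$ python3 mapper.py < input.txt
hello	1
world	1
hello	1
hadoop	1
hadoop	1
streaming	1
example	1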

Program for reducer.py :
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# Hadoop sorts the mapper output by key before it reaches the reducer,
# so all lines for the same word arrive consecutively.
for line in sys.stdin:
    line = line.strip()                    # remove extra spaces and newline characters
    word, count = line.split("\t", 1)      # split key and value at the first tab
    count = int(count)                     # convert count from string to integer

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word:                           # flush the last word (and avoid a NameError on empty input)
    print(f"{current_word}\t{current_count}")

hdoop@cse:~/hdstream$ ls
input.txt  mapper.py  reducer.py

Step 2: Make the mapper.py and reducer.py files executable
hdoop@cse:~/hdstream$ chmod +x mapper.py 
hdoop@cse:~/hdstream$ chmod +x reducer.py 
hdoop@cse:~/hdstream$ ls
input.txt  mapper.py  reducer.py


Step 3: Create a folder rkhdstream in HDFS and then copy input.txt from the local file system into it
hdoop@cse:~/hdstream$ hadoop fs -mkdir /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -put input.txt /rkhdstream

Step 4: Run the Hadoop Streaming command as shown
hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -input /rkhdstream/input.txt -output /rkhdstream/outputrk -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py
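
If the task nodes fail to resolve the scripts via the shebang line, an equivalent invocation names the interpreter explicitly; -files (which replaces the -file option deprecated in Hadoop 3.x) is a generic option and must come before the streaming options. A hypothetical variant:

hdoop@cse:~/hdstream$ hadoop jar /home/hdoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar -files mapper.py,reducer.py -input /rkhdstream/input.txt -output /rkhdstream/outputrk -mapper 'python3 mapper.py' -reducer 'python3 reducer.py'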

Step 5: Check the result
hdoop@cse:~/hdstream$ hadoop fs -ls -R /rkhdstream
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/input.txt
hdoop@cse:~/hdstream$ hadoop fs -cat /rkhdstream/outputrk/part-*
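
The counts should match the local pipeline test from earlier:

example	1
hadoop	2
hello	2
streaming	1
world	1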

