
Thursday 21 March 2024

Friends-of-friends-Map Reduce program

Program to illustrate FOF Map Reduce:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class FriendCommon {
    // Mapper: reads lines of the form "user -> friend1 friend2 ..." and, for every
    // pair of that user's friends, emits (pair, user).
    public static class FriendMapper extends Mapper<Object, Text, Text, Text> {
        private Text pair = new Text();
        private Text user = new Text();
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String[] line = value.toString().split(" -> ");
            if (line.length == 2) {
                String[] friends = line[1].split(" ");
                // Friend lists are assumed to be sorted, so each pair is always
                // written in the same order (e.g. "A,B", never "B,A").
                for (int i = 0; i < friends.length; i++) {
                    for (int j = i + 1; j < friends.length; j++) {
                        String friendPair = friends[i].trim() + "," + friends[j].trim();
                        pair.set(friendPair);
                        user.set(line[0].trim());
                        context.write(pair, user);
                    }
                }
            }
        }
    }
    // Reducer: collects every user who listed both members of the pair,
    // i.e. the common friends of that pair.
    public static class FriendReducer extends Reducer<Text, Text, Text, Text> {
        private Text commonFriends = new Text();
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            Set<String> userList = new HashSet<>();
            for (Text value : values) {
                userList.add(value.toString());
            }
            // Only pairs with more than one common friend are written out.
            if (userList.size() > 1) {
                List<String> sortedUsers = new ArrayList<>(userList);
                Collections.sort(sortedUsers);
                commonFriends.set(String.join(" ", sortedUsers));
                context.write(key, commonFriends);
            }
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "FriendCommon");
        job.setJarByClass(FriendCommon.class);
        job.setMapperClass(FriendMapper.class);
        job.setReducerClass(FriendReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Steps to run the MapReduce program:

Input File:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E 
E -> B C D
 
hduser@ubuntu:~/fof$ ls
FriendCommon.java  in.txt

hduser@ubuntu:~/fof$ export CLASSPATH=`hadoop classpath`
hduser@ubuntu:~/fof$ echo $CLASSPATH
hduser@ubuntu:~/fof$ javac -d . FriendCommon.java 

hduser@ubuntu:~/fof$ ls
'FriendCommon$FriendMapper.class'  'FriendCommon$FriendReducer.class'   FriendCommon.class   FriendCommon.java   in.txt

hduser@ubuntu:~/fof$ jar -cvf fmr.jar -C /home/hduser/fof .

hduser@ubuntu:~/fof$ ls
 fmr.jar  'FriendCommon$FriendMapper.class'  'FriendCommon$FriendReducer.class'   FriendCommon.class   FriendCommon.java   in.txt

hduser@ubuntu:~/fof$ hadoop fs -mkdir  /fofrk

hduser@ubuntu:~/fof$ hadoop fs -put /home/hduser/fof/in.txt  /fofrk

hduser@ubuntu:~/fof$ hadoop fs -lsr /fofrk
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r--   1 hduser supergroup         62 2024-03-21 11:36 /fofrk/in.txt

hduser@ubuntu:~/fof$ hadoop fs -cat /fofrk/in.txt
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E 
E -> B C D

hduser@ubuntu:~/fof$ hadoop jar fmr.jar FriendCommon /fofrk/in.txt /fofrk/out

hduser@ubuntu:~/fof$ hadoop fs -lsr /fofrk
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r--   1 hduser supergroup         62 2024-03-21 11:36 /fofrk/in.txt
drwxr-xr-x   - hduser supergroup          0 2024-03-21 11:40 /fofrk/out
-rw-r--r--   1 hduser supergroup          0 2024-03-21 11:40 /fofrk/out/_SUCCESS
-rw-r--r--   1 hduser supergroup         88 2024-03-21 11:40 /fofrk/out/part-r-00000

hduser@ubuntu:~/fof$ hadoop fs -cat /fofrk/in.txt
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E 
E -> B C D

hduser@ubuntu:~/fof$ hadoop fs -cat /fofrk/out/part-r-00000
A,B C D
A,C B D
A,D B C
A,E B C D
B,C A D E
B,D A C E
B,E C D
C,D A B E
C,E B D
D,E B C
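
To check the job's logic without a cluster, the map and reduce steps can be simulated in plain Java. This is only a sketch: the class name FriendCommonLocal and the method commonFriends are hypothetical, no Hadoop classes are used, and a TreeMap stands in for Hadoop's shuffle and sort.

```java
import java.util.*;

// Hypothetical local stand-in for the FriendCommon job (not part of the original program).
class FriendCommonLocal {
    // Returns pair -> space-separated users who list both members of the pair as friends.
    static Map<String, String> commonFriends(List<String> lines) {
        // "Map" phase: for each user, record (friendPair, user).
        // The TreeMap groups and sorts by key, playing the role of the shuffle.
        Map<String, Set<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" -> ");
            if (parts.length != 2) {
                continue;
            }
            String user = parts[0].trim();
            String[] friends = parts[1].trim().split(" ");
            for (int i = 0; i < friends.length; i++) {
                for (int j = i + 1; j < friends.length; j++) {
                    String pair = friends[i] + "," + friends[j];
                    grouped.computeIfAbsent(pair, k -> new TreeSet<>()).add(user);
                }
            }
        }
        // "Reduce" phase: keep only pairs with more than one user, as FriendReducer does.
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            if (e.getValue().size() > 1) {
                result.put(e.getKey(), String.join(" ", e.getValue()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
            "A -> B C D", "B -> A C D E", "C -> A B D E", "D -> A B C E ", "E -> B C D");
        commonFriends(input).forEach((pair, users) -> System.out.println(pair + "\t" + users));
    }
}
```

Run on the five input lines, this should list the same ten pairs as part-r-00000.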
 

Wednesday 13 December 2023

Display your own text on the Ubuntu terminal

$ figlet -f slant "Big Data" -c | lolcat && figlet -f digital -c "Drive into DataScience" | lolcat 

How to display the above text on terminal:


Useful commands:

How to list the home directories of the users on the system:

csedept@cse:~$ ls /home
csedept  hdoop

csedept@cse:~$ ls -l /home/hdoop/
total 16
-rw-r--r--  1 hdoop hdoop 8980 Dec 18 16:26 examples.desktop
drwxr-xr-x 14 hdoop hdoop 4096 Dec 18 16:55 hadoop

How to open a root shell from the terminal:
csedept@cse:~$ sudo -s
root@cse:/home/csedept# 

How to use the id command on the terminal:
csedept@cse:~$ id
uid=1000(csedept) gid=1000(csedept) groups=1000(csedept),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),116(lpadmin),126(sambashare)
csedept@cse:~$ id -u
1000
csedept@cse:~$ id -n -u
csedept

How to add a new user using the adduser command:
csedept@cse:~$ sudo adduser duck
[sudo] password for csedept: 
Adding user `duck' ...
Adding new group `duck' (1001) ...
Adding new user `duck' (1001) with group `duck' ...
Creating home directory `/home/duck' ...
Copying files from `/etc/skel' ...
New password: 
Retype new password: 
passwd: password updated successfully
Changing the user information for duck
Enter the new value, or press ENTER for the default
Full Name []:duck 
Room Number []: 
Work Phone []: 
Home Phone []: 
Other []: 
Is the information correct? [Y/n] y
csedept@cse:~$ ls /home/
csedept  duck  hdoop

How to switch users using the su command:
csedept@cse:~$ su - duck
Password: 
duck@cse:~$ logout
csedept@cse:~$ 


Friday 21 April 2023

Friends-of-friends

Social network sites such as LinkedIn and Facebook use the FoF algorithm to help users broaden their networks.

The key ingredient to success with this approach is to order the FoFs by the number of common friends, which increases the chances that the user knows the FoF.


1. Example: How to implement the FoF algorithm in MapReduce.

Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
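
A minimal single-process sketch of what the two jobs compute together, assuming the friend lists fit in an in-memory map. The class name FofRanker and the method rank are hypothetical; a real implementation would split the two steps into separate MapReduce jobs. The sample data is the five-user network used in this post.

```java
import java.util.*;

// Hypothetical in-memory sketch of the two-step FoF calculation.
class FofRanker {
    // Returns, for each user, "fof:count" entries ordered by descending number of common friends.
    static Map<String, List<String>> rank(Map<String, Set<String>> friends) {
        Map<String, List<String>> ranked = new TreeMap<>();
        for (String user : friends.keySet()) {
            // Step 1 (first job): count the common friends shared with each candidate FoF.
            Map<String, Integer> counts = new TreeMap<>();
            for (String friend : friends.get(user)) {
                for (String fof : friends.getOrDefault(friend, Collections.<String>emptySet())) {
                    boolean alreadyFriend = friends.get(user).contains(fof);
                    if (!fof.equals(user) && !alreadyFriend) {
                        counts.merge(fof, 1, Integer::sum);
                    }
                }
            }
            // Step 2 (second job): sort the candidates by count, highest first.
            List<String> order = new ArrayList<>(counts.keySet());
            order.sort((a, b) -> counts.get(b) - counts.get(a));
            List<String> out = new ArrayList<>();
            for (String fof : order) {
                out.add(fof + ":" + counts.get(fof));
            }
            ranked.put(user, out);
        }
        return ranked;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> friends = new TreeMap<>();
        friends.put("A", new TreeSet<>(Arrays.asList("B", "C", "D")));
        friends.put("B", new TreeSet<>(Arrays.asList("A", "C", "D", "E")));
        friends.put("C", new TreeSet<>(Arrays.asList("A", "B", "D", "E")));
        friends.put("D", new TreeSet<>(Arrays.asList("A", "B", "C", "E")));
        friends.put("E", new TreeSet<>(Arrays.asList("B", "C", "D")));
        System.out.println(rank(friends));
    }
}
```

In this network only A and E are not yet friends, so each is suggested to the other with three common friends (B, C and D).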



2. Example: How to implement the FoF algorithm in MapReduce.

A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E 
E -> B C D

Mapper 1 output of A -> B C D
(A,B) -> (B C D)
(A,C) -> (B C D)
(A,D) -> (B C D)

Mapper 2 output of B -> A C D E
(A,B) -> (A C D E)
(B,C) -> (A C D E)
(B,D) -> (A C D E)
(B,E) -> (A C D E)

Mapper 3 output of C -> A B D E
(A,C) -> (A B D E)
(B,C) -> (A B D E)
(C,D) -> (A B D E)
(C,E) -> (A B D E)

Mapper 4 output of D -> A B C E 
(A,D) -> (A B C E)
(B,D) -> (A B C E)
(C,D) -> (A B C E)
(D,E) -> (A B C E)

Mapper 5 output of E -> B C D
(B,E) -> (B C D)
(C,E) -> (B C D)
(D,E) -> (B C D)
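
The mapper outputs above follow one rule: for each friend of a user, emit the alphabetically ordered (user, friend) pair as the key and the user's whole friend list as the value. A small sketch of that rule in plain Java, with a hypothetical class name PairEmitter (this is the step of the worked example, not the FriendCommon program):

```java
import java.util.*;

// Hypothetical helper mirroring the mapper step of the worked example.
class PairEmitter {
    // For "U -> F1 F2 ...", returns one "(X,Y) -> (F1 F2 ...)" record per friend,
    // with X and Y in alphabetical order so both users produce the same key.
    static List<String> emit(String line) {
        String[] parts = line.split(" -> ");
        String user = parts[0].trim();
        String friendList = parts[1].trim();
        List<String> records = new ArrayList<>();
        for (String friend : friendList.split(" ")) {
            String first = user.compareTo(friend) < 0 ? user : friend;
            String second = user.compareTo(friend) < 0 ? friend : user;
            records.add("(" + first + "," + second + ") -> (" + friendList + ")");
        }
        return records;
    }

    public static void main(String[] args) {
        emit("B -> A C D E").forEach(System.out::println);
    }
}
```

Ordering the pair is what lets the records from A's line and from B's line meet under the same key (A,B) in the shuffle.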

Shuffle Or Group:
(A,B) -> (A C D E) 
(A,B) -> (B C D)
-------------------------
(A,B) -> (A C D E) (B C D)

(A,C) -> (B C D)
(A,C) -> (A B D E)
-------------------------
(A,C) -> (A B D E) (B C D)

(A,D) -> (B C D)
(A,D) -> (A B C E)
-------------------------
(A,D) -> (A B C E) (B C D)

(B,C) -> (A C D E)
(B,C) -> (A B D E)
-------------------------
(B,C) -> (A B D E) (A C D E)

(B,D) -> (A C D E)
(B,D) -> (A B C E)
-------------------------
(B,D) -> (A B C E) (A C D E)

(B,E) -> (A C D E)
(B,E) -> (B C D)
-------------------------
(B,E) -> (A C D E) (B C D)

(C,D) -> (A B D E)
(C,D) -> (A B C E)
-------------------------
(C,D) -> (A B D E) (A B C E)

(C,E) -> (A B D E)
(C,E) -> (B C D)
-------------------------
(C,E) -> (A B D E) (B C D)

(D,E) -> (A B C E)
(D,E) -> (B C D)
-------------------------
(D,E) -> (A B C E) (B C D)

Reducer (intersect the two friend lists grouped under each pair):
(A,B) -> (A C D E) (B C D)
The common friends are
(A,B) -> (C D)

(A,C) -> (A B D E) (B C D)
The common friends are
(A,C) -> (B D)

(A,D) -> (A B C E) (B C D)
The common friends are
(A,D) -> (B C)

(B,C) -> (A B D E) (A C D E)
The common friends are
(B,C) -> (A D E)

(B,D) -> (A B C E) (A C D E)
The common friends are
(B,D) -> (A C E)

(B,E) -> (A C D E) (B C D)
The common friends are
(B,E) -> (C D)

(C,D) -> (A B D E) (A B C E)
The common friends are
(C,D) -> (A B E)

(C,E) -> (A B D E) (B C D)
The common friends are
(C,E) -> (B D)

(D,E) -> (A B C E) (B C D)
The common friends are
(D,E) -> (B C)
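
The reducer step amounts to a set intersection of the two friend lists that arrive under one pair key. A minimal sketch, assuming each list is a space-separated string; the class name CommonFriendsReducer and the method intersect are hypothetical:

```java
import java.util.*;

// Hypothetical helper showing the reducer's intersection step.
class CommonFriendsReducer {
    // Returns the friends that appear in both space-separated lists, in sorted order.
    static List<String> intersect(String listA, String listB) {
        Set<String> common = new TreeSet<>(Arrays.asList(listA.trim().split(" ")));
        common.retainAll(Arrays.asList(listB.trim().split(" ")));
        return new ArrayList<>(common);
    }

    public static void main(String[] args) {
        // The two lists grouped under the key (A,B).
        System.out.println(intersect("A C D E", "B C D")); // prints [C, D]
    }
}
```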

Saturday 15 April 2023

Example - Shortest Path - Dijkstra's algorithm

Relaxation:
If d(u) + c(u,v) < d(v), then
d(v) = d(u) + c(u,v)
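
The relaxation rule can be sketched as a small Java method. The class name Relaxation and the method relax are hypothetical, and a missing map entry is treated as an infinite distance. The edge costs in main are inferred from the distance table in this post, not stated in it.

```java
import java.util.*;

// Hypothetical helper for the relaxation step.
class Relaxation {
    // Applies the rule to edge (u, v) with cost c; returns true if d(v) was lowered.
    static boolean relax(Map<String, Integer> d, String u, String v, int c) {
        int du = d.getOrDefault(u, Integer.MAX_VALUE);
        if (du == Integer.MAX_VALUE) {
            return false; // u has not been reached yet, so there is nothing to relax
        }
        int candidate = du + c;
        if (candidate < d.getOrDefault(v, Integer.MAX_VALUE)) {
            d.put(v, candidate);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Integer> d = new TreeMap<>();
        d.put("A", 0);           // source
        relax(d, "A", "B", 2);   // assumed cost c(A,B) = 2
        relax(d, "A", "C", 4);   // assumed cost c(A,C) = 4
        relax(d, "B", "C", 1);   // assumed cost c(B,C) = 1, lowers d(C) from 4 to 3
        System.out.println(d);
    }
}
```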

Source   A   B   C   D   E   F
A        0
B            2   4
C                3   6   4
E                    6   4
D                    6       6
F                            6


Shortest path from A to F, traced back from F using the above table: F E B A (cost 6)
Shortest path from A to E, traced back from E using the above table: E B A (cost 4)
Shortest path from A to C, traced back from C using the above table: C B A (cost 3)
Shortest path from A to D, traced back from D using the above table: D B A (cost 6)


Friday 14 April 2023

Page Rank Algorithm using Directed Graph

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

  • PageRank is a function that assigns a real number to each page in the Web (or at least to that portion of the Web that has been crawled and its links discovered). 
  • The intent is that the higher the PageRank of a page, the more “important” it is. 
  • The most popular model is known as the Random Surfer Model.

Consider a directed graph with 4 nodes A, B, C and D, in which D is a dangling node.

For example: consider a directed graph with 3 nodes A, B and C.
Iteration 1:

Iteration 2:

The output of all the iterations in one table:
 
Trick to find the page rank:
For the graph above with nodes A, B and C:

The page rank of a node is the sum, over every node that links to it, of
(page rank of the linking node) / (outdegree of the linking node)

Page rank of A = C / 1
Page rank of B = A / 2
Page rank of C = A/2 + B/1

Iteration   A     B     C
0           1     1     1
1           1     0.5   1.5
2           1.5   0.5   1
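
The table can be reproduced with a short Java sketch of this simplified update (no damping factor). The class name SimplePageRank is hypothetical, and the edge list A -> B, A -> C, B -> C, C -> A is inferred from the three formulas above rather than taken from the original figure.

```java
import java.util.*;

// Hypothetical sketch of the simplified PageRank iteration (no damping factor).
class SimplePageRank {
    // One synchronous update: every node passes rank / outdegree to each node it links to.
    static Map<String, Double> step(Map<String, List<String>> outLinks, Map<String, Double> rank) {
        Map<String, Double> next = new TreeMap<>();
        for (String node : outLinks.keySet()) {
            next.put(node, 0.0);
        }
        for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue()) {
                next.merge(target, share, Double::sum);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new TreeMap<>();
        graph.put("A", Arrays.asList("B", "C")); // assumed edges, inferred from the formulas
        graph.put("B", Arrays.asList("C"));
        graph.put("C", Arrays.asList("A"));
        Map<String, Double> rank = new TreeMap<>();
        for (String node : graph.keySet()) {
            rank.put(node, 1.0); // iteration 0
        }
        for (int i = 1; i <= 2; i++) {
            rank = step(graph, rank);
            System.out.println("Iteration " + i + ": " + rank);
        }
    }
}
```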

Thursday 13 April 2023

Shortest Path Algorithm- Dijkstra algorithm

Dijkstra's algorithm is a single-source shortest path algorithm. Single-source means that only one source is given, and we have to find the shortest path from that source to all the other nodes.
First, we choose any vertex as the source vertex. Suppose we choose vertex 0.
We set the distance of the source vertex 0 to 0 and the distance to all the other vertices to infinity, because initially we do not know the distances. First, we find the vertices that are directly connected to vertex 0. In the graph, two vertices (1 and 4) are directly connected to vertex 0.
 
Let's assume that the vertex 0 is represented by 'x' and the vertex 1 is represented by 'y'. The distance between the vertices can be calculated by using the below formula:

d(x, y) = d(x) + c(x, y)  < d(y)  
= (0 + 4) < ∞  
= 4 < ∞ 
Since 4 < ∞, we update d(y), i.e., d(1), from ∞ to 4.
Therefore, the general rule for updating the distance of a vertex v reached from a vertex u is:
if (d(u) + c(u, v) < d(v))
    d(v) = d(u) + c(u, v)
Now we again take vertex 0 as 'x' and consider vertex 4 as 'y'.

d(x, y) = d(x) + c(x, y)  < d(y)  
= (0 + 8) < ∞  
= 8 < ∞  
Therefore, the value of d(y) is 8. We replace the infinity value of vertices 1 and 4 with the values 4 and 8 respectively. We now have tentative distances from vertex 0 to vertices 1 and 4, so vertex 0 is marked as selected. Next, we compare all the vertices except vertex 0. Vertex 1 has the lowest value, i.e., 4, so vertex 1 is selected.
Since vertex 1 is selected, we consider the paths from 1 to 2 and from 1 to 4. We do not consider the path from 1 to 0, as vertex 0 is already selected.
First, we calculate the distance between the vertex 1 and 2. Consider the vertex 1 as 'x', and the vertex 2 as 'y'.

d(x, y) = d(x) + c(x, y) < d(y)  
= (4 + 8) < ∞  
= 12 < ∞  
Since 12<∞ so we will update d(2) from ∞ to 12.

Now, we calculate the distance between the vertex 1 and vertex 4. Consider the vertex 1 as 'x' and the vertex 4 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (4 + 11) < 8  
= 15 < 8  
Since 15 is not less than 8, we will not update d(4); it stays 8.
So far, two nodes have been selected, i.e., 0 and 1. Now we compare the remaining nodes. Node 4 has the minimum distance, i.e., 8. Therefore, vertex 4 is selected.
Since vertex 4 is selected, we consider all the direct paths from vertex 4. These are 4 to 0, 4 to 1, 4 to 8, and 4 to 5. Since vertices 0 and 1 have already been selected, we consider only two vertices, i.e., 8 and 5.
First, we consider vertex 8 and calculate the distance between vertex 4 and 8. Consider the vertex 4 as 'x', and the vertex 8 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (8 + 7) < ∞  
= 15 < ∞  
Since 15 is less than infinity, we update d(8) from infinity to 15.
Now, we consider vertex 5 and calculate the distance between vertex 4 and 5. Consider the vertex 4 as 'x', and the vertex 5 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (8 + 1) < ∞  
= 9 < ∞  
Since 9 is less than infinity, we update d(5) from infinity to 9.

Till now, three nodes have been selected, i.e., 0, 1, and 4. Now we have to compare the nodes except the nodes 0, 1 and 4. The node 5 has the minimum value, i.e., 9. Therefore, vertex 5 is selected.
Since the vertex 5 is selected, so we will consider all the direct paths from vertex 5. The direct paths from vertex 5 are 5 to 8, and 5 to 6.
First, we consider the vertex 8. First, we calculate the distance between the vertex 5 and 8. Consider the vertex 5 as 'x', and the vertex 8 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (9 + 15) < 15  
= 24 < 15  
Since 24 is not less than 15 so we will not update the value d(8) from 15 to 24.
Now, we consider the vertex 6. First, we calculate the distance between the vertex 5 and 6. Consider the vertex 5 as 'x', and the vertex 6 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (9 + 2) < ∞  
= 11 < ∞  
Since 11 is less than infinity, we update d(6) from infinity to 11.
Till now, nodes 0, 1, 4 and 5 have been selected. We will compare the nodes except the selected nodes. The node 6 has the lowest value as compared to other nodes. Therefore, vertex 6 is selected.
Since vertex 6 is selected, we consider all the direct paths from vertex 6. The direct paths from vertex 6 are 6 to 2, 6 to 3, and 6 to 7.
First, we consider the vertex 2. Consider the vertex 6 as 'x', and the vertex 2 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (11 + 4) < 12  
= 15 < 12  
Since 15 is not less than 12, we will not update d(2) from 12 to 15
Now we consider the vertex 3. Consider the vertex 6 as 'x', and the vertex 3 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (11 + 14) < ∞  
= 25 < ∞  
Since 25 is less than ∞, so we will update d(3) from ∞ to 25.
Now we consider the vertex 7. Consider the vertex 6 as 'x', and the vertex 7 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (11 + 10) < ∞  
= 21 < ∞  
Since 21 is less than ∞, we will update d(7) from ∞ to 21.

So far, nodes 0, 1, 4, 5, and 6 have been selected. Now we compare all the unvisited nodes, i.e., 2, 3, 7, and 8. Node 2 has the minimum value, i.e., 12, among the unvisited nodes. Therefore, node 2 is selected.
Since node 2 is selected, we consider all the direct paths from node 2. These are 2 to 8, 2 to 6, and 2 to 3.
First, we consider the vertex 8. Consider the vertex 2 as 'x' and 8 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (12 + 2) < 15  
= 14 < 15  
Since 14 is less than 15, we will update d(8) from 15 to 14.
Now, we consider the vertex 6. Consider the vertex 2 as 'x' and 6 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (12 + 4) < 11  
= 16 < 11  
Since 16 is not less than 11 so we will not update d(6) from 11 to 16.
Now, we consider the vertex 3. Consider the vertex 2 as 'x' and 3 as 'y'.
d(x, y) = d(x) + c(x, y) < d(y)  
= (12 + 7) < 25  
= 19 < 25  
Since 19 is less than 25, we will update d(3) from 25 to 19.

So far, nodes 0, 1, 2, 4, 5, and 6 have been selected. We compare all the unvisited nodes, i.e., 3, 7, and 8. Among them, node 8 has the minimum value, i.e., 14, so node 8 is selected. The nodes directly connected to node 8 are 2, 4, and 5. Since all of them are already selected, no distance is updated.
The unvisited nodes are now 3 and 7. Of these, node 3 has the minimum value, i.e., 19. Therefore, node 3 is selected. The nodes directly connected to node 3 are 2, 6, and 7. Since nodes 2 and 6 have already been selected, we do not consider them; only node 7 remains.

Now, we consider the vertex 7. Consider the vertex 3 as 'x' and 7 as 'y'.
d(x, y) = d(x) + c(x, y)  < d(y)  
= (19 + 9) < 21  
= 28 < 21  
Since 28 is not less than 21, we will not update d(7); it stays 21.
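
The whole walkthrough can be checked with a short Java sketch of Dijkstra's algorithm using a priority queue. The class name DijkstraSketch is hypothetical. The edge list is assembled from the costs quoted in the steps above (including c(5, 8) = 15), and the edges are assumed to be undirected.

```java
import java.util.*;

// Hypothetical sketch; the edge list is assembled from the costs quoted in the walkthrough.
class DijkstraSketch {
    // Returns the shortest distance from source to every vertex 0..n-1.
    // Each edge is {u, v, cost} and is treated as undirected.
    static int[] shortest(int n, int[][] edges, int source) {
        List<List<int[]>> adj = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            adj.add(new ArrayList<>());
        }
        for (int[] e : edges) {
            adj.get(e[0]).add(new int[]{e[1], e[2]});
            adj.get(e[1]).add(new int[]{e[0], e[2]});
        }
        int[] d = new int[n];
        Arrays.fill(d, Integer.MAX_VALUE); // "infinity"
        d[source] = 0;
        boolean[] selected = new boolean[n];
        // Queue entries are {vertex, distance}; the smallest distance comes out first.
        PriorityQueue<int[]> pq = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        pq.add(new int[]{source, 0});
        while (!pq.isEmpty()) {
            int u = pq.poll()[0];
            if (selected[u]) {
                continue; // stale entry for a vertex that was already selected
            }
            selected[u] = true;
            for (int[] edge : adj.get(u)) {
                int v = edge[0];
                int c = edge[1];
                // Relaxation: if d(u) + c(u, v) < d(v), update d(v).
                if (!selected[v] && d[u] + c < d[v]) {
                    d[v] = d[u] + c;
                    pq.add(new int[]{v, d[v]});
                }
            }
        }
        return d;
    }

    public static void main(String[] args) {
        int[][] edges = {
            {0, 1, 4}, {0, 4, 8}, {1, 2, 8}, {1, 4, 11}, {4, 8, 7}, {4, 5, 1}, {5, 8, 15},
            {5, 6, 2}, {6, 2, 4}, {6, 3, 14}, {6, 7, 10}, {2, 8, 2}, {2, 3, 7}, {3, 7, 9}
        };
        // Prints [0, 4, 12, 19, 8, 9, 11, 21, 14]
        System.out.println(Arrays.toString(shortest(9, edges, 0)));
    }
}
```

The printed array holds d(0) to d(8) and agrees with the values reached step by step above.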

Tuesday 11 April 2023

Chaining MapReduce jobs - Joining data from different sources - Data Flow of Reduce side join:

In data analysis, we often need to combine data from two or more different sources.

For example, let's take two comma-separated files:
1. Customers file with three fields: Customer ID, Name, and Phone Number. We put four records in the file for illustration.

C_ID   Name    Phone_No
1      Ram     8977101699
2      Rani    8977101688
3      Vani    8977101677
4      Dhoni   8977101666

2. Order file with four fields: Customer ID, Order ID, Price, and Purchase Date.

C_ID   O_ID   Price   Date
3      A      100     11-05-2020
1      B      200     17-06-2021
2      C      300     19-02-2020
3      D      400     27-06-2021

If we want an inner join of the two data sets above, the desired output would look as listed below

C_ID   Name   Phone_No     O_ID   Price   Date
1      Ram    8977101699   B      200     17-06-2021
2      Rani   8977101688   C      300     19-02-2020
3      Vani   8977101677   A      100     11-05-2020
3      Vani   8977101677   D      400     27-06-2021

Data Flow of Reduce side join:
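
The data flow can be sketched in plain Java without Hadoop: the map phase tags each record with its source and keys it by C_ID, the shuffle groups records by key, and the reduce phase pairs every customer record with every order record of the same key. The class name ReduceSideJoinSketch and the tags CUST and ORD are hypothetical, and a TreeMap stands in for the shuffle.

```java
import java.util.*;

// Hypothetical in-memory sketch of a reduce-side inner join on C_ID.
class ReduceSideJoinSketch {
    // customers: "C_ID,Name,Phone_No"   orders: "C_ID,O_ID,Price,Date"
    static List<String> join(List<String> customers, List<String> orders) {
        // Map phase: tag each record with its source and key it by C_ID.
        // The TreeMap plays the role of the shuffle that groups records by key.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String c : customers) {
            String[] f = c.split(",", 2);
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add(new String[]{"CUST", f[1]});
        }
        for (String o : orders) {
            String[] f = o.split(",", 2);
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>()).add(new String[]{"ORD", f[1]});
        }
        // Reduce phase: for each key, combine every customer record with every order record.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : grouped.entrySet()) {
            List<String> cust = new ArrayList<>();
            List<String> ord = new ArrayList<>();
            for (String[] tagged : e.getValue()) {
                if (tagged[0].equals("CUST")) {
                    cust.add(tagged[1]);
                } else {
                    ord.add(tagged[1]);
                }
            }
            for (String c : cust) {
                for (String o : ord) {
                    joined.add(e.getKey() + "," + c + "," + o);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> customers = Arrays.asList(
            "1,Ram,8977101699", "2,Rani,8977101688", "3,Vani,8977101677", "4,Dhoni,8977101666");
        List<String> orders = Arrays.asList(
            "3,A,100,11-05-2020", "1,B,200,17-06-2021", "2,C,300,19-02-2020", "3,D,400,27-06-2021");
        join(customers, orders).forEach(System.out::println);
    }
}
```

Customer 4 has no orders, so it produces no output row, which is what makes this an inner join.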
