
Friday, 30 April 2021

MapReduce Algorithm for Matrix Multiplication

MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely:

Map: Map tasks deal with splitting and mapping of data.

Reduce: Reduce tasks shuffle and reduce the data.

Assume that matrix A is an m*n matrix, where i indexes the rows and j indexes the columns, and whose elements aij are the values.

Similarly, assume that matrix B is an n*p matrix, where j indexes the rows and k indexes the columns, and whose elements bjk are the values.

The Map function algorithm (a Java sketch follows the pseudocode):

  • for each element aij of matrix A do
  • for every column k of the result (k = 0 to p-1), produce the (key, value) pair
  • key = (i, k), value = (A, j, aij)
  • for each element bjk of matrix B do
  • for every row i of the result (i = 0 to m-1), produce the (key, value) pair
  • key = (i, k), value = (B, j, bjk)
  • return the set of (key, value) pairs
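Below is a minimal Java (Hadoop) sketch of this map step. It assumes the records are comma-separated lines in the Matrix, i/k, j, value layout used in the example further down, and that the result dimensions m (rows of A) and p (columns of B) are passed through the job configuration; the class name MatrixMapper and the configuration keys m and p are illustrative, not part of the original algorithm.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MatrixMapper extends Mapper<LongWritable, Text, Text, Text> {
    private int m;  // number of rows of A (and of the result)
    private int p;  // number of columns of B (and of the result)

    @Override
    protected void setup(Context context) {
        m = context.getConfiguration().getInt("m", 2);
        p = context.getConfiguration().getInt("p", 2);
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] rec = line.toString().split(",");   // e.g. "A,0,1,2" or "B,1,0,3"
        if (rec[0].equals("A")) {
            int i = Integer.parseInt(rec[1]);
            int j = Integer.parseInt(rec[2]);
            String aij = rec[3];
            // duplicate aij for every column k of the result
            for (int k = 0; k < p; k++) {
                context.write(new Text(i + "," + k), new Text("A," + j + "," + aij));
            }
        } else {
            int k = Integer.parseInt(rec[1]);   // for B records the i/k column holds k
            int j = Integer.parseInt(rec[2]);
            String bjk = rec[3];
            // duplicate bjk for every row i of the result
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + k), new Text("B," + j + "," + bjk));
            }
        }
    }
}

Each element of A is emitted p times and each element of B m times, which is exactly the duplication worked out by hand in the example below.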

The Reduce function algorithm (a Java sketch follows the pseudocode):

  • for each key (i, k) do
  • sort the values that begin with A by j into list A
  • sort the values that begin with B by j into list B
  • multiply aij and bjk for the jth entry of each list
  • sum up the products aij × bjk over all j to get the result element cik
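A matching reduce-side sketch, again with an illustrative class name (MatrixReducer): for one key (i, k) it separates the A and B values, indexes them by j, and sums the products aij × bjk.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<Integer, Double> listA = new HashMap<>();  // j -> aij
        Map<Integer, Double> listB = new HashMap<>();  // j -> bjk
        for (Text value : values) {
            String[] rec = value.toString().split(",");   // e.g. "A,0,1"
            int j = Integer.parseInt(rec[1]);
            double v = Double.parseDouble(rec[2]);
            if (rec[0].equals("A")) {
                listA.put(j, v);
            } else {
                listB.put(j, v);
            }
        }
        double sum = 0.0;
        for (Map.Entry<Integer, Double> entry : listA.entrySet()) {
            Double b = listB.get(entry.getKey());
            if (b != null) {                 // multiply only matching j values
                sum += entry.getValue() * b;
            }
        }
        context.write(key, new Text(Double.toString(sum)));  // the element c_ik
    }
}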

Example: 
Consider a matrix A whose order is  m*n = i*j (2 * 3) 
and matrix B whose order is  n*p = j*k (3 * 2)



INPUT FILE :

Matrix | i/k | j | value
A      | 0   | 0 | 1
A      | 0   | 1 | 2
A      | 0   | 2 | 3
A      | 1   | 0 | 4
A      | 1   | 1 | 5
A      | 1   | 2 | 6
B      | 0   | 0 | 6
B      | 1   | 0 | 3
B      | 0   | 1 | 5
B      | 1   | 1 | 2
B      | 0   | 2 | 4
B      | 1   | 2 | 1


Where,
i represents rows of matrix A
j represents columns of matrix A
j represents rows of matrix B
k represents columns of matrix B
(note: for the B records in the input file, the i/k column holds k and the j column holds j)


Map function for both Matrix A and Matrix B:
Matrix A:
  • for each element of Matrix A do
  • produce the (key, value) pair as
  • key ---(i, k) and pair ---- (A,j,aij)
aij  = 1 (where i=0  j=0)
key ---(0, k) and pair ---- (A,0,aij), duplicated for every value of k (k = 0, 1)
key ---(0, 0) and pair ---- (A,0,1)
key ---(0, 1) and pair ---- (A,0,1)

aij  = 2 (where i=0  j=1)
key ---(0, k) and pair ---- (A,1,aij), duplicated for every value of k (k = 0, 1)
key ---(0, 0) and pair ---- (A,1,2)
key ---(0, 1) and pair ---- (A,1,2)

aij  = 3 (where i=0  j=2)
key ---(0, k) and pair ---- (A,2,aij), duplicated for every value of k (k = 0, 1)
key ---(0, 0) and pair ---- (A,2,3)
key ---(0, 1) and pair ---- (A,2,3)

aij  = 4 (where i=1  j=0)
key ---(1, k) and pair ---- (A,0,aij), duplicated for every value of k (k = 0, 1)
key ---(1, 0) and pair ---- (A,0,4)
key ---(1, 1) and pair ---- (A,0,4)

aij  = 5 (where i=1  j=1)
key ---(1, k) and pair ---- (A,1,aij), duplicated for every value of k (k = 0, 1)
key ---(1, 0) and pair ---- (A,1,5)
key ---(1, 1) and pair ---- (A,1,5)

aij  = 6 (where i=1  j=2)
key ---(1, k) and pair ---- (A,2,aij), duplicated for every value of k (k = 0, 1)
key ---(1, 0) and pair ---- (A,2,6)
key ---(1, 1) and pair ---- (A,2,6)

Matrix B:
  • for each element of Matrix B do
  • produce the (key, value) pair as
  • key ---(i, k) and pair ---- (B,j,bjk)
bjk = 6 (where  j=0  k=0)
key ---(i, 0) and pair ---- (B,0,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 0) and pair ---- (B,0,6)
key ---(1, 0) and pair ---- (B,0,6)

bjk = 3 (where  j=0  k=1)
key ---(i, 1) and pair ---- (B,0,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 1) and pair ---- (B,0,3)
key ---(1, 1) and pair ---- (B,0,3)

bjk = 5 (where  j=1  k=0)
key ---(i, 0) and pair ---- (B,1,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 0) and pair ---- (B,1,5)
key ---(1, 0) and pair ---- (B,1,5)

bjk = 2 (where  j=1  k=1)
key ---(i, 1) and pair ---- (B,1,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 1) and pair ---- (B,1,2)
key ---(1, 1) and pair ---- (B,1,2)

bjk = 4 (where  j=2  k=0)
key ---(i, 0) and pair ---- (B,2,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 0) and pair ---- (B,2,4)
key ---(1, 0) and pair ---- (B,2,4)

bjk = 1 (where  j=2  k=1)
key ---(i, 1) and pair ---- (B,2,bjk), duplicated for every value of i (i = 0, 1)
key ---(0, 1) and pair ---- (B,2,1)
key ---(1, 1) and pair ---- (B,2,1)

Shuffle/Group:
(0,0) ->
 (A,0,1) (A,1,2) (A,2,3)
 (B,0,6) (B,1,5) (B,2,4)

(0,1) ->
 (A,0,1) (A,1,2) (A,2,3)
 (B,0,3) (B,1,2) (B,2,1)

(1,0) ->
 (A,0,4) (A,1,5) (A,2,6)
 (B,0,6) (B,1,5) (B,2,4)

(1,1) ->
 (A,0,4) (A,1,5) (A,2,6)
 (B,0,3) (B,1,2) (B,2,1)

Reduce Function:

(0,0) -> 
 (A,0,1)   (A,1,2)     (A,2,3)
 (B,0,6)   (B,1,5)     (B,2,4)
___________________________
     1*6       2*5         3*4
        6    +   10     +   12      =    28
___________________________

(0,1) ->
(A,0,1)      (A,1,2)    (A,2,3)
(B,0,3)      (B,1,2)     (B,2,1)
___________________________
     1*3       2*2         3*1
        3    +     4     +      3      =    10
___________________________

(1,0) ->
(A,0,4)     (A,1,5)     (A,2,6)
(B,0,6)      (B,1,5)     (B,2,4)
___________________________
     4*6        5*5           6*4
        24    +   25     +   24      =    73
___________________________
(1,1) ->
(A,0,4)     (A,1,5)    (A,2,6)
(B,0,3)      (B,1,2)    (B,2,1)
___________________________
     4*3        5*2         6*1
        12    +   10     +   6      =    28
___________________________
The final step in the MapReduce algorithm is to produce the result matrix A × B. Collecting the reduced sums under their (i, k) keys gives:

A × B =  | 28  10 |
         | 73  28 |
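To make the example end-to-end, here is a hedged sketch of a driver that wires the MatrixMapper and MatrixReducer sketches above into a single Hadoop job. The dimensions m = 2 and p = 2 match the 2*3 by 3*2 example, and the input and output paths are taken from the command line; the class name MatrixMultiply is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiply {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("m", 2);   // rows of A
        conf.setInt("p", 2);   // columns of B
        Job job = Job.getInstance(conf, "matrix multiplication");
        job.setJarByClass(MatrixMultiply.class);
        job.setMapperClass(MatrixMapper.class);
        job.setReducerClass(MatrixReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // result directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running this over the input file above leaves one output line per (i, k) key: (0,0) 28, (0,1) 10, (1,0) 73 and (1,1) 28, matching the hand computation.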

Thursday, 22 April 2021

Hadoop-Ecosystem

Introduction: The Hadoop ecosystem is a platform or suite which provides various services to solve big data problems. It includes Apache projects and various commercial tools and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.

Following are the components that collectively form a Hadoop ecosystem:

HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator (resource management)
MapReduce: Programming-based data processing
Sqoop: Data exchange
Flume: Log collector
Hive: Query based processing of data services
Pig: Query based processing of data services
Mahout: Machine Learning algorithm libraries
R Connectors: Statistical data analysis
Ambari: Managing and monitoring Hadoop clusters
Zookeeper: Managing cluster
Oozie: Job Scheduling
Hbase: NoSQL Database


All these toolkits or components revolve around one thing, i.e. data. That is the beauty of Hadoop: everything revolves around data, which makes processing and analysing it easier.

HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, thereby maintaining the metadata in the form of log files.
HDFS consists of two core components, i.e.
Name Node
Data Node
The Name Node is the primary node, which contains metadata (data about data), while the Data Nodes store the actual data. These data nodes are commodity hardware in the distributed environment.
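As a small illustration, the sketch below writes a file into HDFS and reads it back through the standard org.apache.hadoop.fs.FileSystem API; the path /user/demo/hello.txt and the class name HdfsExample are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        try (FSDataOutputStream out = fs.create(file, true)) {  // overwrite if present
            out.writeUTF("stored in HDFS, replicated across data nodes");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}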


YARN:
Yet Another Resource Negotiator, as the name implies, helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
Resource Manager
Node Manager
Application Manager
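A short, hedged sketch of talking to the Resource Manager through YARN's Java client API: it simply lists the running nodes and their capacity (the class name YarnInfo is illustrative).

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new Configuration());                 // reads yarn-site.xml
        yarn.start();
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " -> " + node.getCapability());
        }
        yarn.stop();
    }
}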

MapReduce:
MapReduce makes it possible to carry over the processing logic and helps to write applications which transform big data sets into manageable ones. MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it in the form of groups. Map() generates a key-value-pair-based result which is later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.

Sqoop:
Sqoop is a tool designed to transfer data between Hadoop and relational database servers. It is used to import data from relational databases (such as Oracle and MySQL) to HDFS and export data from HDFS to relational databases.

Flume:
Apache Flume is a reliable and distributed system for collecting, aggregating and moving massive quantities of log data. It is used to collect log data present in log files from web servers and to aggregate it into HDFS for analysis.

Hive:
Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, querying, and analysis. Hive uses a language called HiveQL (HQL), which is similar to SQL.
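For illustration, a HiveQL query can be issued from Java over Hive's JDBC (HiveServer2) interface; the connection URL, credentials and the employees table below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}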

PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
Pig does the work of executing the commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
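As a sketch, the same flow can be driven from Java through the PigServer API; the input file, field names and output path below are placeholders.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE on a cluster
        pig.registerQuery("logs = LOAD 'access_log.csv' USING PigStorage(',') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
        pig.store("counts", "url_counts");   // Pig turns this into MapReduce jobs and writes the result
        pig.shutdown();
    }
}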

Mahout:
Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms.
It provides various libraries or functionalities such as collaborative filtering, clustering, and classification which are nothing but concepts of Machine learning. It allows invoking algorithms as per our need with the help of its own libraries.

R Connectors:
Oracle R Connector for Hadoop provides API access from a local R client to Hadoop, using these APIs: 
hadoop : Provides an interface to Hadoop MapReduce. 
hdfs : Provides an interface to HDFS. 
orch : Provides an interface between the local R instance and Oracle Database.

Ambari:
Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring and securing Apache Hadoop clusters. Hadoop management gets simpler as Ambari provides a consistent, secure platform for operational control.

Zookeeper:
There used to be a huge problem with managing coordination and synchronization among the resources or components of Hadoop, which often resulted in inconsistency. ZooKeeper overcomes these problems by performing synchronization, inter-component communication, grouping, and maintenance.
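A minimal sketch of that coordination role: the code below creates a configuration znode and reads it back through the ZooKeeper Java client (the connection string and the /demo-config path are placeholders).

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });
        byte[] data = "replicas=3".getBytes(StandardCharsets.UTF_8);
        if (zk.exists("/demo-config", false) == null) {
            // persistent znode visible to every component in the cluster
            zk.create("/demo-config", data,
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        byte[] read = zk.getData("/demo-config", false, null);
        System.out.println(new String(read, StandardCharsets.UTF_8));
        zk.close();
    }
}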

Oozie:
Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie coordinator jobs.
Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas
Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
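A hedged sketch of submitting a workflow job through the Oozie Java client; the Oozie URL, the HDFS application path and the nameNode/jobTracker properties are placeholders that depend on the cluster.

import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmit {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");
        Properties props = oozie.createConfiguration();
        props.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-workflow");
        props.setProperty("nameNode", "hdfs://namenode:8020");
        props.setProperty("jobTracker", "resourcemanager:8032");
        String jobId = oozie.run(props);            // submits and starts the workflow job
        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println(jobId + " -> " + job.getStatus());
    }
}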

Hbase:
Apache HBase is a Hadoop ecosystem component which is a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
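A small sketch of that real-time read/write path using the HBase Java client; it assumes a table named users with a column family info already exists (both names are placeholders).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // real-time write of one cell
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);
            // real-time read of the same cell
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}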





HADOOP VENDORS:

Friends-of-friends-Map Reduce program

Program to illustrate FOF Map Reduce: import java.io.IOException; import java.util.*; import org.apache.hadoop.conf.Configuration; import or...