
Wednesday, 30 June 2021

Zookeeper Architecture

Apache ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of operations that applications can build on for service discovery, dynamic configuration management, synchronization, and distributed locking. ZooKeeper is used to serialize tasks across clusters so that synchronization doesn’t have to be built separately into each service and project.

The components of the ZooKeeper architecture are explained below.

Client A client node in the distributed application cluster accesses information from a server. It sends periodic messages to let the server know that it is alive, and if the connected server does not respond, the client automatically redirects the message to another server.

Server A server node acknowledges the client to confirm that it is alive, and it provides all services to clients.

Leader A server node that performs automatic recovery if any of the connected server nodes fails.

Follower A server node that follows the instructions given by the leader.

Working of Apache ZooKeeper:

  • As soon as the ensemble (a group of ZooKeeper servers) starts, it waits for clients to connect to the servers.
  • Clients connect to one of the nodes in the ZooKeeper ensemble. That node can be either a leader or a follower.
  • Once a client is connected, the node assigns a session ID to the client and sends it an acknowledgement.
  • If the client does not receive an acknowledgement, it resends the message to another node in the ensemble and tries to connect to that one.
  • On receiving the acknowledgement, the client keeps the connection alive by sending heartbeats to the node at regular intervals.
  • Finally, the client can perform read, write, or store operations as needed.
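The connect/acknowledge/failover steps above can be sketched as a toy simulation in Python. This is not the real ZooKeeper client API (the real Java and C clients handle sessions and heartbeats internally); the server names and session IDs here are made up:

```python
import itertools

class ToyServer:
    """A stand-in for one ZooKeeper server; `up` decides whether it answers."""
    _session_ids = itertools.count(100)

    def __init__(self, name, up=True):
        self.name, self.up = name, up

    def connect(self):
        # A live server assigns a session ID and acknowledges the client.
        return next(ToyServer._session_ids) if self.up else None

def connect_with_failover(servers):
    """Try each server in the client's list until one acknowledges."""
    for server in servers:
        session_id = server.connect()
        if session_id is not None:
            return server.name, session_id
    raise RuntimeError("no server in the ensemble acknowledged")

ensemble = [ToyServer("zk1", up=False), ToyServer("zk2"), ToyServer("zk3")]
name, session_id = connect_with_failover(ensemble)
print(name, session_id)   # zk1 is down, so the client lands on zk2
```

The failover loop is the key idea: the client is given a list of servers, not a single address, so one dead node never strands it.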

How HBase uses ZooKeeper :


HBase uses ZooKeeper as a distributed coordination service for region assignments and to recover from region server crashes by loading the affected regions onto other, functioning region servers. ZooKeeper is a centralized monitoring service that maintains configuration information and provides distributed synchronization. Whenever a client wants to communicate with regions, it has to approach ZooKeeper first.

HMaster and the Region Servers are registered with the ZooKeeper service, and a client needs to access the ZooKeeper quorum in order to connect to the Region Servers and HMaster. In case of a node failure within an HBase cluster, the ZooKeeper quorum will raise error messages and start repairing the failed nodes.
The ZooKeeper service keeps track of all the region servers in an HBase cluster: how many region servers there are and which region servers are holding which DataNode. HMaster contacts ZooKeeper to get the details of the region servers.
Various services that ZooKeeper provides include:
  • Establishing client communication with region servers
  • Tracking server failures and network partitions
  • Maintaining configuration information
  • Providing ephemeral nodes, which represent different region servers
How to build applications with ZooKeeper :
ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.

In a distributed ZooKeeper implementation, there are multiple servers. This is known as ZooKeeper’s Replicated Mode. One server is elected as the leader and all additional servers are followers. If the ZooKeeper leader fails, then a new leader is elected.
All ZooKeeper servers must know about each other. Each server maintains an in-memory image of the overall state, as well as transaction logs and snapshots in persistent storage. Clients connect to just a single server; however, when a client starts, it can provide a list of servers. That way, if the client's connection to a server fails, the client connects to the next server in its list. Since each server maintains the same information, the client is able to continue functioning without interruption.
A ZooKeeper client can perform a read operation from any server in the ensemble, however a write operation must go through the ZooKeeper leader and requires a majority consensus to succeed.
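The majority requirement is simple arithmetic, and it explains why ensembles are usually sized 3, 5, or 7. A small sketch:

```python
def majority(ensemble_size):
    """Smallest number of servers whose acks complete a write (a quorum)."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while writes can still reach a quorum."""
    return ensemble_size - majority(ensemble_size)

for n in (3, 4, 5):
    print(f"{n} servers: quorum={majority(n)}, tolerates {tolerated_failures(n)} failures")
# A 4-server ensemble tolerates only 1 failure, the same as 3 servers,
# which is why odd ensemble sizes are recommended.
```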
ZooKeeper provides a flexible coordination infrastructure for distributed environments, and the framework supports many of today's best-known industrial applications.
Yahoo!
The ZooKeeper framework was originally built at Yahoo!. A well-designed distributed application needs to meet requirements such as data transparency, good performance, robustness, centralized configuration, and coordination, so Yahoo! designed the ZooKeeper framework to meet these requirements.
Apache Hadoop
Apache Hadoop is the driving force behind the growth of the Big Data industry. Hadoop relies on ZooKeeper for configuration management and coordination. ZooKeeper provides facilities for cross-node synchronization and ensures that tasks across Hadoop projects are serialized and synchronized.
Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information. 
Some real-world examples are:
Human Genome Project − The Human Genome Project contains terabytes of data. The Hadoop MapReduce framework can be used to analyze the dataset and find insights relevant to human development.
Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which are normally in terabytes.
Apache HBase
Apache HBase is an open-source, distributed, NoSQL database used for real-time read/write access to large datasets, and it runs on top of HDFS. HBase follows a master-slave architecture in which the HBase Master governs all the slaves; the slaves are referred to as Region Servers.
HBase distributed application installation depends on a running ZooKeeper cluster. Apache HBase uses ZooKeeper to track the status of distributed data throughout the master and region servers with the help of centralized configuration management and distributed mutex mechanisms. 
Here are some of the use-cases of HBase −
Telecom − The telecom industry stores billions of mobile call records (around 30 TB/month), and accessing these call records in real time becomes a huge task. HBase can be used to process all the records in real time, easily and efficiently.
Social network − Similar to telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge volumes of data through the posts created by users. HBase can be used to find recent trends and other interesting facts.
Apache Solr :
Apache Solr is a fast, open-source search platform written in Java. It is a blazing-fast, fault-tolerant distributed search engine. Built on top of Lucene, it is a high-performance, full-featured text search engine.
Solr extensively uses ZooKeeper features such as configuration management, leader election, node management, and locking and synchronization of data.
ZooKeeper contributes the following features:
  • Adding and removing nodes as and when needed
  • Replicating data between nodes, thereby minimizing data loss
  • Sharing data between multiple nodes and subsequently searching across multiple nodes for faster search results
Some of the use-cases of Apache Solr include e-commerce, job search, etc.


Tuesday, 22 June 2021

HBASE Architecture

HBase is a column-oriented data storage architecture that is formed on top of HDFS to overcome its limitations. 

HBase is a NoSQL (Not Only SQL) database, and it eases the process of maintaining data by distributing it evenly across the cluster.

Row-oriented storages

Traditional relational models store data in a row-based format, that is, as rows of data.

Row-oriented Database
  • Online transaction processing systems, such as those in the banking and finance domains, use this approach.
  • It is designed for a small number of rows and columns.

Column-oriented storages

Column-oriented storages store data tables in terms of columns and column families.

Column-oriented Database
  • This approach is used for processing and analytics, such as Online Analytical Processing (OLAP) and its applications.

  • The amount of data that can be stored in this model is very large, on the order of petabytes.

HBase (Schema) Data Model :

The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster. 

The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.

Tables :

HBase Tables are more like a logical collection of rows stored in separate partitions called Regions. Every Region is served by exactly one Region Server. The figure above shows a representation of a Table.

Rows :

A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].

Column Families :

Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features, like compression, are applied. Hence it is important that proper care be taken when designing the Column Families in a table.

The table above shows the Students and Branch Column Families.

The Students Column Family is made up of 2 columns – Name and Age.

The Branch Column Family is made up of 2 columns – Bname and GPA.

Columns :

A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon.

Ex : columnfamily:columnname. 

There can be multiple Columns within a Column Family and Rows within a table can have varied number of Columns.

Cell :

A Cell stores data and is essentially a unique combination of rowkey, Column Family, and Column (Column Qualifier). The data stored in a Cell is called its value, and the value is always treated as a byte[].

Version : 

The data stored in a cell is versioned, and versions of data are identified by their timestamp. The number of versions of data retained in a column family is configurable; the default value is 3.
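The whole model (rowkey, family:qualifier, timestamped versions, byte[] values) can be pictured as a nested map. Here is a toy sketch in Python, keeping HBase's default of 3 versions; the table and column names come from the example above, and this is an illustration, not HBase's actual storage code:

```python
from collections import defaultdict

MAX_VERSIONS = 3  # HBase's default number of retained versions per column family

# table[rowkey]["family:qualifier"] -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))

def put(rowkey, column, value, ts):
    versions = table[rowkey][column]
    versions.append((ts, value))
    versions.sort(reverse=True)      # newest version first
    del versions[MAX_VERSIONS:]      # retain only the newest MAX_VERSIONS

def get(rowkey, column):
    """Return the latest value for a cell, like a plain HBase Get."""
    versions = table[rowkey][column]
    return versions[0][1] if versions else None

for ts in range(1, 5):               # four writes; only the newest three survive
    put(b"row1", "Students:Name", f"v{ts}".encode(), ts)

print(get(b"row1", "Students:Name"))            # b'v4'
print(len(table[b"row1"]["Students:Name"]))     # 3
```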

Run modes :

HBase has two run modes: 

Standalone mode (this is the default mode) :

In standalone mode, HBase does not use HDFS—it uses the local filesystem instead—and it runs all HBase daemons and a local ZooKeeper in the same JVM process. ZooKeeper binds to a well-known port so that clients may talk to HBase.

Distributed mode :

Distributed mode can be further subdivided into pseudo-distributed, where all daemons run on a single node, and fully distributed, where the daemons are spread across multiple physical servers in the cluster. Distributed modes require an instance of the Hadoop Distributed File System (HDFS).
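The switch to distributed mode is made in hbase-site.xml. A minimal sketch, using HBase's standard property names (the NameNode host, port, and path below are placeholders, not values from this post):

```xml
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://namenode.example.com:8020/hbase</value>
</property>
```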

Components of HBase Architecture:

The HBase architecture comprises three major components:

  • HMaster
  • Region Server
  • ZooKeeper.

1. HMaster

HMaster operates much as its name suggests: it is the master that assigns regions to the Region Servers (slaves). The HBase architecture uses an auto-sharding process to maintain data: whenever an HBase table becomes too large, it is split and redistributed by the system with the help of HMaster. Some of the typical responsibilities of HMaster include:

  • Control the failover
  • Manage the Region Server and Hadoop cluster
  • Handle the DDL operations such as creating and deleting tables
  • Manage changes in metadata operations
  • Manage and assign regions to Region Servers
  • Accept requests and send them to the relevant Region Server

2. Region Server

Region Servers are the end nodes that handle all user requests. A single Region Server serves several regions, and each region contains all the rows between its start and end keys. Handling user requests is a complex task, so Region Servers are further divided into four components to make managing requests seamless.

  • Write-Ahead Log (WAL): A WAL is attached to every Region Server and stores temporary data that has not yet been committed to permanent storage.
  • Block Cache: The read cache; all recently read data is stored in the block cache, and data that is not used often is automatically evicted when the cache is full.
  • MemStore: The write cache, responsible for storing data not yet written to disk.
  • HFile: The HFile stores the actual data after it is committed.
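The interplay of the WAL, the MemStore, and HFiles on the write path can be sketched as a toy simulation. Note that real HBase flushes the MemStore by size (128 MB by default), not by a row count; the threshold here is made up for the illustration:

```python
# Toy write path: every put is logged to the WAL first, buffered in the
# MemStore, and flushed to an immutable "HFile" when the buffer fills up.
FLUSH_THRESHOLD = 3      # made-up row count; real HBase flushes by size

wal, memstore, hfiles = [], {}, []

def flush():
    # Persist the sorted in-memory buffer as an immutable HFile, then clear it.
    hfiles.append(dict(sorted(memstore.items())))
    memstore.clear()

def put(rowkey, value):
    wal.append((rowkey, value))      # 1. durability: append to the write-ahead log
    memstore[rowkey] = value         # 2. buffer the write in the MemStore
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()                      # 3. spill the buffer to a new HFile

for i in range(5):
    put(f"row{i}", i)
print(len(hfiles), len(memstore), len(wal))   # 1 HFile, 2 buffered rows, 5 WAL entries
```

If the Region Server crashes, everything still sitting in the MemStore can be replayed from the WAL, which is the whole point of writing the log first.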

3. ZooKeeper

ZooKeeper acts as the communication bridge of the HBase architecture. It is responsible for keeping track of all the Region Servers and the regions within them. Monitoring which Region Servers and HMaster are active, and which have failed, is also part of ZooKeeper's duties. When it finds that a Region Server has failed, it triggers the HMaster to take the necessary actions; if the active HMaster itself fails, it triggers an inactive HMaster, which becomes active after the alert. Every user, and even the HMaster, needs to go through ZooKeeper to access the Region Servers and the data within them. ZooKeeper stores the location of the .META. file, which contains a list of all the Region Servers. ZooKeeper's responsibilities include:

  • Establishing communication across the Hadoop cluster
  • Maintaining configuration information
  • Tracking Region Server and HMaster failure
  • Maintaining Region Server information

Monday, 21 June 2021

How to run a Hadoop shell script and install Hive on it within 5 minutes

 HADOOP INSTALLATION:

HIVE INSTALLATION: 

hduser@ubuntu:~$ wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz 

hduser@ubuntu:~$ tar xzf apache-hive-3.1.2-bin.tar.gz

hduser@ubuntu:~$ nano .bashrc
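The screenshot here showed the lines added to the end of .bashrc. A typical set, assuming Hive was unpacked into the home directory as in the commands above:

```shell
# Appended to ~/.bashrc (paths assume Hive was unpacked in the home directory)
export HIVE_HOME=$HOME/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin
```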

hduser@ubuntu:~$ source ~/.bashrc
hduser@ubuntu:~$ nano $HIVE_HOME/bin/hive-config.sh
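The screenshot showed an export appended to hive-config.sh so Hive knows where Hadoop lives. A typical line (the Hadoop path is an assumption; use your own installation directory):

```shell
# Appended to $HIVE_HOME/bin/hive-config.sh; adjust to your Hadoop install path
export HADOOP_HOME=/usr/local/hadoop
```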

hduser@ubuntu:~$ hdfs dfs -mkdir /tmp
hduser@ubuntu:~$ hdfs dfs -chmod g+w /tmp
hduser@ubuntu:~$ hdfs dfs -ls /
 
hduser@ubuntu:~$ hdfs dfs -mkdir -p /user/hive/warehouse
hduser@ubuntu:~$ hdfs dfs -chmod g+w /user/hive/warehouse
hduser@ubuntu:~$ hdfs dfs -ls /user/hive
 

hduser@ubuntu:~$ cd $HIVE_HOME/conf
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ 
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ cp hive-default.xml.template hive-site.xml
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ nano hive-site.xml
<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>

hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ schematool -initSchema -dbType derby

To remove the error (a Guava library version conflict between Hive and Hadoop), we need to follow these steps:
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ ls $HIVE_HOME/lib
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ ls $HADOOP_HOME/share/hadoop/hdfs/lib
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ rm $HIVE_HOME/lib/guava-19.0.jar
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/
hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ schematool -initSchema -dbType derby
Now identify the error reported by schematool:

hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ nano hive-site.xml

NOTE: by pressing Ctrl+T in nano we can jump to the error line.

Go to line 3225, column 96 of hive-site.xml, delete the invalid character there, and save the file.

hduser@ubuntu:~/apache-hive-3.1.2-bin/conf$ schematool -initSchema -dbType derby

Thursday, 17 June 2021

HIVE-JOINS

Basically, there are four types of Hive joins:

HiveQL Select Joins 
Inner joins :

The inner join retrieves the records common to both tables. It is the simplest kind of join: each match between the input tables results in a row in the output. Consider the two tables below.

Syntax:
SELECT table1.col1, table2.col1 
FROM table2 
JOIN table1 
ON (table1.matching_col = table2.matching_col);
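To make the matching rule concrete, here is the same inner-join semantics in plain Python over the student and dept rows from Example 1 below. This is an illustration of what the query computes, not how Hive executes it:

```python
# (sid, name, city) and (did, dname, dsid) tuples from the tables below
students = [(1, "ABC", "London"), (2, "BCD", "Mumbai"), (3, "CDE", "Bangalore"),
            (4, "DEF", "Mumbai"), (5, "EFG", "Bangalore")]
dept = [(100, "CS", 1), (101, "Maths", 1), (102, "Physics", 2), (103, "Chem", 3)]

# Inner join: a row is emitted only when the keys match on both sides.
inner = [s + d for s in students for d in dept if s[0] == d[2]]
for row in inner:
    print(row)
```

This yields the same four rows as the Hive output below: students 4 and 5 have no department, so they simply do not appear.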
Example 1: Using the create-table method (multiple records can be loaded at the same time)
Table 1 (Students):
students.id    students.name    students.city
1    ABC    London
2    BCD    Mumbai
3    CDE    Bangalore
4    DEF    Mumbai
5    EFG    Bangalore

Table 2 (Dept): 

dept.did    dept.name    dept.sid
100    CS    1
101    Maths    1
102    Physics    2
103    Chem    3
 
hive> create database joins ;
hive> use joins ;
 
TABLE1:
hive> create table student (sid int,sname string,city string)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/hduser/Desktop/student.txt' into table student ;
hive> select * from student ;
 
TABLE2:
hive> create table dept (did int,dname string,dsid int)
    > row format delimited
    > fields terminated by '\t';
hive> load data local inpath '/home/hduser/Desktop/dept.txt' into table dept ;
hive> select * from dept ;
 
Example 1:
Query:

hive> desc student ;
OK
sid                     int                                        
sname               string                                     
city                    string                                     
Time taken: 0.071 seconds, Fetched: 3 row(s)
hive> desc dept ;
OK
did                     int                                        
dname               string                                     
dsid                   int                                        
Time taken: 0.072 seconds, Fetched: 3 row(s)
hive> select student.*,dept.*
    > from student JOIN dept
    > ON (student.sid = dept.dsid);

OK
1    ABC    London     100    CS    1
1    ABC    London     101    Maths    1
2    BCD    Mumbai     102    Physics    2
3    CDE    Bangalore     103    Chem    3

Example 2: Using the insert method (only one record can be loaded into the table at a time)

Table 1 (sales):
name    id 
ram        2
sita         4
raju        0     
rani        3
vani        2

Table 2 (products): 

pid        pname
2        football
4        cricketball
3        hat     
1        bat 

Query:

hive> create database joins ;
hive> show databases ;
hive> use joins ; 
hive> create table sales (name string,id int);
hive> create table products (pid int ,pname string);
hive> show tables ;
hive> desc products ;
hive> desc sales ;
hive> desc products ;
hive> insert into table sales values('ram',2);
hive> insert into table sales values('sita',4);
hive> insert into table sales values('raju',0);
hive> insert into table sales values('rani',3);
hive> insert into table sales values('vani',2);
hive> select * from sales ;
OK
ram    2
sita    4
raju    0
rani    3
vani    2
hive> select * from products ;
OK
2    football
4    cricketball
3    hat
1    bat 
 
hive> select sales.*,products.*
    > from sales JOIN products
    > ON ( sales.id = products.pid);
OK
ram    2    2    football
vani    2    2    football
sita     4    4    cricketball
rani    3    3    hat
 

Left Outer Join

This type of join returns all rows from the left table even if there is no matching row in the right table: the result contains every left-table row plus the matching rows from the right table, and unmatched right-table records are filled with NULL.

Syntax:

SELECT table1.col1, table2.col2
from table1 LEFT OUTER JOIN table2
ON (table1.matching_col = table2.matching_col);
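The NULL-padding behaviour can be sketched in plain Python over the student and dept rows from Example 1 (an illustration of the semantics, not Hive's execution):

```python
students = [(1, "ABC", "London"), (2, "BCD", "Mumbai"), (3, "CDE", "Bangalore"),
            (4, "DEF", "Mumbai"), (5, "EFG", "Bangalore")]
dept = [(100, "CS", 1), (101, "Maths", 1), (102, "Physics", 2), (103, "Chem", 3)]

left = []
for s in students:
    matches = [d for d in dept if d[2] == s[0]]
    if matches:
        left.extend(s + d for d in matches)
    else:
        left.append(s + (None, None, None))   # NULLs stand in for the missing right side
for row in left:
    print(row)
```

Every student row survives; DEF and EFG pick up NULLs, giving the six-row result shown below.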

Example 1:

Query:

hive> select student.*,dept.*
    > from student LEFT OUTER JOIN dept
    > ON (student.sid = dept.dsid);

OK
1    ABC    London     100    CS    1
1    ABC    London     101    Maths    1
2    BCD    Mumbai     102    Physics    2
3    CDE    Bangalore     103    Chem    3
4    DEF    Mumbai     NULL    NULL    NULL
5    EFG    Bangalore    NULL    NULL    NULL

 

Example 2:

Query:

hive> create database joins ;
hive> show databases ;
hive> use joins ; 
hive> create table sales (name string,id int);
hive> create table products (pid int ,pname string);
hive> show tables ;
hive> desc products ;
hive> desc sales ;
hive> desc products ;
hive> insert into table sales values('ram',2);
hive> insert into table sales values('sita',4);
hive> insert into table sales values('raju',0);
hive> insert into table sales values('rani',3);
hive> insert into table sales values('vani',2);
hive> select * from sales ;
OK
ram    2
sita    4
raju    0
rani    3
vani    2
hive> select * from products ;
OK
2    football
4    cricketball
3    hat
1    bat

hive> select sales.*,products.*
    > from sales LEFT OUTER JOIN products
    > ON ( sales.id = products.pid);

OK
ram    2    2    football
sita    4    4    cricketball
raju    0    NULL    NULL
rani    3    3    hat
vani    2    2    football

 

Right Outer Join

It returns all rows from the right table even if there is no matching row in the left table: the result contains every right-table row plus the matching rows from the left table, and unmatched left-table records are filled with NULL.

Syntax:

SELECT table1.col1, table2.col2
from table1 RIGHT OUTER JOIN table2
ON (table1.matching_col = table2.matching_col);
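The mirror-image semantics show up best on the sales/products rows from Example 2, where the product "bat" has no sale. A plain-Python sketch (an illustration, not Hive's execution):

```python
sales = [("ram", 2), ("sita", 4), ("raju", 0), ("rani", 3), ("vani", 2)]
products = [(2, "football"), (4, "cricketball"), (3, "hat"), (1, "bat")]

right = []
for pid, pname in products:
    matches = [s for s in sales if s[1] == pid]
    if matches:
        right.extend(m + (pid, pname) for m in matches)
    else:
        right.append((None, None, pid, pname))   # NULLs stand in for the missing left side
for row in right:
    print(row)
```

Every product row survives; "bat" picks up NULLs on the sales side, while "raju" (who bought nothing) disappears, exactly as in the Hive output below.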

Query:

hive> select student.*,dept.*
    > from student RIGHT OUTER JOIN dept
    > ON (student.sid = dept.dsid);

OK
1    ABC    London     100    CS    1
1    ABC    London     101    Maths    1
2    BCD    Mumbai     102    Physics    2
3    CDE    Bangalore     103    Chem    3
 

Example 2:

Query:

hive> create database joins ;
hive> show databases ;
hive> use joins ; 
hive> create table sales (name string,id int);
hive> create table products (pid int ,pname string);
hive> show tables ;
hive> desc products ;
hive> desc sales ;
hive> desc products ;
hive> insert into table sales values('ram',2);
hive> insert into table sales values('sita',4);
hive> insert into table sales values('raju',0);
hive> insert into table sales values('rani',3);
hive> insert into table sales values('vani',2);
hive> select * from sales ;
OK
ram    2
sita    4
raju    0
rani    3
vani    2
hive> select * from products ;
OK
2    football
4    cricketball
3    hat
1    bat

hive> select sales.*,products.*
    > from sales RIGHT OUTER JOIN products
    > ON ( sales.id = products.pid);

OK
ram    2    2    football
vani    2    2    football
sita    4    4    cricketball
rani    3    3    hat
NULL    NULL    1    bat



Full Outer Join

It returns all rows from both tables: the matched rows are combined, and the unmatched rows from either table are returned with NULLs on the other side.

Syntax:

SELECT table1.col1, table1.col2, table2.col1, table2.col2
FROM table1 
FULL OUTER JOIN table2 ON (table1.matching_col = table2.matching_col);
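A full outer join is the union of the left and right outer joins. Sketched in plain Python over the sales/products rows from Example 2 (an illustration of the semantics, not Hive's execution):

```python
sales = [("ram", 2), ("sita", 4), ("raju", 0), ("rani", 3), ("vani", 2)]
products = [(2, "football"), (4, "cricketball"), (3, "hat"), (1, "bat")]

full, matched_pids = [], set()
for name, sid in sales:                      # left side, padded like a left outer join
    matches = [p for p in products if p[0] == sid]
    if matches:
        for p in matches:
            full.append((name, sid) + p)
            matched_pids.add(p[0])
    else:
        full.append((name, sid, None, None))
for pid, pname in products:                  # then add right-side rows nobody matched
    if pid not in matched_pids:
        full.append((None, None, pid, pname))
for row in full:
    print(row)
```

Both unmatched rows survive here: "raju" with NULLs on the product side and "bat" with NULLs on the sales side, giving six rows in total.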

Example 1 results:

student_id    student_name    dept_id    dept_name
1    ABC    100    CS
1    ABC    101    Maths
2    BCD    102    Physics
3    CDE    103    Chem
4    DEF    NULL    NULL
5    EFG    NULL    NULL
Example 2:

Query:

hive> create database joins ;
hive> show databases ;
hive> use joins ; 
hive> create table sales (name string,id int);
hive> create table products (pid int ,pname string);
hive> show tables ;
hive> desc products ;
hive> desc sales ;
hive> desc products ;
hive> insert into table sales values('ram',2);
hive> insert into table sales values('sita',4);
hive> insert into table sales values('raju',0);
hive> insert into table sales values('rani',3);
hive> insert into table sales values('vani',2);
hive> select * from sales ;
OK
ram    2
sita    4
raju    0
rani    3
vani    2
hive> select * from products ;
OK
2    football
4    cricketball
3    hat
1    bat

hive> select sales.*,products.*
    > from sales FULL OUTER JOIN products
    > ON ( sales.id = products.pid);

OK
raju    0    NULL    NULL
NULL    NULL    1    bat
vani    2    2    football
ram    2    2    football
rani    3    3    hat
sita    4    4    cricketball 

 
Example 3:

hive> desc sales  ;
OK
name                    string                                     
sid                     int                                        
Time taken: 0.107 seconds, Fetched: 2 row(s)
hive> desc products  ;
OK
pid                     int                                        
pname                   string                                     
Time taken: 0.17 seconds, Fetched: 2 row(s)
hive>

Inner Join:

Left outer Join:

 

Right outer Join:

Full outer Join:

 

 
