HBase is a column-oriented data store built on top of HDFS to overcome HDFS's limitations, such as the lack of fast random reads and writes on individual records.
HBase is a NoSQL (Not Only SQL) database, and it eases the process of maintaining data by distributing it evenly across the cluster.
Row-oriented storages
Traditional relational databases store data in a row-based format: all the fields of one record are stored together as a row.
- Online transaction processing (OLTP) workloads, such as those in the banking and finance domains, use this approach.
- It is designed for a small number of rows and columns.
Column-oriented storages
Column-oriented storages store data tables in terms of columns and column families.
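The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration of the idea, not how HBase or any relational engine lays out bytes on disk; the records, field names and helper functions are invented for the example.

```python
# Toy records; the field names and values are illustrative only.
records = [
    {"id": 1, "name": "Asha", "age": 21},
    {"id": 2, "name": "Ravi", "age": 23},
    {"id": 3, "name": "Mei", "age": 22},
]


def to_row_layout(rows):
    """Row-oriented: all fields of one record sit next to each other."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Column-oriented: all values of one field sit next to each other."""
    columns = {}
    for r in rows:
        for field, value in r.items():
            columns.setdefault(field, []).append(value)
    return columns


row_layout = to_row_layout(records)
column_layout = to_column_layout(records)

# Reading one whole record touches a single entry in the row layout.
second_record = row_layout[1]
# Aggregating one field touches a single list in the column layout.
average_age = sum(column_layout["age"]) / len(column_layout["age"])
```

Fetching a complete record is a single lookup in the row layout, while scanning one field over many records is a single list in the column layout.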
The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster.
The Data Model in HBase is made of different logical components such as Tables, Rows, Column Families, Columns, Cells and Versions.
Tables :
An HBase Table is a logical collection of rows stored in separate partitions called Regions. As shown above, every Region is served by exactly one Region Server. The figure above shows a representation of a Table.
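A rough sketch of how a rowkey could be mapped to its Region and Region Server is given below, assuming each Region covers a contiguous range of rowkeys. The class name RegionLocator, the server names and the split points are hypothetical; this is not the HBase client API.

```python
import bisect


class RegionLocator:
    """Maps a rowkey to the region (and server) whose key range contains it."""

    def __init__(self, regions):
        # regions: list of (start_key, server_name). Start keys are bytes,
        # and the first region must start at b"" (the smallest possible key).
        self._regions = sorted(regions)
        self._starts = [start for start, _ in self._regions]
        if not self._starts or self._starts[0] != b"":
            raise ValueError("first region must start at the empty key")

    def locate(self, rowkey: bytes):
        # Pick the last region whose start key is <= rowkey.
        index = bisect.bisect_right(self._starts, rowkey) - 1
        return self._regions[index]


# Hypothetical table split into three regions on three servers.
locator = RegionLocator([(b"", "rs1"), (b"g", "rs2"), (b"p", "rs3")])
server_for_apple = locator.locate(b"apple")[1]
server_for_zebra = locator.locate(b"zebra")[1]
```

Because every start key appears exactly once, each rowkey resolves to exactly one region and therefore to exactly one server.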
Rows :
A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].
Column Families :
Data in a row is grouped into Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features, such as compression, are applied. Hence it is important to take proper care when designing the Column Families of a table.
The table above shows the Students and Branch Column Families.
The Students Column Family is made up of two columns: Name and Age.
The Branch Column Family is made up of two columns: Bname and GPA.
Columns :
A Column Family is made of one or more columns. A Column is addressed by the Column Family name concatenated with the Column Qualifier (the column name) using a colon.
Ex : columnfamily:columnname.
There can be multiple Columns within a Column Family, and different Rows in a table can have different numbers of Columns.
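The naming convention can be illustrated with a small sketch that splits a column name at the colon and groups a row's columns by family. The helper names and the sample values are made up for the example; only the Students and Branch families come from the text.

```python
def split_column(column: bytes):
    """Split b"family:qualifier" at the first colon."""
    family, sep, qualifier = column.partition(b":")
    if not sep or not family:
        raise ValueError(f"expected family:qualifier, got {column!r}")
    return family, qualifier


# Two rows of the same table with different sets of columns.
table = {
    b"row1": {
        b"Students:Name": b"Asha",
        b"Students:Age": b"21",
        b"Branch:Bname": b"CSE",
    },
    b"row2": {b"Students:Name": b"Ravi"},
}


def columns_by_family(row):
    """Group the qualifiers present in one row under their family."""
    grouped = {}
    for column in row:
        family, qualifier = split_column(column)
        grouped.setdefault(family, []).append(qualifier)
    return grouped


row1_columns = columns_by_family(table[b"row1"])
row2_columns = columns_by_family(table[b"row2"])
```

Here row1 carries three columns across two families while row2 carries a single column, which is the kind of variation the data model allows.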
Cell :
A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version :
The data stored in a cell is versioned, and versions are identified by their timestamp. The number of versions retained in a column family is configurable; the default is 1 since HBase 0.96 (earlier releases defaulted to 3).
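The cell and version concepts can be modelled as a map from (rowkey, Column Family, Column Qualifier) to a set of timestamped values. The sketch below is a simplification under stated assumptions: the class VersionedCells is hypothetical, and it discards surplus versions immediately on write, whereas a real store may clean them up later.

```python
class VersionedCells:
    """Toy store: (rowkey, family, qualifier) -> {timestamp: value}."""

    def __init__(self, max_versions):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self._cells = {}

    def put(self, rowkey, family, qualifier, value, timestamp):
        versions = self._cells.setdefault((rowkey, family, qualifier), {})
        versions[timestamp] = value  # writing the same timestamp overwrites
        # Keep only the newest max_versions timestamps.
        for old in sorted(versions)[:-self.max_versions]:
            del versions[old]

    def get(self, rowkey, family, qualifier, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        stored = self._cells.get((rowkey, family, qualifier), {})
        return sorted(stored.items(), reverse=True)[:versions]


store = VersionedCells(max_versions=3)
for ts in (1, 2, 3, 4):
    store.put(b"row1", b"Students", b"Age", b"v%d" % ts, timestamp=ts)

latest = store.get(b"row1", b"Students", b"Age")
history = store.get(b"row1", b"Students", b"Age", versions=10)
```

After four writes with a limit of three, the oldest version is gone and a default read returns only the newest value.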
Run modes :
HBase has two run modes:
• Standalone mode (this is the default mode) :
In standalone mode, HBase does not use HDFS; it uses the local filesystem instead, and it runs all HBase daemons and a local ZooKeeper in the same JVM process. ZooKeeper binds to a well-known port so that clients may talk to HBase.
• Distributed mode :
The distributed mode can be further subdivided into pseudo-distributed, where all daemons run on a single node, and fully distributed, where the daemons are spread across multiple physical servers in the cluster. Distributed modes require an instance of the Hadoop Distributed File System (HDFS).
The HBase architecture comprises three major components:
- HMaster
- Region Server
- ZooKeeper
1. HMaster
As its name suggests, HMaster is the master process: it assigns regions to the Region Servers (slaves). HBase uses an auto-sharding process to maintain data. In this process, whenever an HBase table grows too large, the system splits it and redistributes the pieces with the help of HMaster. Typical responsibilities of HMaster include:
- Control the failover
- Manage the Region Server and Hadoop cluster
- Handle the DDL operations such as creating and deleting tables
- Manage changes in metadata operations
- Manage and assign regions to Region Servers
- Accept administrative requests and send them to the relevant Region Server
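The auto-sharding idea described above can be sketched as a split step followed by an assignment step. This is a toy model under stated assumptions: the row-count threshold, the Region dataclass, and the round-robin assignment are all illustrative choices, not HBase's actual split policy or balancer.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    start_key: bytes
    end_key: bytes  # exclusive; b"" means "no upper bound"
    rows: dict = field(default_factory=dict)  # rowkey -> value


def split_if_too_large(region, max_rows):
    """Return [region] unchanged, or two daughters split at the middle key."""
    if len(region.rows) <= max_rows:
        return [region]
    keys = sorted(region.rows)
    split_key = keys[len(keys) // 2]
    left = Region(region.start_key, split_key,
                  {k: v for k, v in region.rows.items() if k < split_key})
    right = Region(split_key, region.end_key,
                   {k: v for k, v in region.rows.items() if k >= split_key})
    return [left, right]


def assign_round_robin(regions, servers):
    """Hypothetical master-side assignment: hand out regions in turn."""
    return {(r.start_key, r.end_key): servers[i % len(servers)]
            for i, r in enumerate(regions)}


big = Region(b"", b"", {b"a": 1, b"b": 2, b"c": 3, b"d": 4})
daughters = split_if_too_large(big, max_rows=3)
assignment = assign_round_robin(daughters, ["rs1", "rs2"])
```

The two daughter regions together cover the same key range as the parent, so no rowkey is left without a home after the split.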
2. Region Server
Region Servers are the end nodes that handle all user requests. A single Region Server hosts several regions, and each region contains all the rows between a start key and an end key. Because handling user requests is a complex task, a Region Server is further divided into four components:
- Write-Ahead Log (WAL): A WAL is attached to every Region Server and records new data that has not yet been persisted to permanent storage, so that it can be recovered after a failure.
- Block Cache: It is a read cache; recently read data is stored in the block cache. Data that is not used often is automatically evicted from the cache when it is full.
- MemStore: It is a write cache that holds data not yet written to disk.
- HFile: The HFile stores the actual data once it has been written to disk.
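How the four components cooperate on a write and a read can be sketched as follows. This is a deliberately small model, not HBase code: the class ToyRegionServer, the flush threshold and the cache size are hypothetical, the cache holds single key/value pairs rather than file blocks, and the log is never truncated.

```python
from collections import OrderedDict


class ToyRegionServer:
    def __init__(self, memstore_limit=2, cache_size=2):
        self.wal = []                     # append-only log of every write
        self.memstore = {}                # in-memory write cache
        self.hfiles = []                  # immutable flushed files, oldest first
        self.block_cache = OrderedDict()  # read cache in LRU order
        self.memstore_limit = memstore_limit
        self.cache_size = cache_size

    def put(self, key, value):
        self.wal.append((key, value))     # 1. log first, for recovery
        self.memstore[key] = value        # 2. then buffer in memory
        self.block_cache.pop(key, None)   # drop any stale cached value
        if len(self.memstore) >= self.memstore_limit:
            self.flush()

    def flush(self):
        # Write the MemStore out as one sorted, immutable "HFile".
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:          # newest data lives in the MemStore
            return self.memstore[key]
        if key in self.block_cache:       # then the read cache
            self.block_cache.move_to_end(key)
            return self.block_cache[key]
        for hfile in reversed(self.hfiles):  # then HFiles, newest first
            if key in hfile:
                self._cache(key, hfile[key])
                return hfile[key]
        return None

    def _cache(self, key, value):
        self.block_cache[key] = value
        if len(self.block_cache) > self.cache_size:
            self.block_cache.popitem(last=False)  # evict least recently used


rs = ToyRegionServer(memstore_limit=2, cache_size=1)
rs.put("a", "1")
rs.put("b", "2")   # second write reaches the limit and triggers a flush
rs.put("a", "3")   # newer value for "a" stays in the MemStore
value_a = rs.get("a")
value_b = rs.get("b")
```

Reads check the MemStore first so that the newest write wins, and HFiles are searched newest first for the same reason.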
3. ZooKeeper
ZooKeeper acts as the coordination layer of the HBase architecture. It keeps track of all the Region Servers and the regions they host, and it monitors which Region Servers and HMaster instances are active and which have failed. When it detects that a Region Server has failed, it notifies the HMaster so that it can take the necessary actions. If the active HMaster itself fails, ZooKeeper notifies a standby HMaster, which then becomes active. Clients and the HMaster both go through ZooKeeper to locate Region Servers and the data within them. ZooKeeper stores the location of the .META. table (named hbase:meta in current releases), which lists the regions and the Region Servers that host them. ZooKeeper's responsibilities include:
- Establishing communication across the Hadoop cluster
- Maintaining configuration information
- Tracking Region Server and HMaster failure
- Maintaining Region Server information
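The failure-handling behaviour described for ZooKeeper can be sketched with a toy coordinator. In a real deployment ZooKeeper detects failures through expiring sessions and ephemeral nodes; here a failure is simply reported by a method call, and the class ToyCoordinator, its fields and the node names are all hypothetical.

```python
class ToyCoordinator:
    """Tracks live servers and the active master, as a stand-in for ZooKeeper."""

    def __init__(self, masters, region_servers):
        self.active_master = masters[0]
        self.standby_masters = list(masters[1:])
        self.live_region_servers = set(region_servers)
        self.events = []  # (recipient, message) notifications, kept for inspection

    def report_failure(self, node):
        if node in self.live_region_servers:
            self.live_region_servers.discard(node)
            # Tell the active master so it can reassign the lost regions.
            self.events.append(
                (self.active_master, f"region server {node} failed"))
        elif node == self.active_master:
            if not self.standby_masters:
                self.active_master = None  # no standby left to promote
                return
            self.active_master = self.standby_masters.pop(0)
            self.events.append(
                (self.active_master, f"promoted after {node} failed"))
        elif node in self.standby_masters:
            self.standby_masters.remove(node)


zk = ToyCoordinator(masters=["m1", "m2"], region_servers=["rs1", "rs2"])
zk.report_failure("rs1")  # active master m1 is told about the lost server
zk.report_failure("m1")   # standby m2 takes over as active master
```

The events list makes the two notification paths visible: one to the active master when a Region Server is lost, and one to the promoted standby when the master is lost.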