Apache ZooKeeper is an open-source coordination service for distributed applications. It exposes a simple set of operations that applications can build on for service discovery, dynamic configuration management, synchronization, and distributed locking. ZooKeeper is used to serialize tasks across clusters so that synchronization doesn’t have to be built separately into each service and project.
The components of the ZooKeeper architecture are described below.
Client: A client node in the distributed application cluster accesses information from the server. It sends a message to the server at regular intervals to let the server know that the client is alive. If the connected server does not respond, the client automatically redirects the message to another server.
Server: A node in the ZooKeeper ensemble that provides all services to clients. It sends an acknowledgement to the client to signal that the server is alive.
Leader: The server node that performs automatic recovery if any of the connected nodes fails.
Follower: A server node that follows the instructions given by the leader.
How Apache ZooKeeper works:
- When the ensemble (a group of ZooKeeper servers) starts, it first waits for clients to connect.
- Each client then connects to one of the nodes in the ensemble, which may be either the leader or a follower.
- Once a client is connected, the node assigns it a session ID and sends it an acknowledgement.
- If the client does not receive an acknowledgement, it tries to connect to another node in the ensemble.
- After receiving the acknowledgement, the client sends heartbeats to the node at regular intervals so that the connection is not lost.
- Finally, the client can read or write data as needed.
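The connection steps above can be sketched in a few lines of Python. This is a minimal in-memory simulation, not the ZooKeeper client API: the Node and Client classes, their methods, and the server names zk1 to zk3 are hypothetical names introduced for illustration. The sketch also assumes that a client keeps its session ID when it moves to another node.

```python
import itertools


class Node:
    """Hypothetical stand-in for one ZooKeeper server (in-memory only)."""

    _session_ids = itertools.count(1)

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def connect(self, session_id=None):
        # A dead node stays silent; a live node acknowledges with a session ID.
        if not self.alive:
            return None
        if session_id is not None:
            return session_id          # assumption: an existing session is kept
        return next(Node._session_ids)

    def heartbeat(self, session_id):
        # The acknowledgement is only sent while the node is alive.
        return self.alive


class Client:
    """Hypothetical client that walks the ensemble until a node answers."""

    def __init__(self, ensemble):
        self.ensemble = ensemble
        self.node = None
        self.session_id = None

    def connect(self):
        # Try each node in turn until one acknowledges the connection.
        for node in self.ensemble:
            session_id = node.connect(self.session_id)
            if session_id is not None:
                self.node, self.session_id = node, session_id
                return node.name
        raise ConnectionError("no node in the ensemble responded")

    def send_heartbeat(self):
        # A missing acknowledgement makes the client try another node.
        if not self.node.heartbeat(self.session_id):
            return self.connect()
        return self.node.name


ensemble = [Node("zk1", alive=False), Node("zk2"), Node("zk3")]
client = Client(ensemble)
connected_to = client.connect()   # zk1 is silent, so the client lands on zk2
```

A real client library handles these steps internally; the sketch only makes the order of events visible.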
HBase uses ZooKeeper:
HBase uses ZooKeeper as a distributed coordination service for region assignment and for recovering from region server crashes by reassigning the affected regions to region servers that are still functioning. ZooKeeper acts as a centralized monitoring service that maintains configuration information and provides distributed synchronization. A client that wants to communicate with regions has to contact ZooKeeper first.
HMaster and the region servers are registered with the ZooKeeper service, so a client needs to access the ZooKeeper quorum in order to connect to the region servers and HMaster. If a node fails within an HBase cluster, the ZooKeeper quorum triggers error messages and starts repairing the failed nodes.
The ZooKeeper service keeps track of all the region servers in an HBase cluster, including how many region servers there are and which region servers are holding which DataNode. HMaster contacts ZooKeeper to get the details of the region servers.
The services that ZooKeeper provides include:
• Establishing client communication with region servers.
• Tracking server failures and network partitions.
• Maintaining configuration information.
• Providing ephemeral nodes, which represent different region servers.
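The way ephemeral nodes can represent live region servers can be sketched with a small in-memory model. This is not the ZooKeeper API: the EphemeralRegistry class, its methods, and the paths used below are hypothetical and chosen only for illustration. The one behaviour it relies on is that an ephemeral node is removed when the session that created it ends.

```python
class EphemeralRegistry:
    """Hypothetical in-memory stand-in for a tree of ephemeral znodes."""

    def __init__(self):
        self._owner_by_path = {}   # znode path -> session that created it

    def create_ephemeral(self, path, session_id):
        if path in self._owner_by_path:
            raise KeyError(f"node already exists: {path}")
        self._owner_by_path[path] = session_id

    def close_session(self, session_id):
        # Ephemeral nodes disappear when the session that created them ends.
        expired = [p for p, s in self._owner_by_path.items() if s == session_id]
        for path in expired:
            del self._owner_by_path[path]
        return sorted(expired)

    def children(self, parent):
        # List the names of the nodes directly registered under `parent`.
        prefix = parent.rstrip("/") + "/"
        return sorted(
            p[len(prefix):] for p in self._owner_by_path if p.startswith(prefix)
        )


registry = EphemeralRegistry()
registry.create_ephemeral("/servers/region-a", session_id=1)  # illustrative path
registry.create_ephemeral("/servers/region-b", session_id=2)  # illustrative path
live_before = registry.children("/servers")
registry.close_session(2)          # region-b crashes and its session ends
live_after = registry.children("/servers")
```

A master process that lists the children of the parent path therefore sees only the servers whose sessions are still alive.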
How to build applications with ZooKeeper:
ZooKeeper is a high-performance coordination service for distributed applications. It exposes common services - such as naming, configuration management, synchronization, and group services - in a simple interface so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own, specific needs.
In a distributed ZooKeeper implementation, there are multiple servers. This is known as ZooKeeper’s Replicated Mode. One server is elected as the leader and all additional servers are followers. If the ZooKeeper leader fails, then a new leader is elected.
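The text does not say how a new leader is chosen. As a rough sketch only, the snippet below assumes that the surviving servers prefer the candidate with the most recent state, comparing election epoch first, then last transaction ID, then server ID as a tie-breaker. The Vote type and elect_leader function are hypothetical names, and the real election exchanges several rounds of messages that are not modelled here.

```python
from typing import NamedTuple


class Vote(NamedTuple):
    """Hypothetical summary of one server's state during an election."""

    epoch: int       # election epoch the server last took part in
    zxid: int        # ID of the last transaction the server has logged
    server_id: int   # unique ID configured for the server


def elect_leader(votes):
    """Return the server_id of the candidate with the most up-to-date state."""
    if not votes:
        raise ValueError("no candidates")
    # Tuples compare field by field: highest epoch, then highest zxid,
    # then highest server ID as the final tie-breaker.
    return max(votes).server_id


servers = [Vote(1, 100, 1), Vote(1, 120, 2), Vote(1, 120, 3)]
leader = elect_leader(servers)

# If the leader fails, the remaining servers elect a new one.
survivors = [v for v in servers if v.server_id != leader]
new_leader = elect_leader(survivors)
```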
All ZooKeeper servers must know about each other. Each server maintains an in-memory image of the overall state, as well as transaction logs and snapshots in persistent storage. A client connects to just a single server, but when it starts it can be given a list of servers. If the connection to its current server fails, the client connects to the next server in the list. Since each server maintains the same information, the client is able to continue to function without interruption.
A ZooKeeper client can perform a read operation from any server in the ensemble. A write operation, however, must go through the ZooKeeper leader and requires a majority consensus to succeed.
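The majority requirement for writes comes down to simple arithmetic. The helper names below are hypothetical; the sketch assumes that "majority" means strictly more than half of the servers in the ensemble.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def write_succeeds(ensemble_size, acks):
    """A write is accepted once a majority of servers has acknowledged it."""
    return acks >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size):
    """Number of servers that can fail while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


five_node_quorum = quorum_size(5)
five_node_failures = tolerated_failures(5)
six_node_failures = tolerated_failures(6)
```

Under this assumption a six-server ensemble tolerates no more failures than a five-server one, which is why ensembles are usually sized with an odd number of servers.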
ZooKeeper provides a flexible coordination infrastructure for distributed environments. The ZooKeeper framework supports many of today's major industrial applications.
• Yahoo!
Yahoo! originally built the ZooKeeper framework. A well-designed distributed application needs to meet requirements such as data transparency, better performance, robustness, centralized configuration, and coordination, so the framework was designed to meet them.
• Apache Hadoop
Apache Hadoop is the driving force behind the growth of the Big Data industry. Hadoop relies on ZooKeeper for configuration management and coordination. ZooKeeper provides the facilities for cross-node synchronization and ensures that tasks across Hadoop projects are serialized and synchronized.
Multiple ZooKeeper servers support large Hadoop clusters. Each client machine communicates with one of the ZooKeeper servers to retrieve and update its synchronization information.
Some real-world examples are:
Human Genome Project − The Human Genome Project contains terabytes of data. The Hadoop MapReduce framework can be used to analyze the dataset and find interesting facts for human development.
Healthcare − Hospitals can store, retrieve, and analyze huge sets of patient medical records, which are normally in terabytes.
• Apache HBase
Apache HBase is an open-source, distributed NoSQL database that provides real-time read/write access to large datasets and runs on top of HDFS. HBase follows a master-slave architecture in which the HBase Master governs all the slaves. The slaves are referred to as region servers.
A distributed HBase installation depends on a running ZooKeeper cluster. Apache HBase uses ZooKeeper to track the status of distributed data throughout the master and region servers with the help of centralized configuration management and distributed mutex mechanisms.
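A distributed mutex of the kind mentioned above is commonly built on ZooKeeper from sequential ephemeral nodes: each contender creates a node with an increasing sequence number, and the contender with the lowest number holds the lock. The sketch below models that idea in memory. The LockQueue class and the client names are hypothetical, and nothing here shows how HBase itself implements its locking.

```python
class LockQueue:
    """Hypothetical in-memory model of a lock built from sequential nodes."""

    def __init__(self):
        self._next_seq = 0
        self._waiters = {}   # sequence number -> client name

    def acquire(self, client):
        # Each contender creates a node with the next sequence number.
        seq = self._next_seq
        self._next_seq += 1
        self._waiters[seq] = client
        return seq

    def holder(self):
        # The contender with the lowest sequence number holds the lock.
        if not self._waiters:
            return None
        return self._waiters[min(self._waiters)]

    def release(self, seq):
        # Deleting the node (or losing the session) passes the lock on.
        self._waiters.pop(seq, None)


lock = LockQueue()
first_ticket = lock.acquire("client-a")
second_ticket = lock.acquire("client-b")
first_holder = lock.holder()
lock.release(first_ticket)
second_holder = lock.holder()
```

Because the nodes are ephemeral in the real recipe, a crashed lock holder releases the lock automatically when its session ends.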
Here are some of the use-cases of HBase −
• Telecom − The telecom industry stores billions of mobile call records (around 30 TB per month), and accessing these call records in real time becomes a huge task. HBase can be used to process all the records in real time, easily and efficiently.
• Social network − Similar to the telecom industry, sites like Twitter, LinkedIn, and Facebook receive huge volumes of data through the posts created by users. HBase can be used to find recent trends and other interesting facts.
• Apache Solr
Apache Solr is a fast, open-source search platform written in Java. Built on top of Lucene, it is a fault-tolerant, distributed, full-featured text search engine.
Solr makes extensive use of ZooKeeper features such as configuration management, leader election, node management, locking, and synchronization of data.
ZooKeeper contributes the following features −
• Add / remove nodes as and when needed
• Replication of data between nodes and subsequently minimizing data loss
• Sharing of data between multiple nodes and subsequently searching from multiple nodes for faster search results
Some of the use-cases of Apache Solr include e-commerce, job search, etc.