Introduction: Apache Hive is one of the technologies used to address data-processing requirements at Facebook. It runs thousands of jobs on the cluster for hundreds of users, across a wide variety of applications.
The Hive-Hadoop cluster at Facebook stores more than 2 PB of raw data and loads roughly 15 TB of new data daily.
Hive is an open-source data warehouse system built on top of Hadoop for querying and analyzing large datasets stored in Hadoop files. It processes structured and semi-structured data in Hadoop. Instead of writing complex MapReduce jobs, users merely submit SQL-like queries.
Hive is mainly targeted at users who are comfortable with SQL. Hive uses a language called HiveQL (HQL), which is similar to SQL; Hive automatically translates these SQL-like queries into MapReduce jobs. Apache Hive organizes data into tables, which provides a means of attaching structure to data stored in HDFS.
Apache Hive saves developers from writing complex Hadoop MapReduce jobs for ad-hoc requirements: it provides data summarization, analysis, and querying. Hive is fast, scalable, and highly extensible. Because HiveQL is similar to SQL, SQL developers find it easy to learn and write Hive queries.
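As a sketch of what this looks like in practice, the HiveQL below defines a table over data already sitting in HDFS and loads a file into it. The table name, columns, and file path are hypothetical, chosen only for illustration:

```sql
-- Define a table over tab-separated data in HDFS (schema-on-read).
CREATE TABLE IF NOT EXISTS page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  page_url  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Attach an existing HDFS file to the table.
LOAD DATA INPATH '/data/page_views.tsv' INTO TABLE page_views;
```

No MapReduce code is written by hand; Hive compiles any subsequent queries over this table into MapReduce jobs itself.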
Apache Hive is an ETL (extract, transform, load) and data warehousing tool built on top of Hadoop. It makes it easy to perform operations such as:
• Analysis of huge datasets
• Ad-hoc queries
• Data encapsulation
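For example, an ad-hoc analysis over a table like the hypothetical page_views above reduces to a single declarative query, which Hive compiles into a MapReduce job behind the scenes:

```sql
-- Top ten most-viewed pages; Hive turns this into a MapReduce job.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;
```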
Difference between Hive and SQL:
• Traditional SQL databases enforce schema-on-write, validating data as it is inserted; Hive applies schema-on-read, checking data against the table schema only when it is queried.
• SQL databases are designed for transactional (OLTP) workloads with fast row-level inserts, updates, and deletes; Hive is designed for batch analytics (OLAP), and row-level updates are limited.
• SQL queries typically run interactively with low latency; HiveQL queries are compiled into MapReduce jobs, so even small queries carry significant job-startup latency.
Features of Hive:
• It stores the schema in a database (the metastore) and the processed data in HDFS.
• It is designed for OLAP workloads.
• It provides an SQL-like query language called HiveQL (HQL).
• It is familiar, fast, scalable, and extensible.
HIVE ARCHITECTURE:
COMPONENTS USED IN HIVE:
Hive Clients:
Apache Hive supports applications written in languages such as C++, Java, and Python using the JDBC, Thrift, and ODBC drivers. Thus, one can easily write a Hive client application in the language of one's choice.
Hive supports several kinds of client applications for running queries, categorized into three types:
Thrift Clients – Since the Apache Hive server is based on Thrift (a framework for scalable cross-language service development), it can serve requests from any language that supports Thrift.
JDBC (Java Database Connectivity) Clients – Apache Hive allows Java applications to connect to it using the JDBC driver, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC (Open Database Connectivity) Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
Hive Services:
Hive provides various services, such as a web interface and a CLI, for running queries.
CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute your Hive queries and commands directly.
Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
Hive Server – It is built on Apache Thrift and is therefore also called the Thrift Server. It allows different clients to submit requests to Hive and retrieve the final result.
Hive Driver – The driver is responsible for receiving the queries submitted by a Hive client through the Thrift, JDBC, ODBC, CLI, or web UI interfaces.
Compiler – The driver then passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema stored in the metastore.
Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
Executor – Once compilation and optimization complete, the execution engine executes these tasks in the order of their dependencies using Hadoop.
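One way to watch the compiler and optimizer at work is Hive's EXPLAIN statement, which prints the plan (the DAG of stages and their dependencies) without executing it. A minimal sketch, again assuming the hypothetical page_views table:

```sql
-- Print the stage plan the compiler and optimizer produce for a query.
EXPLAIN
SELECT page_url, COUNT(*)
FROM page_views
GROUP BY page_url;
```

The output lists the stages (for example, a map-reduce stage) and which stages depend on which, which is exactly the ordering the executor follows.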
Metastore – The metastore is the central repository of metadata in the Hive architecture. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and provides client access to this information via the metastore service API.
Hive metastore consists of two fundamental units:
• A service that provides metastore access to other Apache Hive services.
• Disk storage for the Hive metadata which is separate from HDFS storage.
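As an illustrative sketch, pointing the metastore at an external relational database is typically configured in hive-site.xml. The host name, database name, and driver below are placeholders, not values from this article:

```xml
<!-- Hypothetical hive-site.xml fragment: keep Hive metadata in MySQL. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```

Keeping this metadata in a relational database, separate from HDFS, is what the second bullet above refers to.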