Sunday 29 January 2023

Comparing SQL Databases and Hadoop


 

             Hadoop                           SQL
Data Size    Petabytes                        Gigabytes
Access       Batch                            Interactive & batch
Updates      Write once, read many times      Read & write many times
Structure    Dynamic schema                   Static schema
Integrity    Low                              High
Scaling      Linear                           Non-linear


1. SCHEMA ON WRITE VS READ:
A traditional database follows a schema-on-write approach: during a data load or migration from one database to another, the load can be aborted and records rejected whenever the structure of the source and target tables differs.
In a Hadoop system, by contrast, all the data is stored centrally in HDFS. The Hadoop framework is mainly used for data analytics, so it supports all three categories of data, i.e. structured, semi-structured and unstructured, and it enables a schema-on-read approach.
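A minimal HiveQL sketch of schema on read, assuming a hypothetical HDFS directory /data/raw/orders that already holds comma-delimited text files; the table definition only describes how to interpret the files at query time, the data itself is never rewritten or validated on load:

    -- Schema on read: the files already sit in HDFS, Hive just overlays a structure on them.
    CREATE EXTERNAL TABLE orders_raw (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_date  STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/data/raw/orders';

    -- The schema is applied only now, when the data is read.
    SELECT customer_id, SUM(amount) FROM orders_raw GROUP BY customer_id;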

2. SCALABILITY & COST:
The Hadoop framework is designed to process large volumes of data. Whenever the data grows, additional resources such as data nodes can be added to the cluster far more easily than under the traditional approach of static capacity allocation. The time and budget needed to scale out are comparatively small, and Hadoop also provides data locality: the job is scheduled on the node that already holds the data, instead of moving the data to the computation.

3. FAULT TOLERANCE:
In a traditional RDBMS, when data is lost through corruption or a network issue, recovering it takes considerable time, cost and resources. Hadoop, by contrast, replicates every block stored in HDFS; the default replication factor is three. If one of the data nodes holding a copy fails, the data can simply be read from another data node, so it remains available to the user irrespective of the failure.
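A small configuration sketch of the replication factor; dfs.replication in hdfs-site.xml sets the default block replication for newly written files:

    <!-- hdfs-site.xml: default replication factor for new files -->
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>

The replication of data that is already in HDFS can also be changed on the command line with hdfs dfs -setrep -w 3 <path>.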

4. FUNCTIONAL PROGRAMMING:
Hadoop supports writing application logic in languages such as Java, Scala and Python. Any additional functionality an application needs can be implemented as a UDF (User Defined Function) and registered with the query engine, for example Hive or Pig.
In a traditional RDBMS, support for user-defined functions is far more limited, which increases the complexity of the SQL that has to be written.
Moreover, the data stored in HDFS can be accessed by the whole Hadoop ecosystem, including Hive, Pig, Sqoop and HBase, so a UDF written once can be reused by any of these applications.
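A minimal HiveQL sketch of registering a UDF, assuming a hypothetical jar my-udfs.jar containing a hypothetical class com.example.hive.MaskEmail that implements the function:

    -- make the jar visible to the session, then register the function under a SQL name
    ADD JAR /tmp/my-udfs.jar;
    CREATE TEMPORARY FUNCTION mask_email AS 'com.example.hive.MaskEmail';

    -- the UDF can now be used like any built-in function
    SELECT mask_email(email) FROM customers;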

5. OPTIMIZATION:
Hadoop stores data in HDFS and processes it through MapReduce, and several optimization techniques are available on top of this.
The two most popular techniques for organising the stored data are partitioning and bucketing.
Partitioning stores the data in HDFS split by the value of the column chosen as the partition key. When data is loaded into HDFS, the partition column is evaluated and each record is pushed into the corresponding partition directory. A query that filters on the partition column then fetches its result set directly from the matching directories, which avoids a full table scan, improves response time and reduces latency.
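A minimal HiveQL sketch of partitioning, assuming a hypothetical sales table partitioned by year and a hypothetical staging table sales_staging holding the raw rows:

    -- one sub-directory per sale_year value is created under the table's HDFS location
    CREATE TABLE sales (
        sale_id   BIGINT,
        product   STRING,
        amount    DOUBLE
    )
    PARTITIONED BY (sale_year INT);

    -- load directly into one partition directory
    INSERT INTO TABLE sales PARTITION (sale_year = 2022)
    SELECT sale_id, product, amount FROM sales_staging WHERE year = 2022;

    -- only the sale_year=2022 directory is read; no full table scan
    SELECT product, SUM(amount) FROM sales WHERE sale_year = 2022 GROUP BY product;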
Another approach is bucketing. Bucketing lets the analyst distribute the data evenly across a fixed number of buckets, so that each bucket holds roughly the same number of rows. The bucketing column is normally chosen to have high cardinality, such as an ID, so that hashing its values spreads the rows uniformly. These approaches are not available in the same form in a traditional SQL database.
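A minimal HiveQL sketch of bucketing, assuming a hypothetical orders table where customer_id is a high-cardinality key:

    -- rows are hashed on customer_id into 32 bucket files of roughly equal size
    CREATE TABLE orders_bucketed (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE
    )
    CLUSTERED BY (customer_id) INTO 32 BUCKETS;

Bucketing two tables on the same column also makes efficient bucketed map joins and sampling possible.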

6. DATA TYPE:
In the traditional approach, the supported data types are quite limited and only structured data is handled, so simply cleaning and reshaping the data to fit the schema takes considerable time.
Hadoop, on the other hand, supports complex data types such as Array, Struct and Map, which encourages loading many different kinds of datasets. For example, XML data can be loaded by mapping its nested elements onto these complex types.
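A minimal HiveQL sketch of complex types, assuming a hypothetical employees dataset with nested attributes:

    CREATE TABLE employees (
        name    STRING,
        skills  ARRAY<STRING>,                      -- e.g. a list of skill names
        phone   MAP<STRING, STRING>,                -- e.g. keys 'home' and 'work'
        address STRUCT<city:STRING, zip:STRING>     -- a nested record
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    COLLECTION ITEMS TERMINATED BY '|'
    MAP KEYS TERMINATED BY ':';

    -- complex fields are addressed directly in queries
    SELECT name, skills[0], phone['work'], address.city FROM employees;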

7. DATA COMPRESSION:
Traditional database systems ship with very few built-in compression techniques. The Hadoop framework, by contrast, supports many compression codecs, such as gzip, bzip2, LZO, LZ4 and Snappy.
Compression makes tables occupy far less space, increases throughput and speeds up query execution.
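A minimal HiveQL sketch of enabling compression, assuming Snappy is installed on the cluster; the property names are standard Hadoop/Hive settings:

    -- compress the files that queries write out
    SET hive.exec.compress.output = true;
    SET mapreduce.output.fileoutputformat.compress = true;
    SET mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;

    -- columnar formats can also carry their own compression
    CREATE TABLE sales_orc (sale_id BIGINT, amount DOUBLE)
    STORED AS ORC
    TBLPROPERTIES ('orc.compress' = 'SNAPPY');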
