
Sunday, 29 January 2023

Comparing SQL Databases and Hadoop


 

Feature          Hadoop                               SQL
Data Size        Petabytes                            Gigabytes
Access           Batch                                Interactive & batch
Updates          Write once, read multiple times      Read & write multiple times
Structure        Dynamic schema                       Static schema
Integrity        Low                                  High
Scaling          Linear                               Non-linear


1. SCHEMA ON WRITE VS READ:
A traditional database generally follows a schema-on-write approach: during a data load or migration from one database to another, the load process can exit or abort and records are rejected whenever the structure of the source and target tables differs.
In a Hadoop system, by contrast, all data is stored centrally in HDFS. Because the Hadoop framework is mainly used for data analytics, it supports all three categories of data, i.e. structured, semi-structured and unstructured, and enables a schema-on-read approach.
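To make the schema-on-read idea concrete, here is a minimal Java sketch (the field layout is a made-up example) of a mapper that imposes the structure only at read time, while the raw delimited text sits unchanged in HDFS:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Schema-on-read sketch: the file in HDFS is just raw comma-separated text;
// the structure (id, name, amount) is applied here, only when the data is read.
public class SchemaOnReadMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        if (fields.length < 3) {
            return;                 // malformed rows are skipped, not rejected at load time
        }
        String id = fields[0];
        String amount = fields[2];
        context.write(new Text(id), new Text(amount));
    }
}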

2. SCALABILITY & COST:
The Hadoop framework is designed to process large volumes of data. Whenever the data grows, additional resources such as data nodes can be added to the cluster far more easily than in the traditional approach of allocating memory and storage statically. The time and budget needed to implement this are comparatively small, and Hadoop also provides data locality, where the data is made available on the node that executes the job.

3. FAULT TOLERANCE:
In a traditional RDBMS, when data is lost due to corruption or a network issue, getting it back takes considerable time, cost and resources. Hadoop, however, replicates every block stored in HDFS, with a replication factor of three by default. If one of the data nodes holding a copy fails, the data can easily be pulled from another data node, so it remains readily available to users irrespective of the failure.
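As a rough sketch, the replication factor of a file already in HDFS can be inspected or changed through the Java FileSystem API (the file path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: check and change the replication factor of an HDFS file
// (the path is hypothetical; HDFS keeps 3 copies of each block by default).
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sales/2023/part-00000");
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("Current replication: " + current);
        fs.setReplication(file, (short) 3);   // ask HDFS to keep three copies
    }
}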

4. FUNCTIONAL PROGRAMMING:
Hadoop supports writing functional-style programs in languages such as Java, Scala and Python. Any application that requires additional functionality can implement it by registering UDFs (User Defined Functions) with tools such as Hive or Pig, keeping the UDF jars alongside the data in HDFS.
In a traditional RDBMS this kind of extensibility is far more limited, which increases the complexity of the SQL that has to be written.
Moreover, the data stored in HDFS can be accessed by the whole Hadoop ecosystem, including Hive, Pig, Sqoop and HBase, so once a UDF is written it can be used by any of the above-mentioned applications.
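As an illustration, a minimal Hive UDF written in Java could look like the sketch below; the class name and the registration statements in the comments are hypothetical examples.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal Hive UDF sketch: upper-cases a string column.
// After building it into a jar, it can be registered from Hive with
//   ADD JAR /path/to/udfs.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'UpperCaseUDF';
// and then used in queries like any built-in function.
public class UpperCaseUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;                      // preserve SQL NULL semantics
        }
        return new Text(input.toString().toUpperCase());
    }
}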

5. OPTIMIZATION:
Hadoop stores data in HDFS and processes it through MapReduce with a number of optimization techniques.
The most popular techniques for handling the stored data are partitioning and bucketing.
Partitioning is an approach for storing the data in HDFS split by the values of the column chosen as the partition key. When data is loaded into HDFS, the partition column is identified and the data is pushed into the corresponding partition directory, so a query can fetch its result set directly from the partitioned directories. This avoids a full table scan, improves response time and reduces latency.
Another approach is bucketing of the data, which lets the analyst distribute the data evenly among the data nodes so that every node holds roughly the same amount. The bucketing column is chosen with enough cardinality to spread the rows evenly across the buckets, as in the sketch below. These approaches are not available in the traditional SQL method.
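Conceptually, bucketing assigns each row to a bucket by hashing the bucketing column and taking the hash modulo the number of buckets. A custom MapReduce Partitioner does the same kind of routing; a rough Java sketch (the class name is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Rough sketch of bucket-style routing: each record goes to a bucket (reducer)
// chosen by hashing its key modulo the number of buckets, which mirrors how
// Hive decides which bucket file a row belongs to.
public class BucketingPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numBuckets) {
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }
}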

6. DATA TYPE:
In a traditional approach, the supported data types are fairly limited and only structured data fits well, so simply cleaning and reshaping the data to match the schema takes more time.
Hadoop, on the other hand, supports complex data types such as Array, Struct and Map, which encourages many different kinds of datasets to be used for a data load. For example, XML data can be loaded by modelling the XML elements with these complex types.
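On the MapReduce side of the framework, a loose analog of such a complex value is a Writable container type; the small sketch below (class and field names are made up) carries a map of named counters as a single value instead of one flat column:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

// Sketch: a MapWritable bundles several named fields into one value,
// loosely comparable to a Map-typed column in Hive.
public class ComplexValueExample {
    public static void main(String[] args) {
        MapWritable pageStats = new MapWritable();
        pageStats.put(new Text("clicks"), new IntWritable(42));
        pageStats.put(new Text("views"), new IntWritable(1000));
        System.out.println(pageStats.get(new Text("clicks")));   // prints 42
    }
}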

7. DATA COMPRESSION:
Traditional database systems ship with comparatively few built-in compression techniques, whereas the Hadoop framework offers many compression codecs such as gzip, bzip2, LZO and Snappy.
The default compression mode is LZ4. Compression makes tables occupy much less space, increases throughput and speeds up query execution.
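For example, output compression for a MapReduce job can be switched on from the Java driver; a minimal sketch, assuming the Snappy codec is installed on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of a driver that turns on compressed job output
// (assumes the Snappy native libraries are available on the cluster).
public class CompressedOutputDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "compressed-output-example");
        // Compress the final output files written by the reducers.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        // ... set mapper, reducer, input/output paths, then job.waitForCompletion(true)
    }
}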
