
Monday 3 May 2021

Moving data IN and OUT of Hadoop:

FILE READ OPERATION: [diagram of the HDFS file read path not reproduced here]

FILE WRITE OPERATION: [diagram of the HDFS file write path not reproduced here]

Moving data in and out of Hadoop is referred to as data ingress and egress: the process by which data is transported from an external system into an internal system, and vice versa.

Hadoop supports ingress and egress at a low level in HDFS and MapReduce. Files can be moved in and out of HDFS, and data can be pulled from external data sources and pushed to external data sinks using MapReduce.
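
As a simple illustration, here is a minimal sketch of a file write into HDFS followed by a file read, using the Hadoop FileSystem API. The cluster configuration is picked up from the classpath, and the paths used here are illustrative assumptions, not part of the original post.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // Ingress: copy a local file into HDFS (file write operation).
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/data/input.txt"));

        // Egress: read the file back out of HDFS (file read operation).
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/data/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}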


Key elements of ingress and egress:
Moving large quantities of data in and out of Hadoop has logistical challenges that include consistency guarantees and resource impacts on data sources and destinations.
IDEMPOTENCE
An idempotent operation produces the same result no matter how many times it’s executed.
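
For example, an ingress job that publishes its output with a single rename can safely be re-run after a failure. The sketch below is only an illustration; the HDFS paths are assumed and the actual data-writing step is omitted.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IdempotentCopy {
    // Write to a temporary path first, then publish with a single rename.
    // Re-running after a failure leaves exactly one copy of the output,
    // never a partial or duplicate one.
    static void publish(FileSystem fs, Path tmp, Path finalPath) throws Exception {
        if (fs.exists(finalPath)) {
            return;                    // already ingested; a repeated run is a no-op
        }
        // ... write the data to tmp here (omitted) ...
        fs.rename(tmp, finalPath);     // single publish step
    }
}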
AGGREGATION
The data aggregation process combines multiple data elements. In the context of data ingress this can be useful because moving large quantities of small files into HDFS potentially translates into NameNode memory woes, as well as slow MapReduce execution times. Having the ability to aggregate files or data together mitigates this problem, and is a feature to consider.
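
One common way to aggregate small files is to pack them into a single SequenceFile, using each file name as the key and its contents as the value. The sketch below assumes the small files live under an illustrative /data/small-files directory.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileAggregator {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/data/aggregated.seq");      // assumed output path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(new Path("/data/small-files"))) {
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    IOUtils.readFully(in, contents, 0, contents.length);
                }
                // One record per small file: file name as key, raw bytes as value.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        }
    }
}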
DATA FORMAT TRANSFORMATION
The data format transformation process converts one data format into another.
RECOVERABILITY
Recoverability allows an ingress or egress tool to retry in the event of a failed operation. Because it’s unlikely that any data source, sink, or Hadoop itself can be 100 percent available, it’s important that an ingress or egress action be retried in the event of failure.
CORRECTNESS
In the context of data transportation, checking for correctness is how you verify that no data corruption occurred while the data was in transit. Common methods for checking the correctness of raw data, such as that written to storage devices, include cyclic redundancy checks (CRCs), which are what HDFS uses internally to maintain block-level integrity.
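
As a small, generic illustration of the same idea, an ingress tool could use the JDK's CRC32 class and compare a checksum computed at the source with one computed at the destination. The file name below is a placeholder; this is not how HDFS performs its own block checksumming.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class CrcCheck {
    // Compute the CRC32 of a local file; matching values at source and
    // destination indicate the data was not corrupted in transit.
    static long crcOf(String file) throws Exception {
        try (CheckedInputStream in = new CheckedInputStream(
                Files.newInputStream(Paths.get(file)), new CRC32())) {
            byte[] buf = new byte[8192];
            while (in.read(buf) != -1) {
                // stream through the file to update the checksum
            }
            return in.getChecksum().getValue();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(crcOf("/tmp/input.txt"));   // placeholder path
    }
}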
RESOURCE CONSUMPTION AND PERFORMANCE
Resource consumption and performance are measures of system resource utilization and system efficiency respectively. Ingress and egress tools don’t typically incur significant load (resource consumption) on a system, unless you have appreciable data volumes. For performance, the questions to ask include whether the tool performs ingress and egress activities in parallel, and if so, what mechanisms it provides to tune the amount of parallelism. For example, if your data source is a production database, don’t use a large number of concurrent map tasks to import data.
MONITORING
Monitoring ensures that functions are performing as expected in automated systems. For data ingress and egress, monitoring breaks down into two elements: ensuring that the process(es) involved in ingress and egress are alive, and validating that source and destination data are being produced as expected.

Data Serialization:
Serialization is the process of translating data structures or object state into binary or textual form, either to transport the data over a network or to store it on persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again.

Serialization in Hadoop:
Generally, in distributed systems like Hadoop, serialization is used for inter-process communication and persistent storage.
Inter-Process Communication
  • Serialization is used to establish inter-process communication between the nodes connected in a network, for example over Hadoop's RPC mechanism.
Persistent Storage
  • Persistent storage is digital storage that does not lose its data when the power supply is lost; serialized data is written to such storage so it can be read back later.
Serializing Data in Hadoop:
The procedure to serialize an integer value is as follows (a runnable sketch appears after these steps):
  • Instantiate IntWritable class by wrapping an integer value in it.
  • Instantiate ByteArrayOutputStream class.
  • Instantiate DataOutputStream class and pass the object of ByteArrayOutputStream class to it.
  • Serialize the integer value held in the IntWritable object using its write() method. This method needs an object of the DataOutputStream class.
  • The serialized data is stored in the ByteArrayOutputStream object that was passed to the DataOutputStream constructor at the time of instantiation; call its toByteArray() method to obtain the serialized bytes.
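
A minimal sketch of these steps follows; the class name and the example value are illustrative.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class IntWritableSerialization {
    public static byte[] serialize(int value) throws IOException {
        IntWritable intWritable = new IntWritable(value);                 // wrap the integer
        ByteArrayOutputStream byteStream = new ByteArrayOutputStream();   // holds the serialized bytes
        DataOutputStream dataOut = new DataOutputStream(byteStream);      // wraps the byte stream
        intWritable.write(dataOut);                                       // serialize the value
        return byteStream.toByteArray();                                  // convert to a byte array
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize(42);
        System.out.println("Serialized length: " + data.length + " bytes");  // 4 bytes for an int
    }
}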
Advantage:
  • Hadoop's Writable-based serialization reduces object-creation overhead by reusing Writable objects, which is not possible with Java's native serialization framework.
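
A sketch of the deserialization side shows this reuse: a single IntWritable is filled again and again by readFields(), rather than a new object being created for every record. The records array is assumed to hold bytes produced as in the serialization sketch above.

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class IntWritableReuse {
    public static void main(String[] args) throws IOException {
        byte[][] records = { };                      // assumed: serialized IntWritable records
        IntWritable reused = new IntWritable();      // one object reused for every record
        for (byte[] record : records) {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            reused.readFields(in);                   // deserialization fills the same object
            System.out.println(reused.get());
        }
    }
}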
