
Monday, 6 March 2023

Hadoop Archives

HAR (Hadoop Archive) files address the small-files problem. HAR introduces a layer on top of HDFS that provides an interface for file access. HAR files are created with the hadoop archive command, which runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS, and can in fact be slower, since each HAR file access requires reading two index files as well as the data file.

Sequence files also address the small-files problem. Here we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to pack them into a single sequence file and then process that file in a streaming fashion, as sketched below.
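A minimal sketch of such a packing program, using Hadoop's Java SequenceFile API (the class name, argument layout, and paths are illustrative, not from the original post):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    // args[0]: local directory of small files; args[1]: output sequence file path
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(args[1])),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            for (File f : new File(args[0]).listFiles()) {
                // key = filename, value = raw file contents
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}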

Hadoop archives are special-format archives. A Hadoop archive maps to a file system directory and always has a *.har extension.
 
How to Create an Archive 
hadoop archive -archiveName name -p <parent> <src>* <dest>
 
-archiveName is the name of the archive you would like to create, for example foo.har. The name should have a *.har extension. The -p argument specifies the parent path relative to which the source paths are given.
 
An example would be:
 
-p /foo/bar a/b/c e/f/g 
 
Here /foo/bar is the parent path, and a/b/c and e/f/g are source paths relative to that parent.
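 
Combining these with a destination (the /outputdir directory here is hypothetical), the complete command would be:
 
hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir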
 
Note that it is a MapReduce job that creates the archives, so you need a MapReduce cluster to run the command. A worked example follows in a later section.
 
If you just want to archive a single directory /foo/bar then you can just use 
hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
 
How to Look Up Files in Archives 
The archive exposes itself as a file system layer, so all the fs shell commands work on archives, but with a different URI. Also note that archives are immutable, so renames, deletes and creates return an error. The URI (Uniform Resource Identifier) for Hadoop Archives is
 
har://scheme-hostname:port/archivepath/fileinarchive 
 
If no scheme is provided, it assumes the underlying filesystem. In that case the URI would look like
 
har:///archivepath/fileinarchive
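 
For example, assuming HDFS with a NameNode at namenode:8020 and a hypothetical archive /user/zoo/foo.har, the two forms would be:
 
har://hdfs-namenode:8020/user/zoo/foo.har/dir/file
har:///user/zoo/foo.har/dir/file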
 
Example on Hadoop Archive: 
Step 1: create a folder rkhar and, inside it, two files ram and sita containing some text
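One way to create these locally (the file contents here are arbitrary placeholders):
hduser@ubuntu:~$ mkdir rkhar
hduser@ubuntu:~$ echo "some text for ram" > rkhar/ram
hduser@ubuntu:~$ echo "some text for sita" > rkhar/sita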
Step 2: copy the folder rkhar into the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -copyFromLocal rkhar / 
Step 3: verify the copied folder rkhar in the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -lsr /rkhar
Step 4: create a HAR file (here filerk.har) and place it in another folder (here /ammu) created under the distributed file system, using the command
hduser@ubuntu:~$hadoop archive -archiveName filerk.har -p /rkhar /ammu
Step 5: now, to see the archive file, we use the following command
hduser@ubuntu:~$hadoop fs -lsr  har:///ammu/filerk.har
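 
To read one of the archived files directly, a command along these lines should work (using the file names from step 1):
hduser@ubuntu:~$hadoop fs -cat har:///ammu/filerk.har/ram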

Note: the acronym HAR is also used for HTTP Archive, a JSON-formatted file format for logging a web browser's interaction with a site; that format is unrelated to Hadoop Archives.
Step 6: stop all the daemons
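Assuming a standard single-node installation, the daemons can be stopped with:
hduser@ubuntu:~$ stop-dfs.sh
hduser@ubuntu:~$ stop-yarn.sh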
 
HAR limitations:
  • Hadoop Archives currently do not support compression.
  • Hadoop Archives are immutable: files cannot be added or removed once the archive is created.
distcp:
This command copies files from one Hadoop file system to another; the copy is done in parallel, as a MapReduce job.
Syntax:
hadoop distcp hdfs://namenode1/sourcefolder hdfs://namenode2/destfolder
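 
For example, with hypothetical NameNodes nn1 and nn2 and hypothetical paths:
 
hadoop distcp hdfs://nn1:8020/user/hduser/rkhar hdfs://nn2:8020/user/hduser/backup
 
Flags such as -update and -overwrite control how files that already exist at the destination are handled.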
