HAR (Hadoop Archive) files deal with the small files problem. HAR introduces a layer on top of HDFS that provides an interface for file access. Using the hadoop archive command we can create HAR files; this command runs a MapReduce job to pack the archived files into a smaller
number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS; in fact it can be slower, since each HAR file access requires reading two index files as well as the data file.
Sequence Files also deal with the small files problem. Here we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to put them into a single sequence file, as in the sketch below, and then process them in a streaming fashion.
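A minimal sketch of such a program, using Hadoop's SequenceFile API (the local input directory smallfiles and the HDFS output path /data/smallfiles.seq are only placeholders):

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("/data/smallfiles.seq"); // placeholder output path
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // placeholder local directory holding the small files
            for (File f : new File("smallfiles").listFiles()) {
                byte[] contents = Files.readAllBytes(f.toPath());
                // key = filename, value = file contents
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        }
    }
}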
Hadoop archives are special-format archives. A Hadoop archive maps to a file system directory and always has a *.har extension.
How to Create an Archive
hadoop archive -archiveName name -p <parent> <src>* <dest>
-archiveName is the name of the archive you would like to create.
An example would be foo.har. The name should have a *.har extension.
The parent argument specifies the path relative to which the files should be archived.
An example would be:
-p /foo/bar a/b/c e/f/g
Here /foo/bar is the parent path, and a/b/c and e/f/g are source paths relative to that parent.
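Putting it together, a complete command for this case would look like the following (foo.har and /outputdir are placeholder names):
hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir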
Note that it is a Map/Reduce job that creates the archives, so you would need a MapReduce cluster to run this command. See the example in the later sections.
If you just want to archive a single directory /foo/bar then you can just use
hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
How to Look Up Files in Archives
The archive exposes itself as a file system layer, so all of the fs shell
commands work on archives, but with a different URI. Also, note that
archives are immutable, so renames, deletes and creates return
an error. The URI (Uniform Resource Identifier) for Hadoop Archives is
har://scheme-hostname:port/archivepath/fileinarchive
If no scheme is provided, it assumes the underlying filesystem. In that case the URI would look like
har:///archivepath/fileinarchive
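For example, to list the files inside the zoo.har archive created above in /outputdir:
hadoop fs -ls har:///outputdir/zoo.har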
Example of a Hadoop Archive:
Step 1: Create a folder rkhar and, inside that folder, create two files ram and sita with some text.
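For example (the file contents here are arbitrary):
hduser@ubuntu:~$mkdir rkhar
hduser@ubuntu:~$echo "hello ram" > rkhar/ram
hduser@ubuntu:~$echo "hello sita" > rkhar/sita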
Step 2: Copy the folder rkhar into the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -copyFromLocal rkhar /
Step 3: Check the copied folder rkhar in the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -lsr /rkhar
Step 4: Create a HAR (Hadoop archive) file, here filerk.har, and place it in another folder (ammu) created under the distributed file system, using the command
hduser@ubuntu:~$hadoop archive -archiveName filerk.har -p /rkhar /ammu
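Internally, the archive is just a directory containing two index files (_masterindex and _index) and the packed data files (part-*), which is why each file access involves the index reads mentioned earlier. Listing the archive directory directly (not through the har:// scheme) shows this layout:
hduser@ubuntu:~$hadoop fs -ls /ammu/filerk.har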
Step 5: Now to see the archive file we use the following command
hduser@ubuntu:~$hadoop fs -lsr har:///ammu/filerk.har
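An individual file inside the archive can also be read with an ordinary fs command, for example:
hduser@ubuntu:~$hadoop fs -cat har:///ammu/filerk.har/ram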
Note: The acronym HAR is also used for HTTP Archive, a JSON-formatted file format for logging a web browser's interaction with a site; that format is unrelated to Hadoop Archives.
Step 6: Stop all the daemons.
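On this single-node setup, for example:
hduser@ubuntu:~$stop-dfs.sh
hduser@ubuntu:~$stop-yarn.sh
(On older Hadoop releases, stop-all.sh stops everything in one go.)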
HAR LIMITATIONS:
- Hadoop Archives currently do not support compression
- Hadoop Archives are immutable
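Because of the immutability, any attempt to modify an archived file returns an error; for example, a delete such as the following would fail:
hduser@ubuntu:~$hadoop fs -rm har:///ammu/filerk.har/ram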
distcp:
This command is used to copy files from one Hadoop file system to another. The copy is done in parallel, since distcp is itself implemented as a MapReduce job.
Syntax:
hadoop distcp hdfs://namenode1/sourcefolder hdfs://namenode2/destfolder
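For example, to copy a folder between two clusters (nn1, nn2 and the paths are placeholders; 8020 is the common NameNode RPC port):
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo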