
Monday, 6 March 2023

Hadoop Archives

HAR (Hadoop Archive) files address the small-files problem. HAR introduces a layer on top of HDFS that provides an interface for file access. HAR files are created with the hadoop archive command, which runs a MapReduce job to pack the archived files into a smaller number of HDFS files. Reading through files in a HAR is no more efficient than reading through files in HDFS, and can in fact be slower, since each HAR file access requires reading two index files as well as the data file.

Sequence files also address the small-files problem. Here we use the filename as the key and the file contents as the value. If we have 10,000 files of 100 KB each, we can write a program to pack them into a single sequence file and then process that file in a streaming fashion, as sketched below.
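A minimal sketch of such a packing program, using Hadoop's Java SequenceFile API (the class name, argument layout, and paths are illustrative, not from the original post):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    // args[0]: local directory of small files; args[1]: output sequence file path
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path(args[1])),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class));
            for (File f : new File(args[0]).listFiles()) {
                // key = filename, value = raw file contents
                byte[] contents = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(contents));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}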

Hadoop archives are special-format archives. A Hadoop archive maps to a file system directory and always has a *.har extension.
 
How to Create an Archive 
hadoop archive -archiveName name -p <parent> <src>* <dest>
 
-archiveName is the name of the archive you would like to create, for example foo.har. The name should have a *.har extension. The -p argument specifies the parent path relative to which the source paths are given.
 
An example would be:
 
-p /foo/bar a/b/c e/f/g 
 
Here /foo/bar is the parent path, and a/b/c and e/f/g are source paths relative to that parent.
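 
Combining these with a destination (the /outputdir directory here is hypothetical), the complete command would be:
 
hadoop archive -archiveName foo.har -p /foo/bar a/b/c e/f/g /outputdir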
 
Note that it is a MapReduce job that creates the archives, so you need a MapReduce cluster to run the command. A worked example follows in a later section.
 
If you just want to archive a single directory /foo/bar then you can just use 
hadoop archive -archiveName zoo.har -p /foo/bar /outputdir
 
How to Look Up Files in Archives 
The archive exposes itself as a file system layer, so all the fs shell commands work on archives, but with a different URI. Also note that archives are immutable, so renames, deletes and creates return an error. The URI (Uniform Resource Identifier) for Hadoop Archives is
 
har://scheme-hostname:port/archivepath/fileinarchive 
 
If no scheme is provided, it assumes the underlying filesystem. In that case the URI would look like
 
har:///archivepath/fileinarchive
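 
For example, assuming HDFS with a NameNode at namenode:8020 and a hypothetical archive /user/zoo/foo.har, the two forms would be:
 
har://hdfs-namenode:8020/user/zoo/foo.har/dir/file
har:///user/zoo/foo.har/dir/file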
 
Example on Hadoop Archive: 
Step 1: create a folder rkhar and, inside it, two files ram and sita containing some text
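One way to create these locally (the file contents here are arbitrary placeholders):
hduser@ubuntu:~$ mkdir rkhar
hduser@ubuntu:~$ echo "some text for ram" > rkhar/ram
hduser@ubuntu:~$ echo "some text for sita" > rkhar/sita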
Step 2: copy the folder rkhar into the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -copyFromLocal rkhar / 
Step 3: verify the copied folder rkhar in the Hadoop distributed file system using the command
hduser@ubuntu:~$hadoop fs -lsr /rkhar
Step 4: create a HAR file (here filerk.har) and place it in another folder (here /ammu) created under the distributed file system, using the command
hduser@ubuntu:~$hadoop archive -archiveName filerk.har -p /rkhar /ammu
Step 5: now, to see the archive file, we use the following command
hduser@ubuntu:~$hadoop fs -lsr  har:///ammu/filerk.har
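 
To read one of the archived files directly, a command along these lines should work (using the file names from step 1):
hduser@ubuntu:~$hadoop fs -cat har:///ammu/filerk.har/ram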

Note: the acronym HAR is also used for HTTP Archive, a JSON-formatted file format for logging a web browser's interaction with a site; that format is unrelated to Hadoop Archives.
Step 6: stop all the daemons
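Assuming a standard single-node installation, the daemons can be stopped with:
hduser@ubuntu:~$ stop-dfs.sh
hduser@ubuntu:~$ stop-yarn.sh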
 
HAR limitations:
  • Hadoop Archives currently do not support compression.
  • Hadoop Archives are immutable: files cannot be added or removed once the archive is created.
distcp:
This command copies files from one Hadoop file system to another; the copy is done in parallel, as a MapReduce job.
Syntax:
hadoop distcp hdfs://namenode1/sourcefolder hdfs://namenode2/destfolder
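 
For example, with hypothetical NameNodes nn1 and nn2 and hypothetical paths:
 
hadoop distcp hdfs://nn1:8020/user/hduser/rkhar hdfs://nn2:8020/user/hduser/backup
 
Flags such as -update and -overwrite control how files that already exist at the destination are handled.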
