
Saturday, 29 May 2021

How to run WordCount.java Program on Hadoop - Conventional Method

Program: WordCount.java

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount
{
  // Mapper: tokenizes each input line and emits (word, 1) for every token
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>
  {

    private Text word = new Text();
    private final static IntWritable one = new IntWritable(1);
    public void map(Object key, Text value, Context context)
                      throws IOException, InterruptedException
    {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens())
      {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts emitted for each word
  public static class IntSumReducer
    extends Reducer<Text,IntWritable,Text,IntWritable>
  {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException
    {
      int sum = 0;
      for (IntWritable val : values)
      {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args)
     throws Exception
   {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
 

How to run WordCount.java Program on Hadoop
Step 1: Create a directory rkmapreduce on the local file system and then place the file WordCount.java in it, as shown below.
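For example (the /home/hduser path below matches the one used in the later steps; adjust it to your own user, and the copy assumes WordCount.java is in your current directory):

mkdir /home/hduser/rkmapreduce
cp WordCount.java /home/hduser/rkmapreduce/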

Step 2: Compile WordCount.java into class files using the following command

javac -classpath `hadoop classpath` -d /home/hduser/rkmapreduce/ WordCount.java

Step 3: Create a jar file wc.jar from the compiled classes in the directory rkmapreduce, as shown

jar -cvf wc.jar -C /home/hduser/rkmapreduce/ .
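Optionally, the contents of wc.jar can be listed to confirm that the class files were packaged (this only lists the archive, it does not modify it):

jar -tvf wc.jar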


Step 4: Create an input file input.txt whose words we want to count, as shown below
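For example, a small file can be created with any text editor or with echo (the contents below are purely illustrative):

echo "hello hadoop hello world hello mapreduce" > /home/hduser/rkmapreduce/input.txt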

Step 5: Now place the input file input.txt into the Hadoop cluster by creating a directory rkmr in HDFS, as shown

hdfs dfs -mkdir  /rkmr 

hdfs dfs -put /home/hduser/rkmapreduce/input.txt  /rkmr

Now we can check the content of the file /rkmr/input.txt using the command

hdfs dfs -cat /rkmr/input.txt


Step 6: Now run the job with the hadoop jar command, giving the input file and an output directory, as shown

 hadoop jar wc.jar WordCount /rkmr/input.txt /rkmr/output


Step 7: Now we can check the output of the wordcount program using the following commands

hdfs dfs -ls -R /rkmr

hdfs dfs -cat /rkmr/output/part*
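Each line of the part file contains a word and its count separated by a tab. For the illustrative input suggested in Step 4, the output would look roughly like this (actual values depend on your input file):

hadoop    1
hello    3
mapreduce    1
world    1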

Step 8: Using the fsck command, you can check the block and replication details of a file stored in HDFS
hdfs fsck /rkmr/input.txt


Tuesday, 18 May 2021

HIVE - Create Alter & Drop Tables

CREATE:
We can create a table in two ways:
1. Internal or managed table
2. External table

hive> create database rkstd;
hive> show databases; 

We can see the created database in the NameNode web UI by typing localhost:9870 in the browser's address bar.

Path to find rkstd.db: /user/hive/warehouse

hive> create table clstd(sid int,sname string)
    > row format delimited
    > fields terminated by ',';

hive> show tables;

hive> use rkstd;

hive> create table clwstd(sid int,sname string)
    > row format delimited
    > fields terminated by ',';


hive> load data local inpath '/home/rk/Desktop/cstd' into table clwstd ;

hive> select * from clwstd ;

hive> load data local inpath '/home/rk/Desktop/cstd1' into table clwstd ;

hive> select * from clwstd ;

hive> describe clwstd;    (or)    hive> desc clwstd;

hive> describe extended clwstd ;    -- shows the full table metadata

ALTER: 
alter table:

hive> alter table clstd rename to clstd_int ;

external table:

hive> create external table clwstd_ext(sno int, sname string)
    > row format delimited
    > fields terminated by ',' ;

hive> show tables ;

hive> load data local inpath '/home/rk/Desktop/cstd' into table clwstd_ext ;

hive> select * from clwstd_ext;

hive> show tables ;
OK
clwstd_ext
clwstd_int
Time taken: 0.054 seconds, Fetched: 2 row(s)

DROP:
drop table:

hive> drop table clwstd_ext ;
OK
Time taken: 0.892 seconds
hive> drop table clwstd_int ;
OK
Time taken: 0.206 seconds
hive> show tables ;
OK
Time taken: 0.024 seconds

Note: when we drop the internal (managed) table clwstd_int, Hive deletes both its metadata and its data files from the warehouse directory; when we drop the external table clwstd_ext, only its metadata is removed and its data files remain in HDFS.
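To see this, we can list the database directory under the warehouse path mentioned earlier; the directory that held the external table's data typically remains (the exact listing depends on your setup):

hdfs dfs -ls /user/hive/warehouse/rkstd.db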


 
How to view the content of a file from Hive

INSERT:
How to insert data into a table (clinst_int)
 
Method -1:
hive> create table clinst_int(sno int, sname string)
    > row format delimited
    > fields terminated by ',' ;

hive> show tables ;
hive> select * from clinst_int ;
hive> load data local inpath '/home/rk/Desktop/cstd' into table clinst_int ;
hive> select * from clinst_int;
OK
111    raju
222    rani
444    vani
Time taken: 0.169 seconds, Fetched: 3 row(s)

Method -2:

hive> use rk ;
hive> create table s_cluster ( sno int, sname string);
hive> show tables ;
hive> desc s_cluster ;
hive> insert into table s_cluster values (11,'rama');
hive> insert into table s_cluster values (12,'sita');
hive> select * from s_cluster ;

Monday, 17 May 2021

HIVE Installation on top of Hadoop

Apache Hive is an enterprise data warehouse system used to query, manage, and analyze data stored in HDFS.
 
The Hive Query Language (HiveQL) facilitates queries in a Hive command-line interface shell. Hadoop can use HiveQL as a bridge to communicate with relational database management systems and perform tasks based on SQL-like commands. 
 
To configure Apache Hive, first you need to download and unzip Hive. Then you need to customize the following files and settings:
  • Edit .bashrc file
  • Edit hive-config.sh file
  • Create Hive directories in HDFS
  • Configure hive-site.xml file
  • Initiate Derby database 

Step 1: Download and Untar Hive  

Download the compressed Hive files using the wget command:

wget https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Once the download process is complete, untar the compressed Hive package:

tar xzf apache-hive-3.1.2-bin.tar.gz

Step 2: Configure Hive Environment Variables (bashrc)

Edit the .bashrc shell configuration file using nano:

hduser@rk-virtual-machine:~$ nano .bashrc
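The Hive environment variables are typically appended at the end of .bashrc. The sketch below assumes Hive was extracted into hduser's home directory, as in the next step; adjust the path to your own setup:

export HIVE_HOME=/home/hduser/apache-hive-3.1.2-bin
export PATH=$PATH:$HIVE_HOME/bin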
 

Save and exit the .bashrc file once you add the Hive variables, then reload it so the changes take effect:

hduser@rk-virtual-machine:~$ source ~/.bashrc 

Step 3: Edit hive-config.sh file

hduser@rk-virtual-machine:~$ cd apache-hive-3.1.2-bin/
hduser@rk-virtual-machine:~/apache-hive-3.1.2-bin$ cd bin
hduser@rk-virtual-machine:~/apache-hive-3.1.2-bin/bin$ ls
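hive-config.sh lives in this bin directory. The usual edit is to append the path of the Hadoop installation to it (the path below is only an assumption; use the directory where your Hadoop is actually installed):

nano hive-config.sh

export HADOOP_HOME=/usr/local/hadoop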

 

Step 4: Create Hive Directories in HDFS

Create two separate directories to store data in the HDFS layer:

  1. The temporary tmp directory is going to store the intermediate results of Hive processes.
  2. The warehouse directory is going to store the Hive-related tables.

1.Create tmp Directory

Create a tmp directory within the HDFS storage layer. This directory is going to store the intermediary data Hive sends to the HDFS:

hdfs dfs -mkdir /tmp

Add write permission for tmp group members:

hdfs dfs -chmod g+w /tmp

Check if the permissions were added correctly:

hdfs dfs -ls /

The output confirms that group members now have write permission.


2.Create warehouse Directory

Create the warehouse directory within the /user/hive/ parent directory:

hdfs dfs -mkdir -p /user/hive/warehouse

Add write permission for warehouse group members:

hdfs dfs -chmod g+w /user/hive/warehouse

Check if the permissions were added correctly:

hdfs dfs -ls /user/hive

The output confirms that group members now have write permission.

Step 5: Configure hive-site.xml File (Optional)

Use the following command to locate the correct file:

hduser@rk-virtual-machine:~/apache-hive-3.1.2-bin$ cd conf/

Use the hive-default.xml.template to create the hive-site.xml file:

cp hive-default.xml.template hive-site.xml

Access the hive-site.xml file using the nano text editor:

sudo nano hive-site.xml

Add the following properties at the beginning of the file:
  <property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=$HIVE_HOME/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore </description>
  </property>
 
                                  (OR)
 
<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive/java</value>
  </property>
  <property>
    <name>system:user.name</name>
    <value>${user.name}</value>
  </property>
 
And add the following property in the middle of the file, as shown:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of default database for the warehouse</description>
</property>

Note: If the command schematool -initSchema -dbType derby fails after this, you may need to replace the guava jar as described below. You may also need to open hive-site.xml, find the invalid character that usually appears around line 3224 of the file, delete it, and then run the command again:

schematool -initSchema -dbType derby

Step 6: Initiate Derby Database

Initiate the Derby database, from the Hive bin directory using the schematool command:

hduser@rk-virtual-machine:~/apache-hive-3.1.2-bin/bin$ 
schematool -initSchema -dbType derby

The process can take a few moments to complete.

The schematool command has initiated the Derby database.

Derby is the default metadata store for Hive. If you plan to use a different database solution, such as MySQL or PostgreSQL, you can specify a database type in the hive-site.xml file.
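As a rough sketch, a MySQL-backed metastore would be configured with properties along these lines in hive-site.xml (the host, database name, user, and password below are placeholders, not values from this tutorial):

  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>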

How to Fix guava Incompatibility Error in Hive

If the Derby database does not successfully initiate,  you might receive an error with the following content:

“Exception in thread “main” java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)V”

This error indicates that there is most likely an incompatibility issue between Hadoop and Hive guava versions.

Locate the guava jar file in the Hive lib directory:

ls $HIVE_HOME/lib

Location of Hive guava jar file.

Locate the guava jar file in the Hadoop lib directory as well:

ls $HADOOP_HOME/share/hadoop/hdfs/lib

Location of Hadoop guava jar file.

The two listed versions are not compatible and are causing the error. Remove the existing guava file from the Hive lib directory:

rm $HIVE_HOME/lib/guava-19.0.jar

Copy the guava file from the Hadoop lib directory to the Hive lib directory:

cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

Use the schematool command once again to initiate the Derby database:

$HIVE_HOME/bin/schematool -initSchema -dbType derby

Launch Hive Client Shell on Ubuntu

Start the Hive command-line interface using the following commands:

cd $HIVE_HOME/bin
hive

You are now able to issue SQL-like commands and directly interact with HDFS.
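For example, listing the existing databases is a quick way to confirm that the shell can reach the metastore:

hive> show databases;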


HIVE Installation Screenshots on top of Hadoop 

  


How to Fix guava Incompatibility Error in Hive:

If the error still occurs, edit the hive-site.xml file created earlier in the conf directory and remove the invalid character near line 3224.

In nano, press Ctrl+W and then Ctrl+T to jump to line number 3224.


Launch Hive Client Shell on Ubuntu:

If an error occurs when launching Hive:

This error occurs when the hive shell is started before the metastore_db is initialized. To avoid it, delete or move your metastore_db directory and try the commands below.

$ mv metastore_db metastore_db.tmp

$ schematool -dbType derby -initSchema

$ ./bin/hive

 

 

