
Monday 30 January 2023

Exp 1: Installation of HADOOP in STANDALONE MODE.

Local Mode or Standalone Mode:

  • Standalone (local) mode is the default mode in which Hadoop runs.
  • This mode is mainly used for debugging; HDFS is not used.
  • Both the input and the output use the local file system.
  • No custom configuration is required in mapred-site.xml, hdfs-site.xml or core-site.xml.
  • This is the fastest Hadoop mode because the local file system is used for both input and output.

Installation of HADOOP in STANDALONE MODE:

Objective:

Standalone mode is the default mode of operation of Hadoop and it runs on a single node (a node is your machine).

Software Requirements:
Oracle VirtualBox 5.x
Ubuntu Desktop OS 18.x (64-bit)
Hadoop 3.1.0
OpenJDK 8

Hardware Requirements:
Minimum RAM: 4 GB (suggested: 8 GB)
Minimum free disk space: 25 GB
Minimum processor: i3 or above

Analysis:

By default, Hadoop is configured to run in a non-distributed, or standalone, mode as a single Java process. No daemons are running and everything runs in a single JVM instance. HDFS is not used. Nothing needs to be configured apart from JAVA_HOME; just download the tar file and unzip it.

Installation Procedure:

Java Installation

1. sudo apt install openjdk-8-jdk

(java home : /usr/lib/jvm/java-8-openjdk-amd64)

/* Check the path of java */

bdsa@bdsa-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-openjdk-amd64/bin/javac
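The hadoop-env.sh file configured later needs JAVA_HOME pointing at the JDK directory rather than at the javac binary itself. A minimal sketch (assuming the usual Ubuntu OpenJDK 8 location shown above) to strip the trailing /bin/javac from the readlink output:

/* Derive the JDK directory for JAVA_HOME */

readlink -f /usr/bin/javac | sed 's|/bin/javac$||'

(prints /usr/lib/jvm/java-8-openjdk-amd64)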

Installation of HADOOP

Download the Hadoop file from hadoop.apache.org

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
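Optionally, verify the download by computing its SHA-512 digest and comparing it by hand with the checksum value published alongside the tarball on the Apache site:

/* Verify the downloaded archive */

sha512sum hadoop-3.3.4.tar.gz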





/* file extraction */

2. tar -zxvf hadoop-3.3.4.tar.gz


/* creation of hadoop home directory */

3. sudo mkdir /usr/lib/hadoop3


/* change ownership of the hadoop3 directory to your user */

4. sudo chown <username> /usr/lib/hadoop3

Note: Replace <username> with your Ubuntu username.
for example:rk@rk-VirtualBox:~$ pwd
/home/rk
rk@rk-VirtualBox:~$ sudo chown rk /usr/lib/hadoop3

/* Move extracted file to hadoop home directory */

5. sudo mv hadoop-3.3.4/* /usr/lib/hadoop3

6. cd /usr/lib/hadoop3

cd /home/username (change directory to the Ubuntu home directory)

pwd (present working directory)

/* Running Hadoop in standalone Mode from Hadoop Home Directory */

7. Set the Java path in hadoop-env.sh

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3$ readlink -f /usr/bin/javac

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3$ cd etc/

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc$ ls

hadoop

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc$ cd hadoop/

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc/hadoop$ ls

capacity-scheduler.xml            kms-log4j.properties
configuration.xsl                 kms-site.xml
container-executor.cfg            log4j.properties
core-site.xml                     mapred-env.cmd
hadoop-env.cmd                    mapred-env.sh
hadoop-env.sh                     mapred-queues.xml.template
hadoop-metrics2.properties        mapred-site.xml
hadoop-policy.xml                 shellprofile.d
hadoop-user-functions.sh.example  ssl-client.xml.example
hdfs-rbf-site.xml                 ssl-server.xml.example
hdfs-site.xml                     user_ec_policies.xml.template
httpfs-env.sh                     workers
httpfs-log4j.properties           yarn-env.cmd
httpfs-site.xml                   yarn-env.sh
kms-acls.xml                      yarnservice-log4j.properties
kms-env.sh                        yarn-site.xml
bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc/hadoop$ nano hadoop-env.sh 
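Inside hadoop-env.sh, set JAVA_HOME to the JDK directory found earlier (the path below assumes OpenJDK 8 on Ubuntu):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64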



8. bin/hadoop (running this with no arguments should display the usage documentation for the hadoop script)
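To check standalone operation end to end, the example job bundled with the distribution can be run against the local file system. A minimal sketch, run from the Hadoop home directory and assuming the hadoop-3.3.4 layout downloaded above (the examples jar name carries the version number):

/* Run the bundled grep example on the local file system (no HDFS involved) */

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep input output 'dfs[a-z.]+'
cat output/*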


Limitations:
Standalone mode is the default mode of operation of Hadoop and it runs on a single node (a node is your machine). HDFS and YARN do not run in standalone mode.

Conclusion:
Standalone mode is the default mode of operation of the Hadoop ecosystem, in which the Hadoop services run in a single JVM. As this experiment shows, a basic Java installation and extraction of the Hadoop files are sufficient to run Hadoop in this mode.

Sunday 29 January 2023

Comparing SQL Databases and Hadoop


 

                 Hadoop                          SQL
Data Size        Petabytes                       Gigabytes
Access           Batch                           Interactive & batch
Updates          Write once, read many times     Read & write many times
Structure        Dynamic schema                  Static schema
Integrity        Low                             High
Scaling          Linear                          Non-linear


1. SCHEMA ON WRITE VS READ: 
Generally, a traditional database follows a schema-on-write approach during data load/migration from one database to another. This can cause the data load process to abort and records to be rejected when the structure of the source and target tables differs.
In a Hadoop system, all the data is stored in HDFS and the data is centralized. The Hadoop framework is mainly used for data analytics, so it supports all three categories of data, i.e. structured, semi-structured and unstructured data, and it enables a schema-on-read approach.

2. SCALABILITY & COST:
The Hadoop framework is designed to process large volumes of data. Whenever the size of the data increases, additional resources such as data nodes can be added to the cluster far more easily than with the traditional approach of static memory allocation. The time and budget needed to implement this are relatively small, and Hadoop also provides data locality, where the data is made available on the node that executes the job.

3. FAULT TOLERANCE:
In a traditional RDBMS, when data is lost due to corruption or a network issue, it takes more time, cost and resources to get the lost data back. Hadoop, by contrast, keeps a minimum replication factor of three for data stored in HDFS. If one of the data nodes holding the data fails, the data can easily be pulled from the other data nodes, keeping it highly available. This makes the data readily available to the user irrespective of any failure.

4. FUNCTIONAL PROGRAMMING:
Hadoop supports writing functional programs in languages such as Java, Scala and Python. Any application that requires additional functionality can implement it by registering UDFs (User Defined Functions) against the data in HDFS.
In an RDBMS there is no comparable way of writing UDFs, and this increases the complexity of writing SQL.
Moreover, the data stored in HDFS can be accessed by all of the Hadoop ecosystem tools such as Hive, Pig, Sqoop and HBase. So if a UDF is written, it can be used by any of the above-mentioned applications.

5. OPTIMIZATION:
Hadoop stores data in HDFS and processes it through MapReduce with extensive optimization techniques.
The most popular techniques for handling the stored data are partitioning and bucketing.
Partitioning is an approach for storing data in HDFS by splitting it based on the column chosen for partitioning. When data is ingested or loaded into HDFS, the partition column is identified and the data is pushed into the corresponding partition directory. A query can then fetch its result set directly from the partitioned directory. This avoids a full table scan, improves response time and reduces latency.
Another approach is bucketing the data, which lets the analyst distribute the data evenly among the data nodes, so that all nodes hold an equal share of the data. The bucketing column is selected in such a way that it has the least cardinality. These approaches are not available in traditional SQL systems.
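To make the partition-directory idea concrete, here is a small sketch (the table name, partition columns and warehouse path are hypothetical) of how partitioned data typically lands in HDFS and how reading one partition avoids scanning the rest of the table:

/* Hypothetical layout of a 'sales' table partitioned by year and month */

hdfs dfs -ls /user/hive/warehouse/sales/
hdfs dfs -ls /user/hive/warehouse/sales/year=2023/
hdfs dfs -ls /user/hive/warehouse/sales/year=2023/month=01/

A query filtered on year=2023 and month=01 only needs the files under the last directory.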

6. DATA TYPE:
In a traditional approach, the supported datatypes are very limited; only structured data is supported. Thus cleaning and formatting the data to fit the schema takes more time.
Hadoop, however, supports complex data types like Array, Struct and Map. This encourages the use of many different kinds of datasets for data loading. For example, XML data can be loaded by mapping the XML elements onto complex data types.

7. DATA COMPRESSION:
There are very few built-in compression techniques available in a traditional database system, whereas the Hadoop framework offers many compression codecs, such as gzip, bzip2, LZO, LZ4 and Snappy.
Compression helps tables occupy much less space, increases throughput and speeds up query execution.
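As a sketch of how a codec can be switched on for a single job (the property names are the standard MapReduce output-compression settings; the examples jar path assumes the hadoop-3.3.4 layout used in the installation post):

/* Compress the final output of the bundled wordcount example with gzip */

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  input output_compressed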

Data in the Warehouse and Data in Hadoop

In our experience, traditional warehouses are mostly ideal for analyzing structured data from various systems and producing insights with known and relatively stable measurements. On the other hand, we feel a Hadoop-based platform is well suited to deal with semi structured and unstructured  data,  as  well  as  when  a  data  discovery  process  is  needed. That  isn’t  to  say  that Hadoop can’t be used for structured data that is readily available in a raw format. 
Processing structured data is something that your traditional database is already very good at. After all, structured data, by definition, is easy to enter, store, query and analyze. It conforms nicely to a fixed schema model of neat columns and rows that can be manipulated by Structured Query Language (SQL) to establish relationships. Semi-structured and unstructured data, on the other hand, is raw and complex, and pours in from multiple sources such as emails, text documents, videos, photos, social media posts, Twitter feeds, sensors and click streams; this is where Hadoop is used.

Storing, managing and analyzing massive volumes of semi-structured and unstructured data is what Hadoop was purpose-built to do. Hadoop as a Service provides a scalable solution to meet ever-increasing data storage and processing demands that the data warehouse can no longer handle. With its unlimited scale and on-demand access to compute and storage capacity, Hadoop as a Service is a strong match for big data processing.

Keeping costs down is a concern for every business, and traditional relational databases are certainly cost-effective. If you are considering adding Hadoop to your data warehouse, it is important to make sure that your company's big data demands are genuine and that the potential benefits to be realized from implementing Hadoop will outweigh the costs.

Shorter time-to-insight necessitates interactive querying via the analysis of smaller data sets in near or real time, and that is a task that the data warehouse has been well equipped to handle. However, Hadoop processing engines such as Spark, Hive and HBase are closing this gap.

How to install Hadoop 3.3.0 on ubuntu using shell script

Installing Hadoop:

Installation of Hadoop can be done in 3 different modes:

  1. Stand alone mode - Single Node Cluster
  2. Pseudo distributed mode - Single Node Cluster
  3. Distributed mode - Multi Node Cluster

Local Mode or Standalone Mode:

  • Standalone (local) mode is the default mode in which Hadoop runs.
  • This mode is mainly used for debugging; HDFS is not used.
  • Both the input and the output use the local file system.
  • No custom configuration is required in mapred-site.xml, hdfs-site.xml or core-site.xml.
  • This is the fastest Hadoop mode because the local file system is used for both input and output.

Pseudo-distributed Mode

  • Pseudo-distributed mode is also described as a single-node cluster, in which the NameNode and DataNode exist on the same machine.
  • Configuration files such as mapred-site.xml, hdfs-site.xml and core-site.xml are required.
  • If all the Hadoop daemons execute on a single node, the setup is called pseudo-distributed mode.
  • Both master and slave roles run on the same machine.
  • A separate Java Virtual Machine (JVM) is started for each Hadoop component, and the components communicate across network sockets. This effectively produces a fully functioning and optimized mini-cluster on a single host.

Fully-Distributed Mode

  • In this mode multiple nodes are used, so it is called the production mode of Hadoop. It is also called a multi-node cluster.
  • Data is distributed across several nodes of the Hadoop cluster.
  • In this mode the master and slave services run on separate nodes.

Note: This shell script must be run with root access (e.g. via sudo).

How to install Hadoop 3.3.0 on ubuntu using shell script

#!/bin/bash
user()
 {
    # Abort unless the script is running as root
    uid=`id -u`
    if [ "$uid" -eq 0 ]; then
        echo " "
    else
        echo "Please run this script as root"
        exit 1
    fi
}
echo -n "Enter new group for hadoop user:"
read hdgroup
echo -n "Enter username for hadoop user:"
read hduser
echo "Adding user to group"
sudo addgroup $hdgroup
sudo adduser -ingroup $hdgroup $hduser
sleep 10
echo "$hdgroup is created and $hduser is assigned to the group"
updates_check() {
echo "checking for updates please wait..."
sudo apt-get update && apt-get upgrade -y && apt-get install -y ssh
}
java_version_check() {
echo "Checking for supported java version please wait..."
java_version=$(java -version 2>&1 > /dev/null | grep version | awk '{print substr($3,4, length($3)-9);}'| tr -d ".")
if [ "$java_version" -eq "8" ]; then
echo "system has 8 installed and supported for hadoop"
java_home=$(which java)
path=$(readlink -f $java_home | cut -c 1-33)
echo $path
else
java_installation_check
fi
}
java_installation_check() {
sudo apt-get remove java-common
echo "please wait.."
sudo apt install openjdk-8-jdk -y
java_version_check
}
ssh_keys_creation(){
# Create a passwordless SSH key for the hadoop user and authorize it for localhost logins
sudo -u $hduser mkdir -p /home/$hduser/.ssh
sudo -u $hduser ssh-keygen -t rsa -P "" -f /home/$hduser/.ssh/id_rsa
sudo -u $hduser sh -c "cat /home/$hduser/.ssh/id_rsa.pub >> /home/$hduser/.ssh/authorized_keys"
sleep 10
echo "ssh Keys created"
}
hd_install() {
download() {
                wget http://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz
                sleep 2
                tar xvfz hadoop-3.3.0.tar.gz
                sleep 2
                mv hadoop-3.3.0 /home/$hduser/hadoop
}
if [ -f "/home/$hduser/hadoop-3.3.0.tar.gz" ]; then
echo "Download Already exists... using the existing......."
sleep 2
tar xvfz hadoop-3.3.0.tar.gz
sleep 5
echo "Existing Dir deleted..."
rm -rf /home/$hduser/hadoop
sleep 5
mv hadoop-3.3.0 /home/$hduser/hadoop
else
download
fi 
chown -R $hduser:$hdgroup /home/$hduser/hadoop
#tmp folder for furthur process
sudo -u $hduser mkdir -p /home/$hduser/hadoop/app/hadoop/tmp
sudo -u $hduser chown -R $hduser:$hdgroup /home/$hduser/hadoop/app/hadoop/tmp
#namenode and datanode
sudo -u $hduser mkdir -p /home/$hduser/hadoop/hadoop_store/hdfs/namenode
sudo -u $hduser mkdir -p /home/$hduser/hadoop/hadoop_store/hdfs/datanode
#changing owner to name and data node
sudo -u $hduser chown -R $hduser:$hdgroup /home/$hduser/hadoop/hadoop_store
#permission to .bashrc,hadoop-env.sh,core-site.xml,mapred.site.xml,hdfs-site.xml,yarn-site.xml
sudo -u $hduser chmod o+w /home/$hduser/.bashrc
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/core-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
 
echo "export JAVA_HOME=$path" >> /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh

#bashrc
echo -e '\n\n #Hadoop Variable START \n export HADOOP_HOME=/home/'$hduser'/hadoop \n export HADOOP_INSTALL=$HADOOP_HOME \n export HADOOP_MAPRED_HOME=$HADOOP_HOME \n export HADOOP_COMMON_HOME=$HADOOP_HOME \n export HADOOP_HDFS_HOME=$HADOOP_HOME \n export YARN_HOME=$HADOOP_HOME \n export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native \n export PATH=$PATH:$HADOOP_HOME/sbin/:$HADOOP_HOME/bin \n export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib/native \n #Hadoop Variable END\n\n' >> /home/$hduser/.bashrc
source /home/$hduser/.bashrc
#core-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>hadoop.tmp.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/app/hadoop/tmp</value>\n</property>\n<property>\n\t\t<name>fs.default.name</name>\n\t\t<value>hdfs://localhost:9000</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/core-site.xml
#mapred-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t <name>mapreduce.framework.name</name>\n\t\t <value>yarn</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
#hdfs-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>dfs.name.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/hadoop_store/hdfs/namenode</value>\n</property>\n<property>\n\t\t<name>dfs.data.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/hadoop_store/hdfs/datanode</value>\n</property>\n<property>\n\t\t<name>dfs.replication</name>\n\t\t<value>1</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
#yarn-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>yarn.nodemanager.aux-services</name>\n\t\t<value>mapreduce_shuffle</value>\n</property>\n<property>\n\t\t<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>\n\t\t<value>org.apache.hadoop.mapred.ShuffleHandler</value>\n</property>\n\t\t<property><name>yarn.resourcemanager.hostname</name>\n\t\t<value>127.0.0.1</value>\n</property>\n\t\t<property>\n\t\t<name>yarn.acl.enable</name>\n\t\t<value>0</value>\n</property>\n\t\t<property>\n\t\t<name>yarn.nodemanager.env-whitelist</name>\n\t\t<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
#revoking permissions
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/core-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
#hadoop dir
sudo ls /home/$hduser/hadoop
#ssh
# Verify passwordless SSH to localhost for the hadoop user
sudo -u $hduser ssh -o StrictHostKeyChecking=no localhost exit
}
user
updates_check
java_version_check
ssh_keys_creation
hd_install
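A minimal way to run the script, assuming it has been saved as install_hadoop.sh (the file name is arbitrary):

chmod +x install_hadoop.sh
sudo ./install_hadoop.sh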


After you have installed Hadoop using the above shell script, navigate to the hadoop/bin directory and then type the following command:

hadoop namenode -format

After formatting the NameNode, navigate to the hadoop/sbin directory and then type the following command:

 ./start-all.sh

or

./start-dfs.sh
./start-yarn.sh

After starting all five daemons from hadoop/sbin, type the following command to check that they are running:

jps (Java Virtual Machine Process Status tool)
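If everything started cleanly, jps should list the five Hadoop daemons alongside itself, along the lines of the following (actual process IDs will differ):

<pid> NameNode
<pid> DataNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> NodeManager
<pid> Jps

The NameNode web UI (http://localhost:9870) and the ResourceManager UI (http://localhost:8088) can also be opened in a browser as an additional check.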






Thursday 26 January 2023

Patterns For Big data Development

1. IT for IT Log Analytics:

Log analytics is a common use case for an inaugural Big Data project. We like to refer to all those logs and trace data that are generated by the operation of your IT solutions as data exhaust. IT departments need logs at their disposal, and today they just can't store enough logs or analyze them in a cost-efficient manner, so logs are typically kept for emergencies and discarded as soon as possible. Another reason why IT departments keep large amounts of data in logs is to look for rare problems. It is often the case that the most common problems are known and easy to deal with, but the problem that happens only once in a while is typically more difficult to diagnose and prevent from occurring again.

The  nature  of  these  logs  is  semi  structured  and  raw,  so  they  aren’t  always  suited  for traditional database processing. In addition, log formats are constantly changing due to hardware and software upgrades.

Log analytics is actually a pattern that IBM established after working with a number of companies, including some large financial services sector (FSS) companies. We have seen this use case come up with quite a few customers since; for that reason, we'll call this pattern IT for IT.

NOTE (the Notepad .LOG trick is a quick way to build such a log file):
step 1: create a log file in Notepad with .LOG as its first line and save it
.LOG
22:25 27/10/2014
Today's last entry

step 2: now double-click the created log file; Notepad appends the current time and date automatically
.LOG
22:25 27/10/2014
Today's last entry
22:26 27/10/2014

2. The Fraud Detection Pattern: 

Fraud detection comes up a lot in the financial services sector, and you'll find it in any sort of claims- or transaction-based environment: online auctions, insurance claims, and so on.


Traditionally, in fraud cases, samples and models are used to identify customers that match a certain kind of profile. In our customer experience, we estimate that only 20 percent (or maybe less) of the available information that could be useful for fraud modeling is actually being used. You can use BigInsights to provide an elastic and cost-effective repository to bring the remaining 80 percent of that information into the fraud models.


Typically, fraud detection works after a transaction gets stored, only for it to be pulled out of storage and analyzed later. Making that remaining 80 percent of the data available is cost-effective and efficient on the BigInsights platform. Once you have your fraud models built, you will want to put them into action to try to prevent the fraud in the first place. Recovery rates for fraud are dismal in all industries, so it is best to prevent fraud rather than discover it and try to recover the funds afterwards.

Think about fraud in health care markets (health insurance fraud, drug fraud, medical fraud, and so on) and the ability to get in front of insurer and government fraud schemes, given that the Federal Bureau of Investigation (FBI) estimates that health care fraud costs U.S. taxpayers over $60 billion a year. Think about fraudulent online product or ticket sales, money transfers, swiped banking cards and more, and you can see that the applicability of this usage pattern is broad.








Note:

1. Traditional fraud detection methods without big data

2. Traditional fraud detection methods with big data

3. Big Data and the Energy Sector: 

The energy sector provides many Big Data use case challenges in how to deal with the massive  volumes  of  sensor  data  from  remote  installations.  Many  companies  are  using  only  a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data. 

Take for example a typical oil drilling platform that can have 20,000 to 40,000 sensors on board.  All  of  these  sensors  are  streaming  data  about  the  health  of  the  oil  rig,  quality  of operations,  and  so  on.  Not  every  sensor  is  actively  broadcasting  at  all  times,  but  some  are reporting back many times per second. Now take a guess at what percentage of those sensors are actively utilized. 

The location chosen to install and operate a wind turbine can obviously greatly impact the amount of power generated by  the  unit,  as  well  as  how  long  it’s  able  to  remain  in  operation.  To  determine  the  optimal placement for a wind turbine, a large number of location-dependent factors must be considered, such  as  temperature,  precipitation,  wind  velocity,  humidity,  atmospheric  pressure,  and  more.    

4.The Social Media Pattern: 

The data produced by social media is enormous; according to a Visual Capitalist report, every internet minute sees:

701,389 logins on Facebook

150 million emails sent

$203,596 in sales on Amazon.com

120+ new LinkedIn accounts

347,222 tweets on Twitter

2.4 million search queries on Google

2.78 million video views on YouTube

20.8 million messages on WhatsApp  

5. The Call Center Mantra: "This Call May Be Recorded for Quality Assurance Purposes":

Call center analytics is the collection, measurement, and reporting of performance metrics within a contact center. It tracks call data and agent performance handling inbound or outbound calls. Common types of analytics include handle time, call volume, customer satisfaction, and hold time.

In most cases, call center supervisors can access this data using specialized analytics tools. However, access to these call center analytics is often limited to supervisors and team leads. More modern contact centers provide this real-time data to agents as well, so they can manage increasing call volumes.

However, with the right tools and strategy, call data helps you provide exceptional customer experience, boost brand loyalty, and improve efficiency across the board.

6. Risk:

The risks of Big Data are manifold, and organizations need to carefully plan for their use of Big Data solutions. These risks include strategic and business risks, such as operational impacts and cost overruns, as well as technical risks, such as data quality and security. 

Here are the five biggest risks of Big Data projects 

  • Security
  • Privacy
  • Costs
  • Bad Analytics
  • Bad Data

Monday 9 January 2023

Patterns for Big Data Designs

These big data design patterns aim to reduce complexity, boost the performance of integration and improve the results of working with new and larger forms of data. The following are common big data design patterns based on the various data layers: the data source and ingestion layer, the data storage layer and the data access layer.

Data sources and ingestion layer:
Enterprise big data systems face a variety of data sources with non-relevant information (noise) alongside relevant (signal) data. Noise ratio is very high compared to signals, and so filtering the noise from the pertinent information, handling high volumes, and the velocity of data is significant. This is the responsibility of the ingestion layer. 
The common challenges in the ingestion layers are as follows:
  • Multiple data source load and prioritization
  • Ingested data indexing and tagging
  • Data validation and cleansing
  • Data transformation and compression
We need patterns to address the challenges of data-source-to-ingestion-layer communication that take care of performance, scalability and availability requirements:
  • Multisource extractor
  • Multidestination
  • Protocol converter
  • Just-in-time (JIT) transformation
  • Real-time streaming pattern

Multisource extractor
An approach to ingesting multiple data types from multiple data sources efficiently is termed a multisource extractor. Efficiency covers many factors, such as data velocity, data size, data frequency, and managing various data formats over an unreliable network, mixed network bandwidth, and different technologies and systems.

The multisource extractor system ensures high availability and distribution. It also confirms that the vast volume of data gets segregated into multiple batches across different nodes. The single node implementation is still helpful for lower volumes from a handful of clients, and of course, for a significant amount of data from multiple clients processed in batches. Partitioning into small volumes in clusters produces excellent results.

Data enrichers help to do initial data aggregation and data cleansing. Enrichers ensure file transfer reliability, validations, noise reduction, compression, and transformation from native formats to standard formats. Collection agent nodes represent intermediary cluster systems, which help with final data processing and with loading the data to the destination systems. The following are the benefits of the multisource extractor:
  • Provides reasonable speed for storing and consuming the data
  • Better data prioritization and processing
  • Drives improved business decisions
  • Decoupled and independent from data production to data consumption
  • Data semantics and detection of changed data
  • Scalable and fault-tolerant system
Multidestination pattern:
The multidestination pattern is considered a better approach for overcoming all of the challenges mentioned previously. This pattern is very similar to multisourcing up to the point where it is ready to integrate with multiple destinations. The router publishes the improved data and then broadcasts it to the subscriber destinations (already registered with a publishing agent on the router). Enrichers can act as publishers as well as subscribers.
The following are the benefits of the multidestination pattern:
  • Highly scalable, flexible, fast, resilient to data failure, and cost-effective
  • Organization can start to ingest data into multiple data stores, including its existing RDBMS as well as NoSQL data stores
  • Allows you to use simple query language, such as Hive and Pig, along with traditional analytics
  • Provides the ability to partition the data for flexible access and decentralized processing
  • Possibility of decentralized computation in the data nodes
  • Due to replication on HDFS nodes, there are no data regrets
  • Self-reliant data nodes can add more nodes without any delay 
Protocol converter:
This is a mediatory approach to provide an abstraction for the incoming data of various systems. The protocol converter pattern provides an efficient way to ingest a variety of unstructured data from multiple data sources and different protocols.

The message exchanger handles synchronous and asynchronous messages from various protocols and handlers. It performs various mediator functions, such as file handling, web services message handling, stream handling, serialization, and so on.
In the above protocol converter pattern, the ingestion layer holds responsibilities such as identifying the various channels of incoming events, determining incoming data structures, providing mediated service for multiple protocols into suitable sinks, providing one standard way of representing incoming messages, providing handlers to manage various request types, and providing abstraction from the incoming protocol layers.

Just-In-Time (JIT) transformation pattern:
The JIT transformation pattern is the best fit in situations where raw data needs to be preloaded into the data stores before transformation and processing can happen. In this kind of business case, the pattern runs independent preprocessing batch jobs that clean, validate, correlate and transform the data, and then store the transformed information in the same data store (HDFS/NoSQL); that is, the transformed data can coexist with the raw data.

Please note that the data enricher of the multi-data-source pattern is absent in this pattern, and more than one batch job can run in parallel to transform the data as required in the big data store, such as HDFS, MongoDB, and so on.

Real-time streaming pattern:
The real-time streaming pattern suggests introducing an optimum number of event processing nodes to consume the different input data from the various data sources, and introducing listeners to process the events generated by those nodes in the event processing engine.
Event processing engines (event processors) have a sizeable in-memory capacity, and the event processors are triggered by specific events. The trigger or alert is responsible for publishing the results of the in-memory big data analytics to the enterprise business process engines, which in turn redirect them to various publishing channels (mobile, CIO dashboards, and so on).

Friends-of-friends-Map Reduce program

Program to illustrate FOF Map Reduce: import java.io.IOException; import java.util.*; import org.apache.hadoop.conf.Configuration; import or...