
Monday 30 January 2023

Exp 1: Installation of HADOOP in STANDALONE MODE.

Local Mode or Standalone Mode:

  • Standalone (local) mode is the default mode in which Hadoop runs.
  • This mode is mainly used for debugging; HDFS is not used.
  • Both the input and the output use the local file system.
  • No custom configuration is required in mapred-site.xml, hdfs-site.xml or core-site.xml.
  • This is the fastest Hadoop mode because the local file system is used for both input and output.

Installation of HADOOP in STANDALONE MODE:

Objective:

Standalone mode is the default mode of operation of Hadoop and it runs on a single node (a node is your machine).

Software Requirements:
Oracle VirtualBox 5.x
Ubuntu Desktop OS 18.x (64-bit)
Hadoop 3.1.0
OpenJDK 8

Hardware Requirements:
Minimum RAM: 4 GB (suggested: 8 GB)
Minimum free disk space: 25 GB
Minimum processor: i3 or above

Analysis:

By default, Hadoop is configured to run in a non-distributed, or standalone, mode as a single Java process. No daemons are running and everything runs in a single JVM instance. HDFS is not used. Nothing needs to be configured apart from JAVA_HOME; just download the tar file and unzip it.

Installation Procedure:

Java Installation

1. sudo apt install openjdk-8-jdk

(java home : /usr/lib/jvm/java-8-openjdk-amd64)

/* Check the path of java */

bdsa@bdsa-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-8-openjdk-amd64/bin/javac
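The hadoop-env.sh file configured later needs JAVA_HOME pointing at the JDK directory rather than at the javac binary itself. A minimal sketch (assuming the usual Ubuntu OpenJDK 8 location shown above) to strip the trailing /bin/javac from the readlink output:

/* Derive the JDK directory for JAVA_HOME */

readlink -f /usr/bin/javac | sed 's|/bin/javac$||'

(prints /usr/lib/jvm/java-8-openjdk-amd64)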

Installation of HADOOP

Download the Hadoop file from hadoop.apache.org

wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
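Optionally, verify the download by computing its SHA-512 digest and comparing it by hand with the checksum value published alongside the tarball on the Apache site:

/* Verify the downloaded archive */

sha512sum hadoop-3.3.4.tar.gz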





/* file extraction */

2. tar -zxvf hadoop-3.3.4.tar.gz


/* creation of hadoop home directory */

3. sudo mkdir /usr/lib/hadoop3


/* change ownership of the hadoop3 directory to your user */

4. sudo chown <username> /usr/lib/hadoop3

Note: Replace <username> with your Ubuntu username.
for example:rk@rk-VirtualBox:~$ pwd
/home/rk
rk@rk-VirtualBox:~$ sudo chown rk /usr/lib/hadoop3

/* Move extracted file to hadoop home directory */

5. sudo mv hadoop-3.3.4/* /usr/lib/hadoop3

6. cd /usr/lib/hadoop3

cd /home/username (change directory to the Ubuntu home directory)

pwd (present working directory)

/* Running Hadoop in standalone Mode from Hadoop Home Directory */

7. Set the Java path in hadoop-env.sh

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3$ readlink -f /usr/bin/javac

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3$ cd etc/

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc$ ls

hadoop

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc$ cd hadoop/

bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc/hadoop$ ls

capacity-scheduler.xml            kms-log4j.properties
configuration.xsl                 kms-site.xml
container-executor.cfg            log4j.properties
core-site.xml                     mapred-env.cmd
hadoop-env.cmd                    mapred-env.sh
hadoop-env.sh                     mapred-queues.xml.template
hadoop-metrics2.properties        mapred-site.xml
hadoop-policy.xml                 shellprofile.d
hadoop-user-functions.sh.example  ssl-client.xml.example
hdfs-rbf-site.xml                 ssl-server.xml.example
hdfs-site.xml                     user_ec_policies.xml.template
httpfs-env.sh                     workers
httpfs-log4j.properties           yarn-env.cmd
httpfs-site.xml                   yarn-env.sh
kms-acls.xml                      yarnservice-log4j.properties
kms-env.sh                        yarn-site.xml
bdasa@bdasa-VirtualBox:/usr/lib/hadoop3/etc/hadoop$ nano hadoop-env.sh 
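Inside hadoop-env.sh, set JAVA_HOME to the JDK directory found earlier (the path below assumes OpenJDK 8 on Ubuntu):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64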



8. bin/hadoop (running this with no arguments should display the usage documentation for the hadoop script)
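To check standalone operation end to end, the example job bundled with the distribution can be run against the local file system. A minimal sketch, run from the Hadoop home directory and assuming the hadoop-3.3.4 layout downloaded above (the examples jar name carries the version number):

/* Run the bundled grep example on the local file system (no HDFS involved) */

mkdir input
cp etc/hadoop/*.xml input
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar grep input output 'dfs[a-z.]+'
cat output/*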


Limitations:
Standalone mode is the default mode of operation of Hadoop and it runs on a single node (a node is your machine). HDFS and YARN do not run in standalone mode.

Conclusion:
Standalone mode is the default mode of operation of the Hadoop ecosystem, in which the Hadoop services run in a single JVM. As this experiment shows, a basic Java installation and extraction of the Hadoop files are sufficient to run Hadoop in this mode.

Sunday 29 January 2023

Comparing SQL Databases and Hadoop


 

                 Hadoop                          SQL
Data Size        Petabytes                       Gigabytes
Access           Batch                           Interactive & batch
Updates          Write once, read many times     Read & write many times
Structure        Dynamic schema                  Static schema
Integrity        Low                             High
Scaling          Linear                          Non-linear


1. SCHEMA ON WRITE VS READ: 
Generally, a traditional database follows a schema-on-write approach during data load/migration from one database to another. This can cause the data load process to abort and records to be rejected when the structure of the source and target tables differs.
In a Hadoop system, all the data is stored in HDFS and the data is centralized. The Hadoop framework is mainly used for data analytics, so it supports all three categories of data, i.e. structured, semi-structured and unstructured data, and it enables a schema-on-read approach.

2. SCALABILITY & COST:
The Hadoop framework is designed to process large volumes of data. Whenever the size of the data increases, additional resources such as data nodes can be added to the cluster far more easily than with the traditional approach of static memory allocation. The time and budget needed to implement this are relatively small, and Hadoop also provides data locality, where the data is made available on the node that executes the job.

3. FAULT TOLERANCE:
In a traditional RDBMS, when data is lost due to corruption or a network issue, it takes more time, cost and resources to get the lost data back. Hadoop, by contrast, keeps a minimum replication factor of three for data stored in HDFS. If one of the data nodes holding the data fails, the data can easily be pulled from the other data nodes, keeping it highly available. This makes the data readily available to the user irrespective of any failure.

4. FUNCTIONAL PROGRAMMING:
Hadoop supports writing functional programs in languages such as Java, Scala and Python. Any application that requires additional functionality can implement it by registering UDFs (User Defined Functions) against the data in HDFS.
In an RDBMS there is no comparable way of writing UDFs, and this increases the complexity of writing SQL.
Moreover, the data stored in HDFS can be accessed by all of the Hadoop ecosystem tools such as Hive, Pig, Sqoop and HBase. So if a UDF is written, it can be used by any of the above-mentioned applications.

5. OPTIMIZATION:
Hadoop stores data in HDFS and processes it through MapReduce with extensive optimization techniques.
The most popular techniques for handling the stored data are partitioning and bucketing.
Partitioning is an approach for storing data in HDFS by splitting it based on the column chosen for partitioning. When data is ingested or loaded into HDFS, the partition column is identified and the data is pushed into the corresponding partition directory. A query can then fetch its result set directly from the partitioned directory. This avoids a full table scan, improves response time and reduces latency.
Another approach is bucketing the data, which lets the analyst distribute the data evenly among the data nodes, so that all nodes hold an equal share of the data. The bucketing column is selected in such a way that it has the least cardinality. These approaches are not available in traditional SQL systems.
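To make the partition-directory idea concrete, here is a small sketch (the table name, partition columns and warehouse path are hypothetical) of how partitioned data typically lands in HDFS and how reading one partition avoids scanning the rest of the table:

/* Hypothetical layout of a 'sales' table partitioned by year and month */

hdfs dfs -ls /user/hive/warehouse/sales/
hdfs dfs -ls /user/hive/warehouse/sales/year=2023/
hdfs dfs -ls /user/hive/warehouse/sales/year=2023/month=01/

A query filtered on year=2023 and month=01 only needs the files under the last directory.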

6. DATA TYPE:
In a traditional approach, the supported datatypes are very limited; only structured data is supported. Thus cleaning and formatting the data to fit the schema takes more time.
Hadoop, however, supports complex data types like Array, Struct and Map. This encourages the use of many different kinds of datasets for data loading. For example, XML data can be loaded by mapping the XML elements onto complex data types.

7. DATA COMPRESSION:
There are very few built-in compression techniques available in a traditional database system, whereas the Hadoop framework offers many compression codecs, such as gzip, bzip2, LZO, LZ4 and Snappy.
Compression helps tables occupy much less space, increases throughput and speeds up query execution.
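As a sketch of how a codec can be switched on for a single job (the property names are the standard MapReduce output-compression settings; the examples jar path assumes the hadoop-3.3.4 layout used in the installation post):

/* Compress the final output of the bundled wordcount example with gzip */

bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.4.jar wordcount \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec \
  input output_compressed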

Data in the Warehouse and Data in Hadoop

In our experience, traditional warehouses are mostly ideal for analyzing structured data from various systems and producing insights with known and relatively stable measurements. On the other hand, we feel a Hadoop-based platform is well suited to deal with semi structured and unstructured  data,  as  well  as  when  a  data  discovery  process  is  needed. That  isn’t  to  say  that Hadoop can’t be used for structured data that is readily available in a raw format. 
Processing structured data is something that your traditional database is already very good at. After all, structured data, by definition, is easy to enter, store, query and analyze. It conforms nicely to a fixed schema model of neat columns and rows that can be manipulated by Structured Query Language (SQL) to establish relationships. Semi-structured and unstructured data, on the other hand, is raw and complex, and pours in from multiple sources such as emails, text documents, videos, photos, social media posts, Twitter feeds, sensors and click streams; this is where Hadoop is used.

Storing, managing and analyzing massive volumes of semi-structured and unstructured data is what Hadoop was purpose-built to do. Hadoop as a Service provides a scalable solution to meet ever-increasing data storage and processing demands that the data warehouse can no longer handle. With its unlimited scale and on-demand access to compute and storage capacity, Hadoop as a Service is a strong match for big data processing.

Keeping costs down is a concern for every business, and traditional relational databases are certainly cost-effective. If you are considering adding Hadoop to your data warehouse, it is important to make sure that your company's big data demands are genuine and that the potential benefits to be realized from implementing Hadoop will outweigh the costs.

Shorter time-to-insight necessitates interactive querying via the analysis of smaller data sets in near or real time, and that is a task that the data warehouse has been well equipped to handle. However, Hadoop processing engines such as Spark, Hive and HBase are closing this gap.

How to install Hadoop 3.3.0 on ubuntu using shell script

Installing Hadoop:

Installation of Hadoop can be done in 3 different modes:

  1. Stand alone mode - Single Node Cluster
  2. Pseudo distributed mode - Single Node Cluster
  3. Distributed mode - Multi Node Cluster

Local Mode or Standalone Mode:

  • Standalone (local) mode is the default mode in which Hadoop runs.
  • This mode is mainly used for debugging; HDFS is not used.
  • Both the input and the output use the local file system.
  • No custom configuration is required in mapred-site.xml, hdfs-site.xml or core-site.xml.
  • This is the fastest Hadoop mode because the local file system is used for both input and output.

Pseudo-distributed Mode

  • Pseudo-distributed mode is also described as a single-node cluster, in which the NameNode and DataNode exist on the same machine.
  • Configuration files such as mapred-site.xml, hdfs-site.xml and core-site.xml are required.
  • If all the Hadoop daemons execute on a single node, the setup is called pseudo-distributed mode.
  • Both master and slave roles run on the same machine.
  • A separate Java Virtual Machine (JVM) is started for each Hadoop component, and the components communicate across network sockets. This effectively produces a fully functioning and optimized mini-cluster on a single host.

Fully-Distributed Mode

  • In this mode multiple nodes are used, so it is called the production mode of Hadoop. It is also called a multi-node cluster.
  • Data is distributed across several nodes of the Hadoop cluster.
  • In this mode the master and slave services run on separate nodes.

Note: This shell script must be run with root access (e.g. via sudo).

How to install Hadoop 3.3.0 on ubuntu using shell script

#!/bin/bash
user()
 {
    # Abort unless the script is running as root
    uid=`id -u`
    if [ "$uid" -eq 0 ]; then
        echo " "
    else
        echo "Please run this script as root"
        exit 1
    fi
}
echo -n "Enter new group for hadoop user:"
read hdgroup
echo -n "Enter username for hadoop user:"
read hduser
echo "Adding user to group"
sudo addgroup $hdgroup
sudo adduser -ingroup $hdgroup $hduser
sleep 10
echo "$hdgroup is created and $hduser is assigned to the group"
updates_check() {
echo "checking for updates please wait..."
sudo apt-get update && apt-get upgrade -y && apt-get install -y ssh
}
java_version_check() {
echo "Checking for supported java version please wait..."
java_version=$(java -version 2>&1 > /dev/null | grep version | awk '{print substr($3,4, length($3)-9);}'| tr -d ".")
if [ "$java_version" -eq "8" ]; then
echo "system has 8 installed and supported for hadoop"
java_home=$(which java)
path=$(readlink -f $java_home | cut -c 1-33)
echo $path
else
java_installation_check
fi
}
java_installation_check() {
sudo apt-get remove java-common
echo "please wait.."
sudo apt install openjdk-8-jdk -y
java_version_check
}
ssh_keys_creation(){
# Create a passwordless SSH key for the hadoop user and authorize it for localhost logins
sudo -u $hduser mkdir -p /home/$hduser/.ssh
sudo -u $hduser ssh-keygen -t rsa -P "" -f /home/$hduser/.ssh/id_rsa
sudo -u $hduser sh -c "cat /home/$hduser/.ssh/id_rsa.pub >> /home/$hduser/.ssh/authorized_keys"
sleep 10
echo "ssh Keys created"
}
hd_install() {
download() {
                wget http://archive.apache.org/dist/hadoop/core/hadoop-3.3.0/hadoop-3.3.0.tar.gz
                sleep 2
                tar xvfz hadoop-3.3.0.tar.gz
                sleep 2
                mv hadoop-3.3.0 /home/$hduser/hadoop
}
if [ -f "/home/$hduser/hadoop-3.3.0.tar.gz" ]; then
echo "Download Already exists... using the existing......."
sleep 2
tar xvfz hadoop-3.3.0.tar.gz
sleep 5
echo "Existing Dir deleted..."
rm -rf /home/$hduser/hadoop
sleep 5
mv hadoop-3.3.0 /home/$hduser/hadoop
else
download
fi 
chown -R $hduser:$hdgroup /home/$hduser/hadoop
#tmp folder for furthur process
sudo -u $hduser mkdir -p /home/$hduser/hadoop/app/hadoop/tmp
sudo -u $hduser chown -R $hduser:$hdgroup /home/$hduser/hadoop/app/hadoop/tmp
#namenode and datanode
sudo -u $hduser mkdir -p /home/$hduser/hadoop/hadoop_store/hdfs/namenode
sudo -u $hduser mkdir -p /home/$hduser/hadoop/hadoop_store/hdfs/datanode
#changing owner to name and data node
sudo -u $hduser chown -R $hduser:$hdgroup /home/$hduser/hadoop/hadoop_store
#permission to .bashrc,hadoop-env.sh,core-site.xml,mapred.site.xml,hdfs-site.xml,yarn-site.xml
sudo -u $hduser chmod o+w /home/$hduser/.bashrc
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/core-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
sudo -u $hduser chmod o+w /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
 
echo "export JAVA_HOME=$path" >> /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh

#bashrc
echo -e '\n\n #Hadoop Variable START \n export HADOOP_HOME=/home/'$hduser'/hadoop \n export HADOOP_INSTALL=$HADOOP_HOME \n export HADOOP_MAPRED_HOME=$HADOOP_HOME \n export HADOOP_COMMON_HOME=$HADOOP_HOME \n export HADOOP_HDFS_HOME=$HADOOP_HOME \n export YARN_HOME=$HADOOP_HOME \n export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native \n export PATH=$PATH:$HADOOP_HOME/sbin/:$HADOOP_HOME/bin \n export HADOOP_OPTS=-Djava.library.path=$HADOOP_HOME/lib/native \n #Hadoop Variable END\n\n' >> /home/$hduser/.bashrc
source /home/$hduser/.bashrc
#core-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>hadoop.tmp.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/app/hadoop/tmp</value>\n</property>\n<property>\n\t\t<name>fs.default.name</name>\n\t\t<value>hdfs://localhost:9000</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/core-site.xml
#mapred-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t <name>mapreduce.framework.name</name>\n\t\t <value>yarn</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
#hdfs-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>dfs.name.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/hadoop_store/hdfs/namenode</value>\n</property>\n<property>\n\t\t<name>dfs.data.dir</name>\n\t\t<value>/home/'$hduser'/hadoop/hadoop_store/hdfs/datanode</value>\n</property>\n<property>\n\t\t<name>dfs.replication</name>\n\t\t<value>1</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
#yarn-site.xml
sudo sed -i '/<configuration>/a <property>\n\t\t<name>yarn.nodemanager.aux-services</name>\n\t\t<value>mapreduce_shuffle</value>\n</property>\n<property>\n\t\t<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>\n\t\t<value>org.apache.hadoop.mapred.ShuffleHandler</value>\n</property>\n\t\t<property><name>yarn.resourcemanager.hostname</name>\n\t\t<value>127.0.0.1</value>\n</property>\n\t\t<property>\n\t\t<name>yarn.acl.enable</name>\n\t\t<value>0</value>\n</property>\n\t\t<property>\n\t\t<name>yarn.nodemanager.env-whitelist</name>\n\t\t<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>\n</property>' /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
#revoking permissions
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/hadoop-env.sh
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/core-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/mapred-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/hdfs-site.xml
sudo -u $hduser chmod o-w /home/$hduser/hadoop/etc/hadoop/yarn-site.xml
#hadoop dir
sudo ls /home/$hduser/hadoop
#ssh
# Verify passwordless SSH to localhost for the hadoop user
sudo -u $hduser ssh -o StrictHostKeyChecking=no localhost exit
}
user
updates_check
java_version_check
ssh_keys_creation
hd_install
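A minimal way to run the script, assuming it has been saved as install_hadoop.sh (the file name is arbitrary):

chmod +x install_hadoop.sh
sudo ./install_hadoop.sh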


After you have installed Hadoop using the above shell script, navigate to the hadoop/bin directory and then type the following command:

hadoop namenode -format

After formatting the NameNode, navigate to the hadoop/sbin directory and then type the following command:

 ./start-all.sh

or

./start-dfs.sh
./start-yarn.sh

After starting all five daemons from hadoop/sbin, type the following command to check that they are running:

jps (Java Virtual Machine Process Status tool)
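If everything started cleanly, jps should list the five Hadoop daemons alongside itself, along the lines of the following (actual process IDs will differ):

<pid> NameNode
<pid> DataNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> NodeManager
<pid> Jps

The NameNode web UI (http://localhost:9870) and the ResourceManager UI (http://localhost:8088) can also be opened in a browser as an additional check.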






Thursday 26 January 2023

Patterns For Big data Development

1. IT for IT Log Analytics:

Log analytics is a common use case for an inaugural Big Data project. We like to refer to all those logs and trace data that are generated by the operation of your IT solutions as data exhaust. IT departments need logs at their disposal, and today they just can't store enough logs or analyze them in a cost-efficient manner, so logs are typically kept for emergencies and discarded as soon as possible. Another reason why IT departments keep large amounts of data in logs is to look for rare problems. It is often the case that the most common problems are known and easy to deal with, but the problem that happens only once in a while is typically more difficult to diagnose and prevent from occurring again.

The  nature  of  these  logs  is  semi  structured  and  raw,  so  they  aren’t  always  suited  for traditional database processing. In addition, log formats are constantly changing due to hardware and software upgrades.

Log analytics is actually a pattern that IBM established after working with a number of companies, including some large financial services sector (FSS) companies. We have seen this use case come up with quite a few customers since; for that reason, we'll call this pattern IT for IT.

NOTE (the Notepad .LOG trick is a quick way to build such a log file):
step 1: create a log file in Notepad with .LOG as its first line and save it
.LOG
22:25 27/10/2014
Today's last entry

step 2: now double-click the created log file; Notepad appends the current time and date automatically
.LOG
22:25 27/10/2014
Today's last entry
22:26 27/10/2014

2. The Fraud Detection Pattern: 

Fraud detection comes up a lot in the financial services sector, and you'll find it in any sort of claims- or transaction-based environment: online auctions, insurance claims, and so on.


Traditionally, in fraud cases, samples and models are used to identify customers that match a certain kind of profile. In our customer experience, we estimate that only 20 percent (or maybe less) of the available information that could be useful for fraud modeling is actually being used. You can use BigInsights to provide an elastic and cost-effective repository to bring the remaining 80 percent of that information into the fraud models.


Typically, fraud detection works after a transaction gets stored, only for it to be pulled out of storage and analyzed later. Making that remaining 80 percent of the data available is cost-effective and efficient on the BigInsights platform. Once you have your fraud models built, you will want to put them into action to try to prevent the fraud in the first place. Recovery rates for fraud are dismal in all industries, so it is best to prevent fraud rather than discover it and try to recover the funds afterwards.

Think about fraud in health care markets (health insurance fraud, drug fraud, medical fraud, and so on) and the ability to get in front of insurer and government fraud schemes, given that the Federal Bureau of Investigation (FBI) estimates that health care fraud costs U.S. taxpayers over $60 billion a year. Think about fraudulent online product or ticket sales, money transfers, swiped banking cards and more, and you can see that the applicability of this usage pattern is broad.








Note:

1. Traditional fraud detection methods without big data

2. Traditional fraud detection methods with big data

3. Big Data and the Energy Sector: 

The energy sector provides many Big Data use case challenges in how to deal with the massive  volumes  of  sensor  data  from  remote  installations.  Many  companies  are  using  only  a fraction of the data being collected, because they lack the infrastructure to store or analyze the available scale of data. 

Take for example a typical oil drilling platform that can have 20,000 to 40,000 sensors on board.  All  of  these  sensors  are  streaming  data  about  the  health  of  the  oil  rig,  quality  of operations,  and  so  on.  Not  every  sensor  is  actively  broadcasting  at  all  times,  but  some  are reporting back many times per second. Now take a guess at what percentage of those sensors are actively utilized. 

The location chosen to install and operate a wind turbine can obviously greatly impact the amount of power generated by  the  unit,  as  well  as  how  long  it’s  able  to  remain  in  operation.  To  determine  the  optimal placement for a wind turbine, a large number of location-dependent factors must be considered, such  as  temperature,  precipitation,  wind  velocity,  humidity,  atmospheric  pressure,  and  more.    

4.The Social Media Pattern: 

The data produced by social media is enormous; according to a Visual Capitalist report, every internet minute sees:

701,389 logins on Facebook

150 million emails sent

$203,596 in sales on Amazon.com

120+ new LinkedIn accounts

347,222 tweets on Twitter

2.4 million search queries on Google

2.78 million video views on YouTube

20.8 million messages on WhatsApp  

5. The Call Center Mantra: "This Call May Be Recorded for Quality Assurance Purposes":

Call center analytics is the collection, measurement, and reporting of performance metrics within a contact center. It tracks call data and agent performance handling inbound or outbound calls. Common types of analytics include handle time, call volume, customer satisfaction, and hold time.

In most cases, call center supervisors can access this data using specialized analytics tools. However, access to these call center analytics is often limited to supervisors and team leads. More modern contact centers provide this real-time data to agents as well, so they can manage increasing call volumes.

However, with the right tools and strategy, call data helps you provide exceptional customer experience, boost brand loyalty, and improve efficiency across the board.

6. Risk:

The risks of Big Data are manifold, and organizations need to carefully plan for their use of Big Data solutions. These risks include strategic and business risks, such as operational impacts and cost overruns, as well as technical risks, such as data quality and security. 

Here are the five biggest risks of Big Data projects 

  • Security
  • Privacy
  • Costs
  • Bad Analytics
  • Bad Data

Monday 9 January 2023

Patterns for Big Data Designs

These big data design patterns aim to reduce complexity, boost the performance of integration and improve the results of working with new and larger forms of data. The following are common big data design patterns based on the various data layers: the data source and ingestion layer, the data storage layer and the data access layer.

Data sources and ingestion layer:
Enterprise big data systems face a variety of data sources with non-relevant information (noise) alongside relevant (signal) data. Noise ratio is very high compared to signals, and so filtering the noise from the pertinent information, handling high volumes, and the velocity of data is significant. This is the responsibility of the ingestion layer. 
The common challenges in the ingestion layers are as follows:
  • Multiple data source load and prioritization
  • Ingested data indexing and tagging
  • Data validation and cleansing
  • Data transformation and compression
We need patterns to address the challenges of data-source-to-ingestion-layer communication that take care of performance, scalability and availability requirements:
  • Multisource extractor
  • Multidestination
  • Protocol converter
  • Just-in-time (JIT) transformation
  • Real-time streaming pattern

Multisource extractor
An approach to ingesting multiple data types from multiple data sources efficiently is termed a multisource extractor. Efficiency covers many factors, such as data velocity, data size, data frequency, and managing various data formats over an unreliable network, mixed network bandwidth, and different technologies and systems.

The multisource extractor system ensures high availability and distribution. It also confirms that the vast volume of data gets segregated into multiple batches across different nodes. The single node implementation is still helpful for lower volumes from a handful of clients, and of course, for a significant amount of data from multiple clients processed in batches. Partitioning into small volumes in clusters produces excellent results.

Data enrichers help to do initial data aggregation and data cleansing. Enrichers ensure file transfer reliability, validations, noise reduction, compression, and transformation from native formats to standard formats. Collection agent nodes represent intermediary cluster systems, which help with final data processing and with loading the data to the destination systems. The following are the benefits of the multisource extractor:
  • Provides reasonable speed for storing and consuming the data
  • Better data prioritization and processing
  • Drives improved business decisions
  • Decoupled and independent from data production to data consumption
  • Data semantics and detection of changed data
  • Scalable and fault-tolerant system
Multidestination pattern:
The multidestination pattern is considered a better approach for overcoming all of the challenges mentioned previously. This pattern is very similar to multisourcing up to the point where it is ready to integrate with multiple destinations. The router publishes the improved data and then broadcasts it to the subscriber destinations (already registered with a publishing agent on the router). Enrichers can act as publishers as well as subscribers.
The following are the benefits of the multidestination pattern:
  • Highly scalable, flexible, fast, resilient to data failure, and cost-effective
  • Organization can start to ingest data into multiple data stores, including its existing RDBMS as well as NoSQL data stores
  • Allows you to use simple query language, such as Hive and Pig, along with traditional analytics
  • Provides the ability to partition the data for flexible access and decentralized processing
  • Possibility of decentralized computation in the data nodes
  • Due to replication on HDFS nodes, there are no data regrets
  • Self-reliant data nodes can add more nodes without any delay 
Protocol converter:
This is a mediatory approach to provide an abstraction for the incoming data of various systems. The protocol converter pattern provides an efficient way to ingest a variety of unstructured data from multiple data sources and different protocols.

The message exchanger handles synchronous and asynchronous messages from various protocols and handlers. It performs various mediator functions, such as file handling, web services message handling, stream handling, serialization, and so on.
In the above protocol converter pattern, the ingestion layer holds responsibilities such as identifying the various channels of incoming events, determining incoming data structures, providing mediated service for multiple protocols into suitable sinks, providing one standard way of representing incoming messages, providing handlers to manage various request types, and providing abstraction from the incoming protocol layers.

Just-In-Time (JIT) transformation pattern:
The JIT transformation pattern is the best fit in situations where raw data needs to be preloaded into the data stores before transformation and processing can happen. In this kind of business case, the pattern runs independent preprocessing batch jobs that clean, validate, correlate and transform the data, and then store the transformed information in the same data store (HDFS/NoSQL); that is, the transformed data can coexist with the raw data.

Please note that the data enricher of the multi-data-source pattern is absent in this pattern, and more than one batch job can run in parallel to transform the data as required in the big data store, such as HDFS, MongoDB, and so on.

Real-time streaming pattern:
The real-time streaming pattern suggests introducing an optimum number of event processing nodes to consume the different input data from the various data sources, and introducing listeners to process the events generated by those nodes in the event processing engine.
Event processing engines (event processors) have a sizeable in-memory capacity, and the event processors are triggered by specific events. The trigger or alert is responsible for publishing the results of the in-memory big data analytics to the enterprise business process engines, which in turn redirect them to various publishing channels (mobile, CIO dashboards, and so on).

Friends-of-friends-Map Reduce program

Program to illustrate FOF Map Reduce: import java.io.IOException; import java.util.*; import org.apache.hadoop.conf.Configuration; import or...