
Monday, 9 January 2023

Patterns for Big Data Designs

These big data design patterns aim to reduce complexity, boost the performance of integration, and improve the results of working with new and larger forms of data. This post covers the common big data design patterns based on the various data layers: the data sources and ingestion layer, the data storage layer, and the data access layer.

Data sources and ingestion layer:
Enterprise big data systems face a variety of data sources that carry irrelevant information (noise) alongside relevant (signal) data. The noise-to-signal ratio is typically very high, so filtering the noise from the pertinent information, handling high volumes, and keeping up with the velocity of data are significant tasks. This is the responsibility of the ingestion layer. 
The common challenges in the ingestion layers are as follows:
  • Multiple data source load and prioritization
  • Ingested data indexing and tagging
  • Data validation and cleansing
  • Data transformation and compression
The building blocks of the ingestion layer and its various components must address the challenges of data source to ingestion layer communication while taking care of performance, scalability, and availability requirements. The common patterns at this layer are as follows; a minimal sketch of a single ingestion step appears after the list:
  • Multisource extractor
  • Multidestination
  • Protocol converter
  • Just-in-time (JIT) transformation
  • Real-time streaming pattern
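The sketch below is written in plain Java with hypothetical record and method names (it does not use any particular ingestion framework). It walks a batch of raw records through the challenges listed above: filtering noise from signal, validation and cleansing, and indexing/tagging, leaving records ready for the storage layer.

import java.util.*;
import java.util.stream.*;

// A minimal, framework-free sketch of one ingestion step: filter noise, validate, tag.
public class IngestionStepSketch {

    // Hypothetical raw record as read from some source.
    record RawRecord(String sourceId, String payload) {}

    // Hypothetical enriched record, ready for the storage layer.
    record IngestedRecord(String sourceId, String payload, Map<String, String> tags) {}

    // Noise filter: keep only records whose payload carries a usable signal.
    static boolean isSignal(RawRecord r) {
        return r.payload() != null && !r.payload().isBlank();
    }

    // Validation and cleansing: trim whitespace and reject payloads that end up empty.
    static Optional<RawRecord> validate(RawRecord r) {
        String cleaned = r.payload().trim();
        return cleaned.isEmpty() ? Optional.empty()
                                 : Optional.of(new RawRecord(r.sourceId(), cleaned));
    }

    // Indexing and tagging: attach metadata used later for prioritization and lookup.
    static IngestedRecord tag(RawRecord r) {
        return new IngestedRecord(r.sourceId(), r.payload(),
                Map.of("source", r.sourceId(),
                       "ingestedAt", String.valueOf(System.currentTimeMillis())));
    }

    public static void main(String[] args) {
        List<RawRecord> incoming = List.of(
                new RawRecord("sensor-1", "  temp=21.5 "),
                new RawRecord("sensor-2", "   "));            // pure noise, will be dropped

        List<IngestedRecord> ready = incoming.stream()
                .filter(IngestionStepSketch::isSignal)
                .map(IngestionStepSketch::validate)
                .flatMap(Optional::stream)
                .map(IngestionStepSketch::tag)
                .collect(Collectors.toList());

        ready.forEach(System.out::println);
    }
}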

Multisource extractor
An approach to efficiently ingesting multiple data types from multiple data sources is termed a multisource extractor. Efficiency here covers many factors, such as data velocity, data size, data frequency, and managing various data formats over unreliable networks, mixed network bandwidth, and different technologies and systems.

The multisource extractor system ensures high availability and distribution. It also ensures that the vast volume of data is segregated into multiple batches across different nodes. A single-node implementation is still helpful for lower volumes from a handful of clients and, of course, for a significant amount of data from multiple clients processed in batches. Partitioning the data into small volumes across cluster nodes produces excellent results.

Data enrichers help with initial data aggregation and data cleansing. Enrichers ensure file transfer reliability, validation, noise reduction, compression, and transformation from native formats to standard formats. Collection agent nodes represent intermediary cluster systems, which help with final data processing and loading the data into the destination systems. The following are the benefits of the multisource extractor (a minimal extract-and-batch sketch follows the list):
  • Provides reasonable speed for storing and consuming the data
  • Better data prioritization and processing
  • Drives improved business decisions
  • Decoupled and independent from data production to data consumption
  • Data semantics and detection of changed data
  • Scalable and fault-tolerant system
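As a rough illustration of the batching idea in this pattern, the following plain Java sketch (hypothetical names, no Hadoop APIs) shows several source extractors pushing records into a shared buffer while collection agents segregate the stream into small batches for parallel loading.

import java.util.*;
import java.util.concurrent.*;

// A minimal sketch of the multisource extractor: many sources feed one buffer,
// collection agents drain the buffer into small batches for downstream loading.
public class MultisourceExtractorSketch {

    static final int BATCH_SIZE = 100;
    static final BlockingQueue<String> buffer = new LinkedBlockingQueue<>(10_000);

    // One extractor per data source; real sources would be files, APIs, message queues, etc.
    static Runnable extractor(String sourceName, int records) {
        return () -> {
            for (int i = 0; i < records; i++) {
                try {
                    buffer.put(sourceName + "-record-" + i);   // back-pressure if the buffer is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        };
    }

    // Collection agent: segregates the stream into small batches for parallel loading.
    static Runnable collectionAgent(String agentName) {
        return () -> {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String record;
            while ((record = buffer.poll()) != null) {          // queue already filled in this sketch
                batch.add(record);
                if (batch.size() == BATCH_SIZE) {
                    System.out.println(agentName + " loading batch of " + batch.size());
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                System.out.println(agentName + " loading final batch of " + batch.size());
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(extractor("crm", 250));
        pool.submit(extractor("weblogs", 250));
        Thread.sleep(500);                                      // let the extractors fill the buffer
        pool.submit(collectionAgent("agent-1"));
        pool.submit(collectionAgent("agent-2"));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}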
Multidestination pattern:
The multidestination pattern is considered a better approach for overcoming all of the challenges mentioned previously. This pattern is very similar to the multisource extractor, except that it integrates with multiple destinations. The router publishes the enriched data and then broadcasts it to the subscriber destinations (already registered with a publishing agent on the router). Enrichers can act as publishers as well as subscribers; a publish/subscribe sketch of the router follows the benefit list below.
The following are the benefits of the multidestination pattern:
  • Highly scalable, flexible, fast, resilient to data failure, and cost-effective
  • Organizations can start to ingest data into multiple data stores, including their existing RDBMS as well as NoSQL data stores
  • Allows you to use simple query languages, such as Hive and Pig, along with traditional analytics
  • Provides the ability to partition the data for flexible access and decentralized processing
  • Possibility of decentralized computation in the data nodes
  • Due to replication on HDFS nodes, there is no risk of losing ingested data
  • Self-reliant data nodes mean that more nodes can be added without any delay
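The following is a minimal publish/subscribe sketch of the multidestination pattern in plain Java. The router broadcasts each enriched record to every registered destination; the sink names (HDFS, RDBMS, NoSQL) are placeholders rather than real connectors.

import java.util.*;
import java.util.function.Consumer;

// A minimal sketch of the multidestination router: one publisher, many subscriber sinks.
public class MultidestinationSketch {

    // A destination is simply a subscriber that consumes enriched records.
    interface Destination extends Consumer<String> {}

    // The router keeps the registry of subscriber destinations and broadcasts to all of them.
    static class IngestionRouter {
        private final List<Destination> subscribers = new ArrayList<>();

        void register(Destination d) { subscribers.add(d); }

        void publish(String enrichedRecord) {
            for (Destination d : subscribers) {
                d.accept(enrichedRecord);          // each store ingests the same record
            }
        }
    }

    public static void main(String[] args) {
        IngestionRouter router = new IngestionRouter();
        router.register(rec -> System.out.println("HDFS sink  <- " + rec));
        router.register(rec -> System.out.println("RDBMS sink <- " + rec));
        router.register(rec -> System.out.println("NoSQL sink <- " + rec));

        // An enricher acting as publisher: cleanses the raw record, then hands it to the router.
        String raw = "  user=42,event=login ";
        router.publish(raw.trim());
    }
}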
Protocol converter:
This is a mediation approach that provides an abstraction for the incoming data from various systems. The protocol converter pattern provides an efficient way to ingest a variety of unstructured data from multiple data sources that use different protocols.

The message exchanger handles synchronous and asynchronous messages from various protocols and handlers. It performs various mediation functions, such as file handling, web services message handling, stream handling, serialization, and so on.
In the protocol converter pattern, the ingestion layer holds responsibilities such as:
  • Identifying the various channels of incoming events
  • Determining the incoming data structures
  • Providing a mediated service for multiple protocols into suitable sinks
  • Providing one standard way of representing incoming messages
  • Providing handlers to manage various request types
  • Providing abstraction from the incoming protocol layers
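As a rough sketch of such a message exchanger, the following plain Java example registers one handler per protocol and mediates every incoming payload into a single standard message representation. The protocol names and conversion rules are illustrative assumptions only.

import java.util.*;

// A minimal sketch of the protocol converter: per-protocol handlers produce one standard message.
public class ProtocolConverterSketch {

    // The one standard way of representing an incoming message inside the ingestion layer.
    record StandardMessage(String protocol, String body, long receivedAt) {}

    // A handler knows how to convert one protocol's payload into the standard representation.
    interface ProtocolHandler {
        StandardMessage convert(String rawPayload);
    }

    // The message exchanger routes an incoming payload to the handler registered for its channel.
    static class MessageExchanger {
        private final Map<String, ProtocolHandler> handlers = new HashMap<>();

        void register(String protocol, ProtocolHandler handler) { handlers.put(protocol, handler); }

        StandardMessage mediate(String protocol, String rawPayload) {
            ProtocolHandler handler = handlers.get(protocol);
            if (handler == null) throw new IllegalArgumentException("No handler for " + protocol);
            return handler.convert(rawPayload);
        }
    }

    public static void main(String[] args) {
        MessageExchanger exchanger = new MessageExchanger();
        exchanger.register("file",   p -> new StandardMessage("file", p.trim(), System.currentTimeMillis()));
        exchanger.register("soap",   p -> new StandardMessage("soap", p.replaceAll("<[^>]+>", ""), System.currentTimeMillis()));
        exchanger.register("stream", p -> new StandardMessage("stream", p, System.currentTimeMillis()));

        System.out.println(exchanger.mediate("file", " order,42,paid \n"));
        System.out.println(exchanger.mediate("soap", "<order><id>42</id></order>"));
    }
}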

Just-In-Time (JIT) transformation pattern:
The JIT transformation pattern is the best fit in situations where raw data needs to be preloaded into the data stores before transformation and processing can happen. In this kind of business case, the pattern runs independent preprocessing batch jobs that clean, validate, correlate, and transform the data, and then store the transformed information in the same data store (HDFS/NoSQL); that is, the transformed data can coexist with the raw data.

Note that the data enricher of the multisource extractor pattern is absent in this pattern, and more than one batch job can run in parallel to transform the data as required in the big data store, such as HDFS, MongoDB, and so on.
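A minimal sketch of this pattern in plain Java is shown below: raw partitions are already preloaded into a shared store, and independent batch jobs clean, validate, and transform them in parallel, writing the results back into the same store alongside the raw data. The store and partition names are hypothetical stand-ins for HDFS paths or NoSQL collections.

import java.util.*;
import java.util.concurrent.*;

// A minimal sketch of JIT transformation: parallel batch jobs over preloaded raw partitions.
public class JitTransformationSketch {

    // Stand-in for the shared big data store (e.g. an HDFS directory or a NoSQL collection).
    static final Map<String, List<String>> store = new ConcurrentHashMap<>();

    // One independent batch job: reads a raw partition, transforms it, writes a new partition.
    static Callable<Integer> batchJob(String rawPartition, String transformedPartition) {
        return () -> {
            List<String> transformed = new ArrayList<>();
            for (String raw : store.getOrDefault(rawPartition, List.of())) {
                String cleaned = raw.trim().toLowerCase();        // cleansing + validation
                if (!cleaned.isEmpty()) transformed.add(cleaned); // drop records that fail validation
            }
            store.put(transformedPartition, transformed);         // coexists with the raw partition
            return transformed.size();
        };
    }

    public static void main(String[] args) throws Exception {
        // Raw data was preloaded before any transformation happens.
        store.put("raw/2023-01-08", List.of(" Click,A ", "  ", " Click,B "));
        store.put("raw/2023-01-09", List.of(" View,C ", " View,D "));

        // More than one batch job can run in parallel, each over its own partition.
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Future<Integer> day1 = pool.submit(batchJob("raw/2023-01-08", "transformed/2023-01-08"));
        Future<Integer> day2 = pool.submit(batchJob("raw/2023-01-09", "transformed/2023-01-09"));
        System.out.println("Transformed: " + day1.get() + " + " + day2.get() + " records");
        pool.shutdown();
    }
}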

Real-time streaming pattern:
The real-time streaming pattern suggests introducing an optimum number of event processing nodes to consume different input data from the various data sources and introducing listeners to process the generated events (from event processing nodes) in the event processing engine:
Event processing engines (event processors) have a sizeable in-memory capacity, and the event processors are triggered by specific events. The trigger or alert is responsible for publishing the results of the in-memory big data analytics to the enterprise business process engines, and these results are, in turn, redirected to various publishing channels (mobile, CIO dashboards, and so on).
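The following plain Java sketch illustrates the idea: an in-memory event processor consumes events from the data sources and triggers the listeners registered for each event type, which then publish alerts to downstream channels such as dashboards. All class and event names are hypothetical.

import java.util.*;
import java.util.concurrent.*;
import java.util.function.Consumer;

// A minimal sketch of the real-time streaming pattern: events in, listeners triggered per type.
public class RealTimeStreamingSketch {

    record Event(String type, String payload) {}

    // The event processing node: holds events in memory and triggers listeners per event type.
    static class EventProcessor {
        private final Map<String, List<Consumer<Event>>> listeners = new ConcurrentHashMap<>();
        private final BlockingQueue<Event> inbox = new LinkedBlockingQueue<>();

        void addListener(String eventType, Consumer<Event> listener) {
            listeners.computeIfAbsent(eventType, k -> new CopyOnWriteArrayList<>()).add(listener);
        }

        void submit(Event e) { inbox.add(e); }

        // Drain the in-memory queue and fire the listeners registered for each event type.
        void processPending() {
            Event e;
            while ((e = inbox.poll()) != null) {
                for (Consumer<Event> l : listeners.getOrDefault(e.type(), List.of())) {
                    l.accept(e);
                }
            }
        }
    }

    public static void main(String[] args) {
        EventProcessor processor = new EventProcessor();

        // Listeners acting as triggers/alerts: publish analytics results to downstream channels.
        processor.addListener("fraud", e -> System.out.println("Dashboard alert: " + e.payload()));
        processor.addListener("click", e -> System.out.println("Click counted: " + e.payload()));

        processor.submit(new Event("click", "user=42,page=/home"));
        processor.submit(new Event("fraud", "card=****1234,score=0.97"));
        processor.processPending();
    }
}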
