Introduction
According to Wikipedia,
A pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.
Data pipelines are today an essential part of the Big Data world. Some of the challenges of managing large data stores have carried over from traditional data warehousing, but the big data world, with its streaming data architectures and near real-time analytics and visualization capabilities, has given rise to some interesting data pipeline implementations. A lot of thought has gone into them, and we try to summarize it in this article.
Data Pipelines - History and Origins
The concept of a pipeline began with the good old Unix "pipe" symbol (|). What was the symbol used for? Sending the output of one command to another on the command line. Effectively, the output of one "process" (on the left side of the pipe) is given as "input" to another process (on the right side of the pipe). The underlying concept is that of a pipe and filter.
“Input (Source) » Pipe » Filter » Pipe » Filter » Output (Sink)”
- Pipes are connectors which send data from one component (filter) to another.
- Filters do the actual data "processing" (transformation/cleansing/scrubbing/munging, and so on).
- Input, or source, is the actual data source (database output/text file/SQL result set/raw text).
- Output, or sink, is the final output at the end of this chain.
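As a rough sketch of the pipe-and-filter idea (not tied to any particular shell or framework), the Python snippet below chains a source, two filters and a sink using generators; the file name and stage functions are made up for the example.

```python
# Pipe-and-filter sketch: each stage consumes an iterator and yields to the next.

def source(path):
    """Source: read raw lines from a text file (hypothetical input file)."""
    with open(path) as fh:
        for line in fh:
            yield line.rstrip("\n")

def non_empty(lines):
    """Filter: drop blank lines."""
    return (line for line in lines if line.strip())

def upper_case(lines):
    """Filter: transform each record."""
    return (line.upper() for line in lines)

def sink(lines):
    """Sink: final output at the end of the chain."""
    for line in lines:
        print(line)

# Roughly the equivalent of: cat input.txt | grep -v '^$' | tr a-z A-Z
sink(upper_case(non_empty(source("input.txt"))))
```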
However, this simple architecture has its drawbacks. How do you handle data or memory overflow errors? What if a filter process dies midway? You can see the room for errors here.
Data Pipelines in the Big Data World
The world has moved on from there, and with the rise of "big data", developers now talk in terms of data pipelines. We often hear terms like data pipeline, analytics pipeline, process pipeline and, nowadays, "big data" pipeline. How do we differentiate between them? Let us begin with the basic concept of a data pipeline.
A data pipeline is made up of intermediate tasks, each of which encapsulates a process. Data pipelines open up the possibility of creating "workflows" which help reuse, modularize and componentize data flows. As complexity increases, tasks can be made to run in series or in parallel as part of the workflow.
In a nutshell, data pipelines perform these operations:
- Store schema and access information for the different data sources
- Extract discrete data elements from the source data
- Correct errors in the data elements extracted from the source data
- Standardize the data in each data element based on its field type
- Transform data using a rules-driven framework
- Copy the extracted, standardized and corrected data from the source data store to the destination data store
- Join or merge (in a rules-driven way) with other data sources
- Move and merge this data (on an on-demand basis) and save it to a storage system typically called a data lake
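To make a few of these operations concrete, here is a small, framework-free Python sketch of rules-driven standardization by field type followed by a merge with a second source; the field names, rules and sample records are invented for illustration.

```python
from datetime import datetime

# Rules table: standardization keyed by field type (illustrative only).
RULES = {
    "email": lambda v: v.strip().lower(),
    "date":  lambda v: datetime.strptime(v, "%d/%m/%Y").date().isoformat(),
    "name":  lambda v: v.strip().title(),
}

SCHEMA = {"email": "email", "signup": "date", "name": "name"}  # field -> field type

def standardize(record):
    """Apply the type-specific rule to each known field."""
    return {k: RULES[SCHEMA[k]](v) if k in SCHEMA else v for k, v in record.items()}

def merge(source_a, source_b, key):
    """Rules-driven join: merge records from two sources on a shared key."""
    index = {r[key]: r for r in source_b}
    for rec in source_a:
        yield {**rec, **index.get(rec[key], {})}

crm = [{"email": " Alice@Example.COM ", "signup": "01/02/2021", "name": "alice smith"}]
billing = [{"email": "alice@example.com", "plan": "pro"}]

for row in merge((standardize(r) for r in crm), billing, key="email"):
    print(row)  # a "copy" step would write this merged record to the destination store
```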
Data Pipeline Components
A data pipeline would typically include:
Serialization Frameworks
There are plenty of serialization frameworks in the open-source world, like Protocol Buffers, Avro, and Thrift. Systems (in this case, data pipelines) need to be able to serialize and deserialize data when sending it from one data source to a destination data source, and the systems on both ends should agree on a consistent data packing and unpacking mechanism.
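As an illustration, here is a hedged sketch of schema-based packing and unpacking using the third-party fastavro library (Avro, Protocol Buffers or Thrift would all play the same role); the schema and records are invented.

```python
# Sketch of schema-based serialization/deserialization, assuming the
# fastavro library is installed (pip install fastavro).
import io
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Event", "type": "record",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "payload", "type": "string"},
    ],
})

records = [{"id": 1, "payload": "hello"}, {"id": 2, "payload": "world"}]

# Producer side: pack records into a compact binary container.
buf = io.BytesIO()
writer(buf, schema, records)

# Consumer side: unpack them with the same (or a compatible) schema.
buf.seek(0)
for rec in reader(buf):
    print(rec)
```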
Message Bus
Message buses, which are typically rules-based routing systems, are the actual workhorses of the data pipeline, helping move chunks of data (sometimes at blazing speeds) from one system to another. The bus can understand multiple protocols and serialization mechanisms and intelligently route data between systems using a rules engine.
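The snippet below is a toy, in-memory stand-in for such a rules-based bus: a small rules table maps predicates on a message to destination handlers. Real buses do this with topics, exchanges and routing keys; all names here are illustrative.

```python
# Toy rules-based router: (predicate, handler) pairs stand in for a message bus.

def to_orders(msg):
    print("orders system <-", msg)

def to_audit(msg):
    print("audit log <-", msg)

ROUTING_RULES = [
    (lambda m: m.get("type") == "order", to_orders),
    (lambda m: m.get("amount", 0) > 1000, to_audit),   # large events also go to audit
]

def publish(message):
    """Fan a message out to every destination whose rule matches."""
    for rule, handler in ROUTING_RULES:
        if rule(message):
            handler(message)

publish({"type": "order", "amount": 2500, "sku": "ABC-1"})
```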
Event Processing Frameworks
Data pipelines need event processing frameworks to detect trigger events and generate the corresponding data. This data, which is then routed to other systems through a rules-driven framework (typically inside a message bus), is identified by the events these "event frameworks" generate.
Workflow Management Tools
These are different from the rules-based routing frameworks that work inside a message bus. Typically these are "orchestrating" or "choreographing" systems which supervise the processes that run inside your data pipelines.
Storage Layer
These are file systems or data storage systems (called the persistence layer), which allow data to be saved. The data could be entering the storage layer in a stream form or a batch form.
Query Layer
This is the layer where queries are made on the storage layer.
Storage layers nowadays typically support polyglot persistence. Polyglot persistence refers to a persistence architecture which uses multiple database instances, where the instances can be of different types. There is flexibility in using a mix of languages to query the databases and merge the results (using MapReduce, etc.) to obtain the data expected from the query output.
Tools which allow this kind of querying include Apache Hive, Spark SQL, Amazon Redshift and Presto, among others.
Analytics Layer
This is the layer where actual analytical models are built from the data obtained from the query layer. In the analytics layer, the parameters/variables for a predictive model are extracted and tuned. These variables and their parameters may change as the model evolves with newly ingested data. It is in the analytics layer that algorithms like k-means, random forests and decision trees are used to create machine learning models.
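For instance, a toy analytics-layer step might cluster rows returned by the query layer with k-means, assuming scikit-learn is available; the data below is invented.

```python
# Toy analytics-layer step, assuming scikit-learn (pip install scikit-learn):
# cluster rows pulled from the query layer with k-means.
import numpy as np
from sklearn.cluster import KMeans

# Pretend these rows came back from the query layer, already feature-engineered.
X = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 8.2], [7.9, 8.1], [0.9, 2.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(model.labels_)           # cluster assignment per row
print(model.cluster_centers_)  # the tuned "parameters" of this simple model
```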
What are Stream Operators, Data Nodes, Scheduling Pipelines, and Actions in a Data Pipeline?
Before we understand the concept of stream operators, we must understand field names, field values and records. A record is an immutable data structure containing fields; each field has a name and a value. A data pipeline consists of data readers and data writers. A data reader streams record data into the pipeline by extracting data from the source data location. Data writers stream record data out of the pipeline and into the target data locations. Stream operators use filters to select the data which needs to be transformed, altered or enriched while flowing inside a pipeline. Stream operators can implement the decorator pattern on top of readers and writers to provide them with such filtering capabilities.
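A minimal sketch of these roles in plain Python, with a stream operator decorating a reader to filter and enrich records in flight; the class and field names are invented for illustration.

```python
# Records, readers, writers and a stream operator sketched as plain Python classes.

class ListReader:
    """Data reader: streams records out of a source location (here, a list)."""
    def __init__(self, records):
        self._records = records
    def read(self):
        yield from self._records

class PrintWriter:
    """Data writer: streams records out of the pipeline into a target (here, stdout)."""
    def write(self, records):
        for rec in records:
            print(rec)

class FilteringOperator:
    """Stream operator: decorates a reader, filtering/enriching records in flight."""
    def __init__(self, reader, predicate):
        self._reader, self._predicate = reader, predicate
    def read(self):
        for rec in self._reader.read():
            if self._predicate(rec):
                yield {**rec, "checked": True}   # example enrichment

source = ListReader([{"user": "a", "value": 10}, {"user": "b", "value": -1}])
operator = FilteringOperator(source, predicate=lambda r: r["value"] >= 0)
PrintWriter().write(operator.read())
```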
A data node is typically a data location storing a certain type of data. The pipeline uses a data node as an input (source) or output (target) data location.
Pipelines are logical software components or artifacts which can be created, viewed, edited, cloned, tagged, deactivated or deleted on an on-demand basis. Scheduling a pipeline means creating a scheduled event for it or scheduling the creation of a pipeline.
Activities in a data pipeline are the units of work defined to be performed as part of the pipeline. These could include data transfer activities, execution of custom scripts, or querying and transformation of data inside the data pipeline.
A data pipeline could have preconditions, which are conditional statements that determine whether an activity should be executed or not. These conditions could be scripted or defined using a rules engine.
Actions in a data pipeline are steps to be taken after a successful or unsuccessful execution of an activity. Typically this means sending notifications or messages to the recipients who need to know that an activity succeeded, failed or hit an exception; the job of an action is to alert the intended parties about the final completion status. Actions could in turn trigger further activities defined as part of the data pipeline.
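A hedged sketch of an activity guarded by a precondition and wired to success/failure actions might look like this (all function names and the file path are invented):

```python
# Sketch of an activity with a precondition and success/failure actions.

import os

def notify(status, detail):
    """Action: stand-in for an email/Slack/SNS notification."""
    print(f"[notify] activity finished with status={status}: {detail}")

def source_file_exists():
    """Precondition: only run the activity if the input is present."""
    return os.path.exists("incoming/data.csv")   # hypothetical path

def copy_activity():
    """The unit of work itself (data transfer, script, query, ...)."""
    print("copying incoming/data.csv to the data lake...")

def run_activity(activity, precondition, on_success, on_failure):
    if not precondition():
        on_failure("skipped", "precondition not met")
        return
    try:
        activity()
        on_success("success", activity.__name__)
    except Exception as exc:   # unsuccessful execution triggers the failure action
        on_failure("failed", str(exc))

run_activity(copy_activity, source_file_exists,
             on_success=notify, on_failure=notify)
```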
What are the Data Pipeline Architectural Patterns?
A data pipeline is an architectural pattern which defines the software components in big data systems through which data flows, in a combination of stages that includes data acquisition, processing, transformation, storage, querying and analytics.
Some of the architectural patterns which have become popular over time are:
Lambda Architecture
Lambda architecture is a data processing architecture which takes advantage of both batch and stream processing methods to provide comprehensive and accurate views. The Lambda architecture splits the data stream in two: one stream goes for batch processing, in the batch layer, and the other for real-time stream processing, in the speed layer. The batch layer holds massive amounts of data and can therefore provide accurate insights when generating views, since it contains "all" data over eternity. The speed layer contains only the most recent data, which provides inputs to the "view layer", also called the serving layer. Since the speed layer has only the most recent data, its accuracy is lower, but this is an architectural trade-off where you sacrifice accuracy for speed. The serving layer contains a "joined" output of the batch layer and the speed layer and responds to ad-hoc queries. This layer uses fast NoSQL databases for generating quick pre-computed or on-demand computed views.
The downside of a Lambda architecture is the need to maintain two different layers of data - the speed layer and the batch layer - which can increase infrastructure costs and development complexity, since both layers need to be "joined" for the "serving layer" to provide useful outputs.
A typical example of the batch layer is Apache Hadoop.
Examples for the speed layer are Apache Spark, Storm, Kafka and SQLStream.
Examples of the serving layer for speed layer output are Elasticsearch (the ELK stack), Apache HBase, Cassandra and MongoDB.
Examples of the serving layer for batch layer output are Apache Impala or Hive, ElephantDB.
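As a rough, technology-agnostic sketch of the serving layer's job, the snippet below merges a complete but stale batch view with a fresh but partial speed view at query time; the data is invented.

```python
# Serving-layer sketch: merge the batch view (complete, recomputed periodically)
# with the speed view (only events since the last batch run).

batch_view = {"page/home": 10_000, "page/about": 1_200}      # from the batch layer
speed_view = {"page/home": 37, "page/pricing": 5}            # from the speed layer

def query(page):
    """Ad-hoc query answered by joining both views."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page/home"))     # 10037: "eternity" plus the most recent data
print(query("page/pricing"))  # 5: only the speed layer has seen this page so far
```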
Kappa Architecture
The lambda architecture had the challenge of maintaining two different storage layers - the speed and batch layers. The kappa architecture solves this by keeping only a single layer, the "speed layer", on which all processing is done. This architecture typically needs very quick data stream "replay" and "processing" capability. Results of this processing are then kept in the "serving layer", which gets its feed from only one layer - the speed layer. The architecture is simpler in that there is only one code base or architectural silo to maintain. However, the Kappa architecture has its own set of challenges. One of them is that event data may arrive out of order, and "replaying" data for "re-querying" adds cost and complexity. Finding duplicate events, or events which "cross-reference" each other because they are part of a larger transaction or "workflow", makes processing a stream more complex.
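A toy illustration of the Kappa idea: all data lives in a replayable log, and changing the processing logic means replaying the log through the new code instead of maintaining a separate batch code base. The log and functions below are invented.

```python
# Kappa-style reprocessing: one immutable log, one (replaceable) stream processor.

event_log = [  # stand-in for a replayable log such as a Kafka topic
    {"user": "a", "amount": 10},
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
]

def process_v1(events):
    """Original logic: total amount per user."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + e["amount"]
    return view

def process_v2(events):
    """New logic: count of events per user. Deploying it just means replaying the log."""
    view = {}
    for e in events:
        view[e["user"]] = view.get(e["user"], 0) + 1
    return view

serving_view = process_v1(event_log)   # current serving view
serving_view = process_v2(event_log)   # rebuilt by replaying the same log
print(serving_view)
```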
Polyglot Persistence
While this may not look like an "architectural pattern" to the purists, keeping different types of persistence systems (SQL, NoSQL and NewSQL based) can help solve data ingestion, processing and pipelining problems. Since different databases are designed to solve different classes of "data problems", no single "database system" can solve upfront all the problems a "big data" platform can have. In most cases, the architecture evolves as the business finds new types of data to capture and engineers figure out new ways of capturing it technically. One effective "quick fix" is to use polyglot persistence to capture different kinds of data and simply "save" it for future usage. Systems which do the actual data processing and provide business insights can be built on top of this polyglot persistence layer at a later stage.
How Does a Data Pipeline Handle Errors?
Create an Error Pipeline
Creating an error pipeline and joining it with the parent pipeline is one way of handling errors. "Exceptions", "errors" or "filters" can be used to identify events or data in the pipeline which should be marked as exceptions; these can then be routed to the error pipeline. Care should be taken that parent pipelines, after execution, also clean up the error pipelines once the errors have been addressed appropriately.
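A minimal sketch of this pattern in Python, where records that fail the parent pipeline's transform are routed to an error pipeline (here just a list standing in for a dedicated queue or topic):

```python
# Sketch of a parent pipeline routing bad records to an error pipeline.

error_pipeline = []   # stand-in for a dedicated error queue or topic

def transform(record):
    """The parent pipeline's work; raises on malformed input."""
    return {"user": record["user"], "amount": float(record["amount"])}

def run(records):
    clean = []
    for rec in records:
        try:
            clean.append(transform(rec))
        except (KeyError, ValueError) as exc:
            # Route the offending record plus the error to the error pipeline.
            error_pipeline.append({"record": rec, "error": str(exc)})
    return clean

good = run([{"user": "a", "amount": "10.5"}, {"user": "b", "amount": "oops"}])
print(good)             # successfully processed records
print(error_pipeline)   # to be drained and cleaned up once the errors are addressed
```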
Conditional Timeouts and Troubleshooting
When pipelines are doing the grunt work, their status can be monitored for changes. A timeout mechanism, whereby pipelines are monitored to see whether they finished their jobs/tasks successfully or not, can be used to determine the final "status". Their execution details and summary can be used to identify all "errors" during job/task execution, and appropriate actions or events can be fired to take care of the error(s) or error data. Depending on the error generated, a whole pipeline can be failed by design.
Error Views
In either case, whether you have a dedicated error pipeline or handle errors generated in a pipeline while cleaning it up, it is advantageous to have an "error view". These views can contain the original event documents, the exception or error message, the event ID (if the document was generated as part of a larger event lifecycle), the saga ID, parent process ID, timestamp, source system ID and any other metadata which could help identify the event source.
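One possible shape for such an error-view record, sketched as a Python dataclass with the metadata fields listed above (the field names are only suggestions):

```python
# One possible shape for an "error view" record, capturing the metadata above.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ErrorView:
    original_event: dict            # the original event document
    error_message: str              # the exception or error message
    event_id: str                   # ID within the larger event lifecycle
    saga_id: str                    # ID of the enclosing saga, if any
    parent_process_id: str
    source_system_id: str
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

view = ErrorView(
    original_event={"user": "b", "amount": "oops"},
    error_message="could not convert string to float: 'oops'",
    event_id="evt-42", saga_id="saga-7",
    parent_process_id="proc-1", source_system_id="crm",
)
print(view)
```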
One common design error is pushing massive amounts of data through a single pipeline, leading to larger volumes of error data which is challenging to process and troubleshoot. Keeping data pipelines that handle smaller, "manageable" tasks has often been found to be a more maintainable architecture.
What are the Typical Challenges Implementing a Data Pipeline or Choosing a Data Pipeline Framework?
- A distributed stream processing framework needs performant "in-memory computation" capabilities to run rules and filters on the data. The stream should also be immutable so as not to corrupt the source data.
- Data pipeline frameworks have storage complexities for large data sets. This means performant (disk) I/O capabilities, especially in a distributed computing environment.
- Stream processing may need both synchronous and asynchronous processing capabilities, since data may arrive out of order or may have complex rules-based filtering/processing criteria.
- A data pipeline should have the capability to process data on a schedule or on demand.
- Streaming data comes from multiple sources and can get routed to multiple targets. Data pipeline frameworks should have resilient pub/sub models for complex data routing requirements.
- Since the data itself may have rules for processing and persisting it in series or in parallel, frameworks should have the capability of processing in batch, in series or in parallel.
- Any data pipeline framework should allow custom or even complex processing of data. It should support rules engines or filtering rules, which may have more complex "state management" needs for processing data.
- Data pipelines should be performant whether the needs are compute-intensive or data-intensive.
Are There Any Architectural Differences Between Compute-Intensive vs Data-Intensive Data Pipelines?
The data tunneling needs of compute-intensive vs data-intensive pipelines are being researched by different groups. Typical solutions are:
- Using multi-core architectures to scale up computationally
- Reducing object sharing and increasing parallel execution for high-speed processing of streaming data
- Reducing disk I/O and increasing in-memory processing
- Using network accelerators for high-speed data transfer
Which (Open Source) Frameworks Are Used to Implement Data Pipelines?
There is a flood of open-source tools in the market, but we only have room to cover the behemoths among them, listed below.
Apache Hadoop
The granddaddy of all data processing frameworks, Hadoop is a collection of open-source projects which together help solve problems involving massive amounts of data. It uses the MapReduce programming model to rapidly compute queries over data stored on multiple systems by running the queries in parallel and combining their results. Hadoop is built on top of HDFS (the Hadoop Distributed File System), which is the distributed storage part of Hadoop, and Hadoop MapReduce, which is the querying and large-scale data processing part.
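To give a flavour of the MapReduce model (independent of Hadoop's actual Java APIs), here is the classic word count expressed as in-process map, shuffle and reduce phases in Python; in a real cluster the map and reduce steps would run in parallel across nodes.

```python
# In-process illustration of the MapReduce model behind Hadoop: a word count.
from collections import defaultdict
from itertools import chain

documents = ["big data pipelines", "data pipelines move data"]

# Map phase: each mapper emits (key, 1) pairs, normally in parallel on many nodes.
def mapper(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for key, value in chain.from_iterable(mapper(d) for d in documents):
    grouped[key].append(value)

# Reduce phase: combine the values for each key (again, parallelizable per key).
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)   # {'big': 1, 'data': 3, 'pipelines': 2, 'move': 1}
```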
Apache Spark
Spark is a "cluster computing framework" which allows entire clusters to be programmed in parallel. Spark requires a cluster manager and a distributed storage system. At its core is Spark Core, which provides features like task scheduling and dispatching, and APIs based on the RDD (resilient distributed dataset) abstraction. RDDs were a response to the limitations of the MapReduce computing model that came with the Apache Hadoop project: by allowing repeated, database-style querying of data, the latency of RDD operations can be many times lower than that of the MapReduce computation model. RDDs are an abstraction which makes it convenient to work with distributed in-memory data. Spark also provides an interactive shell (REPL) which is great for development and helps data scientists quickly become productive with their models.
Further components of the Spark framework are:
- Spark Core - An API model based on the RDD abstraction which provides a functional, higher-order programming model to invoke "map", "filter" and "reduce" operations in parallel
- Spark SQL - A DSL (domain-specific language) which allows manipulation of DataFrames (crudely speaking, data result sets) for structured and semi-structured data
- Spark Streaming - A framework for performing streaming analytics
- Spark MLlib - A distributed machine learning framework considered faster than Vowpal Wabbit and Apache Mahout
- GraphX - A distributed graph processing framework built on top of Spark based on RDDs
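A minimal PySpark sketch showing both the RDD-style functional API and the DataFrame/Spark SQL style, assuming pyspark is installed and run locally (cluster configuration omitted):

```python
# Minimal PySpark sketch, assuming pyspark is installed (pip install pyspark)
# and a local run; cluster setup is omitted.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# RDD-style: functional map/filter/reduce over a distributed dataset.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.filter(lambda x: x % 2 == 1).map(lambda x: x * x).sum())  # 35

# DataFrame / Spark SQL style over the same engine.
df = spark.createDataFrame([("a", 10), ("b", 25)], ["user", "amount"])
df.filter(df.amount > 15).show()

spark.stop()
```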
Apache Storm
Apache Storm is a distributed stream processing framework written predominantly in Clojure. The framework allows both distributed and batch-style processing of streaming data. A Storm application treats a graph of data pipelines as a DAG (directed acyclic graph) topology: the edges are the streams and the vertices are the processing nodes, with spouts and bolts acting as the vertices. A topology is a network made of streams, spouts and bolts. A spout is the source of a data stream; its job is to convert data into tuples and emit them to bolts as needed. A stream is an unbounded sequence of tuples.
Apache Airflow
Apache Airflow was developed at Airbnb and later became a part of the Apache Software Foundation. The official definition describes Airflow as a platform to programmatically author, schedule, and monitor workflows. It is a workflow management system which inherently uses data pipelining mechanisms to intelligently route data and create workflows on top of them. As with Apache Storm, the entire workflow is treated as a DAG (directed acyclic graph). Since Airflow ships with a rich UI, it becomes easier to create, monitor, visualize and troubleshoot these workflow pipelines. On top of the visualization, there is rich CLI (command line interface) support for performing complex operations on the DAG.
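A hedged sketch of a two-task DAG, roughly following the Airflow 2.x Python API (details such as the schedule argument vary by version; the task logic is invented):

```python
# Sketch of an Airflow DAG with two tasks, roughly following the Airflow 2.x API.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source system")

def load():
    print("writing data to the warehouse")

with DAG(
    dag_id="example_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",     # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task       # dependencies form the DAG edges
```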
Apache Beam
Apache Beam is a programming model for defining, creating, executing, and monitoring data pipelines. It supports ETL (extract, transform, load), batch, and stream processing operations. Apache Beam is one implementation of the Google Dataflow model paper. Once you define a pipeline in Apache Beam, it can be executed by any of several distributed processing backends, such as Apache Apex, Flink, Spark, Samza, or Google Cloud Dataflow.
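A minimal word-count pipeline with the Beam Python SDK, which runs on the local DirectRunner by default and could be pointed at any of the backends above:

```python
# Minimal Apache Beam pipeline (Python SDK); pip install apache-beam.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["big data", "data pipelines", "data"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```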
Apache Flink
Apache Flink is an open-source distributed stream processing framework which executes dataflow programs in a data-parallel and pipelined manner. Its pipelined execution supports both bulk/batch processing and stream processing. It also provides additional features like event-time processing and state management. Flink does not provide its own storage system; instead, it provides source and sink connectors to systems like HDFS, Cassandra, Elasticsearch, Kafka, or even Amazon Kinesis.
Apache Tez
Apache Tez is a framework which allows a complex DAG of tasks to be run for processing data. It provides a faster execution engine than Hadoop's MapReduce for frameworks such as Hive and Pig. Tez provides APIs to define dataflows with flexible input-output runtime models, and dataflow decisions can be optimized by allowing DAGs to be changed at runtime. Compared to Spark, it may not have the advantage of in-memory dataset processing or immutable datasets (RDDs) to play around with, but its faster MapReduce-style execution allows it to be used with Apache Hadoop and YARN-based applications for speed and performance. A comparison of the performance metrics can be seen here.
Apache Samza
Another Apache open-source framework, Samza is a near real-time, asynchronous framework for distributed stream processing. A unique feature of Samza is its use of immutable streams: once a stream is received, Samza produces new immutable streams from it, so sharing a stream with other processing engines cannot alter or affect the original in any way. Samza works alongside Apache Kafka clusters (whose nodes are called brokers). As in a typical streaming application, Kafka holds topics to which producers "write" data and from which consumers "read" data. Samza is written in Scala and Java and was developed in conjunction with Apache Kafka; both were originally developed at LinkedIn before becoming Apache incubation projects.
Apache Kafka
Apache Kafka is an open-source stream processing framework. It has become a popular platform providing unified, high-throughput, low-latency stream processing, working on the principle of a massively scalable pub/sub message queue implemented as a distributed transaction log (also called a commit log). Kafka provides a Connect API to import and export data from other systems, and the Kafka Streams API, a stream processing library written in Java which allows development of "stateful" stream processing applications.
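A hedged producer/consumer sketch using the third-party kafka-python client (the broker address and topic name are placeholders for your own setup):

```python
# Producer/consumer sketch using the kafka-python client (pip install kafka-python);
# assumes a broker is reachable at localhost:9092 and the "events" topic exists.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "a", "action": "click"})
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # {'user': 'a', 'action': 'click'}
    break
```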
Can We Have API-Based Usage of Data Pipelines?
There are systems (mostly SaaS-based) like AWS Data Pipeline, Parse.ly, Openbridge, etc. which allow API-like usage of data pipelines. All you have to do is create a pipeline on the fly (declaratively or using some cloud configuration) and start ingesting and routing data. A sample usage is Amazon Redshift, which integrates with Amazon's Data Pipeline and SQS services to ingest petabytes of data from multiple sources. All you need is to create the pipeline (programmatically or declaratively) and then consume it like an API.
Enterprise Service Bus (ESB) vs a Data Pipeline - How Are They Different?
As is often said in programming circles, "buses don't fly in the cloud". When it comes to high-throughput data which can run into petabytes, ESBs don't scale up. While open-source frameworks like Kafka, Samza, Storm, Spark, and Flink are designed to handle high-speed streaming data, conventional ESBs like ActiveMQ, Mule, etc. may not be the right choice to handle such a deluge of streaming data; scaling up will be a challenge. This is true even though, functionally, both ESBs and data pipelines perform similar tasks: routing event data, allowing rules-based routing and transformation, and providing pub/sub models for publishing and subscribing to data.
How Do You Manage Security for Data Pipelines?
Most data pipelines physically reside in the cloud, so handling data pipeline security is akin to handling the security of any cloud-based infrastructure. This entails creating the right security groups, securing and encrypting communication between cloud instances, securing server ports, and having rigorous authorization and authentication mechanisms for the agents accessing the cloud infrastructure.
Using a secure transport layer (SSL/TLS) on the data pipeline connections is an additional way of securing data streaming between cloud instances.
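For example, a client connecting to a broker over TLS might be configured like the hedged kafka-python sketch below; the certificate paths and broker address are placeholders for your own setup.

```python
# Enabling TLS on a pipeline connection, again with the kafka-python client;
# the paths and broker address below are placeholders, not real defaults.
from kafka import KafkaProducer

secure_producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9093",
    security_protocol="SSL",            # encrypt traffic between client and broker
    ssl_cafile="/etc/pki/ca.pem",       # CA certificate used to verify the broker
    ssl_certfile="/etc/pki/client.pem", # client certificate (for mutual TLS)
    ssl_keyfile="/etc/pki/client.key",
)
```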
How Can Data Pipelines Be Debugged?
Debugging streaming data is a tough ask. Marking event data as "healthy" (if it was ingested successfully) or as an "error" (if it hit an exception) is one way of flagging the source data which is causing exceptions.
Marking "error" data and routing it through the right pipelines, or "logging" it so that it is accessible for debugging, is the right approach. Unlike throwing an exception, where the actual payload data may have to be reconstructed later, explicitly dumping the "unhealthy" data which failed in a pipeline (as described in the "Error Views" section above) is the right way to design your system. Capture everything when logging errors.
What Are the Different Data Storage and Modeling Solutions Available When It Comes to Polyglot Persistence in Big Data Solutions?
Common data storage solutions get classified in one of these categories of databases:
- In-memory datastores (HSQL, Redis)
- File system based databases (MySQL, Oracle)
- OLAP or OLTP/TDS (Oracle, Postgres, MySQL)
- Distributed file systems (HDFS etc)
- Data marts or warehouses or master data stores (as traditional warehousing solutions) (Oracle DW)
- ODS or operational data stores (different than traditional data warehouses)
- Data lakes (data from multiple data stores - SQL/NoSQL merged into one massive data lake) (Amazon or Azure data lakes)
Traditionally, OLAP databases have been used for "bulk" data in warehousing solutions, while OLTP databases are used for event-based or transactional processing on a real-time basis.