Data processing is critical to any system you build. Order processing in an eCommerce system, sensor data processing in IoT, or payment transaction processing in finance – all involve data processing in some shape or form. This can range from a simple movement of data from source to sink to complex data transformations. Most of these workloads can be classified as either:
- Batch data processing (cold path) or
- Streaming data processing (hot path)
While some systems work only on batch data and others only on streaming data, some systems need to handle both types of workloads. For example, PeerIslands’ 1Data solution synchronizes data between legacy and microservice databases such as MongoDB and must handle both: streaming data processing for ongoing data sync, and batch/cold data processing for one-time snapshots and for reprocessing data when accuracy needs or requirements change.
The data processing architectures that cater to these workloads have evolved over time. In this blog, I explore the data processing architecture patterns that handle both types of workloads: how each pattern fits different use-cases, the pros and cons of each approach, the frameworks and tools available for implementation, and how we have applied these ideas in our 1Data solution.
Data Processing Architectures
Before we delve into the architecture patterns, let’s look at what characterizes batch and streaming data processing.
Characteristics of batch data processing system:
- Able to handle large data volume – terabytes, petabytes, or more
- High latency is acceptable. Parallel processing is typically used to optimize the time taken
- Scheduled lazily to allow for more data to be processed with each run
- Accuracy/correctness over latency of processing
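To make the "parallel processing to optimize the time taken" point concrete, here is a minimal sketch of a batch job that partitions a full dataset and processes the partitions concurrently before combining the results. It is a toy illustration in plain Python (real batch systems shard work across machines, not just threads); `process_chunk` and the summed order amounts are hypothetical stand-ins for real business logic.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Toy batch transform: aggregate (sum) the amounts in one partition."""
    return sum(chunk)

def run_batch_job(records, workers=4):
    """Split the complete dataset into partitions, process them in parallel,
    and combine the partial results. Throughput matters more than latency:
    the job only starts once all the data is available."""
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(process_chunk, chunks)
    return sum(partials)

total = run_batch_job(list(range(1, 101)))  # 5050
```

Because every record is present before the job runs, the result is exact; the parallelism only shortens the wall-clock time of the run.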
Characteristics of streaming data processing system:
- Systems need to handle the high velocity of incoming data
- Real-time processing of data with low latency
- Scheduled eagerly to process data in real-time
- Support exactly-once processing
- Handle out-of-order and late-arriving data and process it as required
- Low latency and speed over the correctness of results
Two major data processing architecture patterns are usually used when you need to handle both of the above workload types:
- Lambda Architecture pattern proposed by Nathan Marz based on his experience working on distributed data processing systems at Backtype and Twitter
- Kappa Architecture pattern proposed by Jay Kreps from LinkedIn
We will look at how each of these evolved, how they work, their pros and cons, and how we can implement them.
Out-of-order data is a typical challenge with high-volume, high-velocity streaming data. Streaming data processing systems typically handle out-of-order or late-arriving data using techniques such as:
- Event-time windows (tumbling, sliding, session windows)
- Watermarks that track event-time progress
- Triggers and allowed lateness to control when, and how late, results are emitted
While these techniques mitigate many scenarios, they are not a foolproof solution for out-of-order and late-arriving data. In contrast, batch processing is highly accurate, since all the required data is available before processing starts.
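The first of those techniques, event-time windowing, can be sketched in a few lines: events are grouped by the timestamp they carry (event time), not by when they happen to arrive, so out-of-order arrivals still land in the correct window. This is a plain-Python illustration, not a real stream processor; the event tuples are hypothetical.

```python
from collections import defaultdict

def assign_tumbling_windows(events, window_size):
    """Group (event_time, value) pairs into fixed (tumbling) windows keyed
    by EVENT time, so the arrival order of the events does not matter."""
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# Events arrive out of order: arrival order != event-time order.
events = [(12, "a"), (3, "b"), (7, "c"), (11, "d")]
result = assign_tumbling_windows(events, window_size=10)
# result == {10: ["a", "d"], 0: ["b", "c"]}
```

The limitation the paragraph above describes shows up as soon as an event arrives after its window's results have already been emitted downstream; watermarks and allowed lateness only bound that problem, they do not eliminate it.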
Based on his experience working on distributed data processing systems at Backtype and Twitter, Nathan Marz, creator of Apache Storm, proposed the Lambda Architecture. This pattern provides low-latency results via a streaming layer, while a batch layer recomputes results to backfill any inaccuracies from stream processing and to account for out-of-order/late-arriving data. This improves the accuracy of the overall system.
This architecture pattern has been widely adopted across various domains. Yahoo, Netflix, and LinkedIn, among others, have used the Lambda architecture pattern for use-cases such as ad data processing, user clickthrough analysis, IoT sensor data analytics, and social media analysis.
A typical implementation of the Lambda architecture is presented below, though multiple toolset combinations can be used.
While the Lambda architecture pattern provides the low latency of a streaming data processing system and the accuracy of a batch data processing solution, it requires two codebases to be maintained, each likely written in a different language. Keeping the business logic in sync across toolsets and programming languages is challenging, and every bug fix or requirement change must be applied twice. In a Google-scale system with fast release cycles, this complexity multiplies.
Frameworks like Summingbird and Lambdoop let you write the business logic once in a higher-level abstraction and then generate runtime executables specific to the speed or batch layer. However, this is still neither efficient nor elegant.
Jay Kreps, co-creator of Apache Kafka, proposed the Kappa architecture to overcome the limitations of the Lambda architecture. The basic idea is to get rid of the batch layer and use stream processing for both historical/static data and live/streaming data. While using stream processing for data at rest may seem counterintuitive, the capabilities of present-day systems have evolved significantly and make this architecture pattern practical.
- Immutable append-only log: Data is stored in a system such as Kafka or Delta Lake – an immutable, append-only log of incoming data, with a retention period based on your requirements and data characteristics
- Speed: Use stream processing on the latest dataset and update the serving layer
- Accuracy: Only when required, start another stream-processing job in parallel that replays the entire dataset of interest, increasing the number of parallel processing instances to account for the full dataset size. The resulting dataset is more accurate and supersedes the one generated by the speed layer
- Switch the serving layer to point to the newer dataset once processing is complete
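The four steps above can be sketched with a minimal in-memory simulation, assuming a list stands in for the Kafka/Delta Lake log and a dict for a MongoDB-style serving layer. The speed path updates the serving view incrementally as events arrive; the accuracy path replays the entire retained log with the same logic and then switches the serving layer to the fresh view.

```python
log = []          # immutable append-only log (stands in for Kafka/Delta Lake)
serving = {}      # serving layer (stands in for MongoDB)

def ingest(event):
    """Append to the log, then take the speed path: update the serving
    view incrementally for low latency."""
    log.append(event)
    key, amount = event
    serving[key] = serving.get(key, 0) + amount

def reprocess():
    """Accuracy path: replay the ENTIRE retained log with the same business
    logic and return a fresh view that supersedes the speed-layer view."""
    fresh = {}
    for key, amount in log:
        fresh[key] = fresh.get(key, 0) + amount
    return fresh

for event in [("order-1", 10), ("order-2", 5), ("order-1", 7)]:
    ingest(event)

serving = reprocess()   # switch the serving layer to the recomputed view
# serving == {"order-1": 17, "order-2": 5}
```

The key property is that both paths run the same code; there is no second batch codebase to keep in sync, which is exactly the Lambda-architecture pain point Kappa removes.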
While there are a number of implementations of this Kappa architecture, the following represents a typical implementation.
What makes this possible:
- Kafka or Delta Lake-type systems that provide immutable, append-only storage and retain data for the desired period of time
- Stream processing systems such as Apache Spark, Apache Beam
- Serving layer such as MongoDB
Apache Beam – typical Kappa architecture implementation
Apache Beam is an open-source, unified model for defining both batch and streaming data-parallel processing pipelines. You can build the processing business logic in any of the supported programming languages – Java, Python, Go, and others. Once you build a data processing pipeline using the Apache Beam SDK, you can execute it on any of the supported runners – Apache Spark, Apache Flink, or managed cloud services such as Google Cloud Dataflow.
This processing model provides an ideal solution for keeping a common codebase for processing both streaming and batch data. Its support for multiple programming languages and pipeline runners bodes well for adoption in a microservices architecture: each team can choose the language and runner that best fit its requirements.
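The "one codebase for both workloads" idea can be illustrated without Beam itself: define the transform once as a generator function and feed it either a bounded, in-memory dataset (batch) or a lazily consumed stream (streaming). This is a plain-Python analogy for Beam's unified model, not Beam code; the doubling transform and `sensor_stream` source are hypothetical.

```python
def pipeline(source):
    """One transform definition applied to ANY source -- a bounded list
    (batch) or an unbounded generator (stream). The business logic is
    written exactly once."""
    for record in source:
        yield record * 2  # the shared business logic

# Batch: a bounded, in-memory dataset.
batch_result = list(pipeline([1, 2, 3]))           # [2, 4, 6]

# Streaming: a generator consumed lazily, record by record.
def sensor_stream():
    for reading in (10, 20):
        yield reading

stream_result = list(pipeline(sensor_stream()))    # [20, 40]
```

In Beam the same separation holds at production scale: the pipeline definition stays fixed while bounded and unbounded sources, and the underlying runner, are swapped underneath it.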
Our 1Data solution, which helps sync data between legacy and microservices databases, began with stream data processing built on Apache Kafka and Apache Spark. We are now looking at redesigning it with Apache Beam to implement a Kappa architecture. (For more details on 1Data, please read this blog.) Apache Beam gives us low-latency processing for data sync while handling out-of-order and late-arriving data. When requirements or code change, we can reuse the same code against the immutable dataset retained in Apache Kafka to reprocess the data. Apache Beam also gives us the flexibility to experiment with the choice of runner based on the cloud platform we deploy the solution to.
Each of the data processing architecture patterns – Lambda and Kappa – has significant applications across domains. With the proliferation of Big Data, data processing requirements are increasingly complex, and the technology ecosystem has seen significant advancements to keep pace with this fast-changing landscape.
Apache Beam arrived with its vision of writing the pipeline once, in any language, and running it on any supported tool; newer projects such as Apache Hudi, open-sourced by Uber Engineering, continue this evolution. I see the number of competing solutions and projects growing rapidly, each bringing its own set of features and solving a piece of the Big Data processing puzzle.
References
- Lambda Architecture » λ (lambda-architecture.net)
- Questioning the Lambda Architecture – O’Reilly
- Kappa Architecture – Where Every Thing Is A Stream (pathirage.org)
- Streaming 101: The world beyond batch – O’Reilly
- Streaming 102: The world beyond batch – O’Reilly
- Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing – Uber Engineering Blog
- Sergei Sokolenko, “Advances in Stream Analytics: Apache Beam and Google Cloud Dataflow”
- Building a Large-scale Transactional Data Lake at Uber Using Apache Hudi – Uber Engineering Blog
- From Lambda Architecture to Kappa Architecture Using Apache Beam – LinkedIn
- Reading Apache Beam Programming Guide — 1. Overview – Chengzhi Zhao, Medium
- Fundamentals of Stream Processing with Apache Beam – Tyler Akidau, Frances Perry