In the last article, we explored batch processing, a time-tested approach to handling large quantities of data by processing them in discrete, scheduled intervals. Batch processing has been a cornerstone of big data systems, serving as the foundation for countless applications and workflows. In this article, we will explore the fundamentals of stream processing, its advantages over batch processing, and its impact on system design.
Stream processing is a paradigm shift: instead of waiting for data to accumulate into large batches, it processes data in real time as it is generated. This approach allows for immediate analysis, decision-making, and response to events as they occur, making it particularly well suited for applications where real-time insights are critical. Examples of such applications include:
- fraud detection
- real-time log analytics
- IoT (Internet of Things) telemetry
- online game event streaming
- machine-state monitoring in industrial automation
- video processing
Key Concepts in Stream Processing
What is a Stream?
A stream refers to data that is incrementally made available over time. This dynamic nature of data streams differentiates them from static data sources, such as batch data, which are processed at discrete intervals. In the context of stream processing, the data is organized into an event stream, which represents an unbounded, incrementally processed counterpart to batch data. Each event in the stream carries information about a specific occurrence, which can be anything from a user's interaction with a website to sensor readings in an IoT system.
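To make the idea concrete, here is a hypothetical event record as it might appear in a user-activity stream; the field names are purely illustrative and not taken from any particular system.

```python
# A hypothetical event record: each element of a stream is a small, self-describing
# fact about something that happened, usually carrying its own timestamp.
event = {
    "event_type": "page_view",             # what happened
    "user_id": "user-42",                  # who or what it happened to
    "url": "/pricing",                     # event-specific payload
    "event_time": "2024-01-01T08:03:12Z",  # when it happened (event time)
}
```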
How are streams presented and transported?
To transport streams, various messaging systems and data streaming platforms are employed. Apache Kafka, Amazon Kinesis, and Google Cloud Pub/Sub are some of the popular platforms that facilitate stream transportation. These systems provide a distributed, fault-tolerant, and scalable infrastructure for transporting data streams efficiently across various components of a stream processing pipeline.
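As a rough illustration of how events move through such a platform, the sketch below uses the kafka-python client to publish and consume JSON events. The broker address and the topic name `events` are assumptions added for illustration, not details from the text.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer side: publish events to an assumed topic named "events".
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                       # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"event_type": "page_view", "user_id": "user-42", "ts": 1_700_000_000})
producer.flush()

# Consumer side: read the stream incrementally as new events arrive.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:  # blocks and yields events as they are produced
    print(message.value)
```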
Windowing
As streams are unbounded, it is often necessary to divide them into smaller, manageable chunks for processing. Windowing is a technique that segments an event stream into finite windows based on time, count, or session. Time-based windows can be further categorized into tumbling windows, sliding windows, and hopping windows. Windowing enables processing tasks like aggregation or pattern matching to be performed on these smaller, well-defined subsets of the stream.
- Tumbling Windows: Tumbling windows divide the event stream into non-overlapping, fixed-size time intervals. When a window closes, the next one starts immediately without any overlap. Tumbling windows are often used when you need to compute an aggregate value for each window independently, without considering events from other windows. Example: Assume you want to count the number of requests per minute for a web service. You can create tumbling windows of size 1 minute and, at the end of each minute, compute the request count within that window (a code sketch of this example follows the list).
- Sliding Windows: Sliding windows also divide the event stream into fixed-size time intervals but with a specified overlap between consecutive windows. Sliding windows are useful when you need to compute an aggregate value for each window while considering the events from the surrounding windows. Example: Suppose you want to calculate a moving average of stock prices for the last 5 minutes, updated every minute. You can create sliding windows of size 5 minutes with a 1-minute slide. This means that every minute, a new window will be created, and the average stock price will be calculated for that 5-minute window.
- Session Windows: Session windows are based on the activity pattern of the events rather than a fixed time interval. They are used to capture bursts of activity separated by periods of inactivity. Session windows are particularly useful when analyzing user behavior, as they can group events related to a single user session. Example: Imagine an e-commerce website where you want to analyze user interactions within a single shopping session. You can create session windows by grouping events from the same user that are close in time (e.g., within 30 minutes of each other) and separated by a period of inactivity (e.g., no events for at least 30 minutes).
- Global Windows: Global windows treat the entire stream as a single window. They are typically used in conjunction with triggers to determine when to emit the results of the computation. Global windows are useful when you need to process the entire dataset as a whole or when you want to compute a value based on all events in the stream. Example: Suppose you want to find the longest common prefix among all the incoming events in a stream. You can create a global window that accumulates all the events and apply a trigger to emit the result when a certain condition is met, such as the arrival of a special end-of-stream event.
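Below is a minimal, framework-free sketch of the tumbling-window example mentioned above: counting requests per one-minute window, keyed by each event's timestamp. The event shape and the specific timestamps are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Count requests per 1-minute tumbling window, keyed by event time.
WINDOW_SECONDS = 60

def window_start(epoch_seconds: float) -> int:
    """Align a timestamp to the start of its tumbling window."""
    return int(epoch_seconds // WINDOW_SECONDS) * WINDOW_SECONDS

# Hypothetical events carrying an epoch timestamp in seconds.
events = [
    {"user_id": "u1", "ts": 1_700_000_005.0},
    {"user_id": "u2", "ts": 1_700_000_042.0},
    {"user_id": "u3", "ts": 1_700_000_065.0},  # falls into the next window
]

counts = defaultdict(int)
for event in events:
    counts[window_start(event["ts"])] += 1

for start, count in sorted(counts.items()):
    label = datetime.fromtimestamp(start, tz=timezone.utc).strftime("%H:%M")
    print(f"window starting {label}: {count} request(s)")
```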
Event Time vs Processing Time
In a perfect world, events are emitted and processed at the same time. In reality, however, this is often not the case. Numerous factors can delay processing: network issues, delays in message delivery, a stream consumer restart, or reprocessing past events while recovering from an error or after fixing a code bug. It is therefore important to distinguish event time from processing time.
- Event time: Event time refers to the actual time at which an event occurs in the real world. This timestamp is usually embedded in the event data itself and is used to order events in a stream. Event time is essential for performing time-based analytics and detecting patterns within the stream, ensuring that the analysis is based on the natural order of events as they happened.
- Processing time: Processing time is the time at which an event is processed by the stream processing system. This time may differ from the event time due to factors such as network latency, system delays, or varying rates of incoming events. Processing time is crucial for measuring the performance of a stream processing system and ensuring that it can handle the incoming data rate.
Consider an example where a stream processor measures the rate of incoming requests. Suppose the processor goes offline for a brief period; when it comes back up, it must process the backlog of events along with new ones. If the rate is computed by processing time, the backlog shows up as an artificial spike in traffic, even though the actual request rate never changed.
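The sketch below illustrates that artifact with made-up numbers: the same backlog of events is bucketed once by processing time (everything lands in the current minute, producing a spike) and once by event time (the counts stay spread across the minutes in which the events actually occurred).

```python
import time
from collections import Counter

# Hypothetical backlog: one event per minute occurred while the consumer was offline,
# but all of them arrive (are processed) within the same minute after restart.
now = time.time()
backlog = [{"ts": now - 60 * i} for i in range(5, 0, -1)]  # event times spread over 5 minutes

by_processing_minute = Counter()
by_event_minute = Counter()

for event in backlog:
    by_processing_minute[int(time.time() // 60)] += 1  # all land in the current minute
    by_event_minute[int(event["ts"] // 60)] += 1        # spread across their original minutes

print("per-minute counts by processing time:", dict(by_processing_minute))  # one big spike
print("per-minute counts by event time:", dict(by_event_minute))            # steady 1/min
```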
Watermarking
Watermarking is a common technique for reconciling the gap between when events occur and when they are processed: it helps manage out-of-order events and late-arriving data, and it lets the system decide when the data for a given time period can be considered complete.
In stream processing, "watermarking" has little to do with the watermarks we associate with images. Here, a watermark is a piece of metadata that tracks the progress of event time through a real-time data stream.
In real-time data stream processing, data is usually processed according to the time the event occurred, known as event time. However, due to network delays, system failures, or the concurrent nature of data processing, data may not arrive at the processing system in the order of event time. This leads to a problem: how can the processing system know when it is safe to close a time window and perform aggregation calculations without missing any subsequently arriving delayed data?
Watermarking offers a solution by letting the system track the progress of event time in the stream: the watermark defines a point in time up to which all data is assumed to have arrived, so window aggregations up to that point can safely be performed. If an event arrives with an event time earlier than the current watermark, the system may choose to ignore it (or handle it separately), because the window it belongs to has already been closed.
In the diagram above, the windowed aggregation is emitted every five minutes, as triggered by the trigger mechanism. From the illustration, we can observe that at 08:15, the aggregation result for the event window spanning 08:00 to 08:05 is released. One might wonder why it is this particular event window and not the one before or after it. This timing is determined by the application of a watermark.
The logic for determining which event window to emit is as follows: take the maximum event time seen in the processed data and subtract the watermark delay from it; any event window that ends at or before the resulting time can be closed and emitted. This description may sound a bit convoluted, but it is easier to understand with an example.
In the given figure, the watermark is set to five minutes. At 08:15, the latest event time is 08:13. Subtracting the watermark duration gives us 08:08, which means the event window that ends before 08:08 is the one from 08:00 to 08:05.
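Here is a small sketch of that arithmetic, using the same numbers as the figure; the date and the variable names are assumptions added purely for illustration.

```python
from datetime import datetime, timedelta

# Watermark rule from the example above, with the figure's numbers.
WATERMARK_DELAY = timedelta(minutes=5)
WINDOW_SIZE = timedelta(minutes=5)

max_event_time = datetime(2024, 1, 1, 8, 13)  # latest event time seen so far (assumed date)
watermark = max_event_time - WATERMARK_DELAY  # 08:08

def window_is_closed(window_end: datetime) -> bool:
    """A window can be emitted once the watermark has passed its end."""
    return window_end <= watermark

window_start = datetime(2024, 1, 1, 8, 0)
window_end = window_start + WINDOW_SIZE       # 08:05
print(watermark.strftime("%H:%M"))            # 08:08
print(window_is_closed(window_end))           # True: the 08:00-08:05 window is emitted
```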
Stream Joins
Stream joins are operations that combine two or more event streams based on specified conditions, similar to relational database joins. These conditions can be based on time or key attributes of the events. For example, a time-based join might merge two streams based on the timestamps of their events, while a key-based join might combine streams based on a common identifier, such as a user ID or device ID. An example is how Google Ads joins impression and click streams using adID to calculate click-through rate (CTR).
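A minimal sketch of such a key-based, time-bounded join is shown below: click events are matched against recent impressions with the same ad ID. The event shapes, the 60-second join window, and the simple in-memory join state are all assumptions made for illustration.

```python
from collections import defaultdict

# Time-bounded, key-based stream join: match each click to an impression
# with the same ad_id that occurred within the previous JOIN_WINDOW seconds.
JOIN_WINDOW = 60  # seconds (assumed)

impressions_by_ad = defaultdict(list)  # join state: recent impressions keyed by ad_id
joined = []

def on_impression(event):
    impressions_by_ad[event["ad_id"]].append(event)

def on_click(click):
    for imp in impressions_by_ad.get(click["ad_id"], []):
        if 0 <= click["ts"] - imp["ts"] <= JOIN_WINDOW:
            joined.append({"ad_id": click["ad_id"],
                           "impression_ts": imp["ts"],
                           "click_ts": click["ts"]})
            break

on_impression({"ad_id": "ad-1", "ts": 100})
on_impression({"ad_id": "ad-2", "ts": 110})
on_click({"ad_id": "ad-1", "ts": 107})

print(joined)  # [{'ad_id': 'ad-1', 'impression_ts': 100, 'click_ts': 107}]
```

Counting joined click/impression pairs per ad within a window then gives the click-through rate.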
Data integration: streams and databases
Data integration plays a crucial role in stream processing, as it involves connecting streams to databases and other data storage systems. This connection allows for seamless data flow between the real-time processing layer and the underlying data storage layer. To facilitate this integration, data streaming platforms often provide connectors or APIs for various databases, such as Apache Cassandra, MySQL, or PostgreSQL. This integration enables stream processing applications to read and write data to and from databases in real time while maintaining consistency and reliability.
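As a rough sketch of what such an integration can look like, the following consumes events from a Kafka topic with kafka-python and writes them to PostgreSQL with psycopg2. The topic name, table schema, and connection settings are placeholders, not anything prescribed by the text.

```python
import json
import psycopg2
from kafka import KafkaConsumer

# Consume a stream of page-view events and persist each one to PostgreSQL.
consumer = KafkaConsumer(
    "page_views",                                 # assumed topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
conn = psycopg2.connect(host="localhost", dbname="analytics", user="app", password="secret")

with conn, conn.cursor() as cur:
    for message in consumer:
        event = message.value
        cur.execute(
            "INSERT INTO page_views (user_id, url, viewed_at) VALUES (%s, %s, to_timestamp(%s))",
            (event["user_id"], event["url"], event["ts"]),
        )
        conn.commit()  # commit per event for simplicity; real pipelines usually batch commits
```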
Stream Processing Patterns and Real-world Examples
| Stream Processing Pattern | Description | Examples |
|---|---|---|
| Real-time analytics | Continuous analysis of data to produce real-time metrics and KPIs | Monitoring user engagement on a website, tracking social media sentiment |
| Event-driven systems | Detection and processing of specific events in real time, triggering appropriate actions | Fraud detection in financial systems (for example, Stripe uses stream processing to filter payments); real-time notifications in social networks |
| Complex event processing (CEP) | Detection of patterns and correlations among multiple event streams to identify higher-level events | Detecting intrusions in network security systems, monitoring complex manufacturing processes |
| Data enrichment and transformation | Transformation and enrichment of data as it is generated, adding value to the raw data | Joining a stream of user events with a stream of product information |
| Temporal-based pattern detection | Identification of patterns within a specific time frame, useful for analyzing time-series data | Detecting user behavior patterns in a mobile app, monitoring cloud-based service performance |
| Stream processing for machine learning | Preprocessing data for machine learning models; enabling real-time prediction, anomaly detection, and continuous model adaptation | Preprocessing data for real-time prediction, updating machine learning models with new data (for example, generating a user's home timeline on Twitter) |
Popular Stream Processing Frameworks
Several popular stream processing frameworks have emerged in recent years, each with its own set of features and benefits. Here, we present a list of popular frameworks along with real-world company examples that utilize these frameworks for their data processing needs.
- Apache Kafka: As we discussed in detail in the log-based message queue section, Apache Kafka is a highly scalable, distributed, and fault-tolerant streaming platform that can handle millions of events per second. It is widely used for building real-time data pipelines and streaming applications.
- Real-world example: LinkedIn uses Apache Kafka for handling massive amounts of data, including user activity data, operational metrics, and application logs. Kafka allows LinkedIn to process these data streams in real-time for various purposes, such as analytics, monitoring, and recommendations.
- Apache Flink: Apache Flink is a distributed stream processing framework that provides high-throughput, low-latency, and exactly-once processing semantics. It is designed for stateful computations over unbounded and bounded data streams.
- Real-world example: Alibaba uses Apache Flink for various real-time data processing tasks, including real-time search indexing, real-time recommendation, and large-scale machine learning. Flink enables Alibaba to process billions of events per day with low latency and high reliability.
- Apache Samza: Apache Samza is a distributed stream processing framework that is designed to work with Apache Kafka. It offers strong durability, fault tolerance, and processing guarantees for stateful stream processing applications.
- Real-world example: Uber uses Apache Samza for various real-time data processing tasks, including fraud detection, dynamic pricing, and trip processing. Samza allows Uber to process millions of events per second with low latency and strong processing guarantees.
- Apache Storm: Apache Storm is a distributed real-time computation system that is designed for processing large volumes of high-velocity data. It is easy to set up and operate and offers strong guarantees for data processing.
- Real-world example: Twitter uses Apache Storm for processing large amounts of tweet data in real-time, enabling real-time analytics and trend detection. Storm allows Twitter to process millions of tweets per minute, ensuring timely insights and recommendations.
- Apache Pulsar: Apache Pulsar is a distributed pub-sub messaging platform with a flexible messaging model and strong durability guarantees. It is designed for high-performance streaming use cases and can be used for both real-time data processing and message queueing.
- Real-world example: Verizon Media uses Apache Pulsar for various real-time data processing tasks, including log processing, real-time analytics, and event-driven applications. Pulsar enables Verizon Media to handle billions of events per day with high performance and strong durability guarantees.