Interested in stream processing? Sign up for Flink Forward 2015, the first conference on Apache Flink™.
In recent blog posts, we introduced what we deem as requirements for systems to classify as stream processors, and followed up with a detailed comparison of current approaches to data streaming, including extensive experiments comparing Apache Flink™ and Apache Storm.
We are not the only ones to make the point that streaming systems are reaching a point of maturity which makes older batch systems and architectures look less compelling. But is batch dead already? Not quite, it’s just that streaming is a proper superset of batch, and batch can be served equally well, if not better, with a new breed of modern streaming engines:
In this post, we dive deeper on how, in principle, batch analytics is just a special case of streaming analytics, we look concretely on how Apache Flink™ implements this vision.
The principles: Bounding unbounded data with windows
We will follow the excellent terminology from Tyler Akidau’s blog post (and largely build on his thoughts). We use the term unbounded data for an infinite, ever-growing data stream, and the term bounded data for a data stream that happens to have a beginning and an end (data ingestion stops after a while). It is clear that the notion of an unbounded data stream includes (is a superset of) the notion of a bounded data set:
Streaming applications create bounded data from unbounded data using windows, i.e., creating bounds using some characteristic of the data, most prominently based on timestamps of events. For example, one can choose to create a window of all records that belong to the same session (a session being defined of a period of activity followed by a period of inactivity). The simplest form of a window is (when we know that the input is bounded), to include all the data in one window. Let’s call this a “global window”. This way, we have created a streaming program that does “batch processing”:
The efficiency: Pipelined and batch processing with Flink
Early streaming systems suffered from efficiency problems due to design choices that sacrificed throughput, in particular, record-by-record event processing and acknowledgement. This led to a belief that streaming systems can only “complement” batch systems, or that hybrids of streaming and batching (“micro-batching”) are required for efficiency.
This is, however, no longer true. For example, Flink’s network stack is based on a hybrid runtime that supports pipelined processing, similar to MPP databases, as well as batch processing if needed. This style of processing can support the full spectrum of data exchange, from pure record-by-record shipping to pure batch processing. Flink accumulates records in buffers, and ships these buffers over the network when they are full. This style of processing can emulate both record-by-record processing (regard a buffer as “full” when it has one record), pure batch processing (retain all buffers in memory and disk until the result has been fully materialized), and, interestingly, a sweet spot in the middle, which is the default behavior of Flink and can achieve very high throughput.
This leads us to establish that pipelining is a proper superset of pure batching:
Batch systems are streaming systems in the closet
The little secret of batch processors is that they always include a hidden streaming component. When a batch processor reads a file, it streams the file to the first operator. If the first few operators are record-by-record transformations such as filters, mappers, etc, then the data is still streamed through the operator code (often by “chaining” operators together). However, at the first blocking operator (e.g., sort, aggregation, etc) a batch processor blocks the output until all the input of the operator has been produced.
Of course, blocking operators are blocking by nature, so their input needs indeed to be materialized before they start processing it. For example, a sorting operator needs the whole data to be present before it can return a fully sorted result. However, there is no fundamental reason that blocking should be part of the system and not the operator itself. Flink follows the philosophy of streaming end-to-end, by embedding blocking operators within streaming tasks. The system does not differentiate between a blocking operator and a streaming operator, this is part of the operator logic itself:
We have seen that it is possible to embed blocking operators into streaming topologies, and thus, streaming topologies strictly subsume classic batch processing DAGs:
Overall, the result of all these observations brings us to the thesis we started from, namely that batch is a special case of streaming:
Lambda and Kappa architectures, and batch-specific optimizations
If batch is a special case of streaming, can it be that pure batch applications are better served by the special case? Partly, but this can be fully rectified by implementing batch-specific optimizations on top of a stream processor. Let us look at how Flink in fact implements a hybrid architecture, incorporating optimizations that are specific to batch workloads in the framework.
As older streaming systems lacked support for high throughput and consistent results, the so-called Lambda architecture gained a lot of popularity. The Lambda architecture advocates using a batch system for the “heavy lifting”, augmenting it with a streaming system that “catches up” with data ingestion producing early, but maybe incomplete, results. Then, separate logic tries to “merge” the produced results for serving.
The Lambda architecture had well-known disadvantages, in particular that the merging process was often painful, as was the fact that two separate codebases that express the same logic need to be maintained.
Later, Jay Kreps advocated that only one system, the stream processor, should be used for the entirety of data transformations, drastically simplifying the whole architecture:
Flink is a stream processor that embeds within the streaming topology all kinds of processing: stateful operators, blocking operators, iterations, windowing, as well as the simple stateless operators. Thus, it implements exactly this vision.
However, practice is always a bit more messy. We have seen that batch is a special case of streaming. As batch processors only need to support this special case, they are often in a position to make optimizations and take shortcuts in their internal design that the more general stream processing are not. In particular:
- Batch programs do not need coordination to recover from failures. Instead, failed partitions can simply be restarted (because the input is finite).
- Scheduling of batch jobs can be done in stages instead of bringing the whole topology live at once.
- Query optimization techniques that estimate sizes of intermediate data sets can often be employed to reduce (often drastically) the runtime of the program.
- Operators (such as joins) that can assume that their input is finite can use more efficient internal data structures
Flink internally implements all of these optimizations, resulting in two different code paths for batch and streaming that meet at the same endpoint, the system’s runtime:
The green boxes correspond to the streaming code paths, and the yellow boxes correspond to the batch code path. The foundation of all is a stream processing engine that executes both stream and batch programs. The end result is a system that natively supports stream processing, and treats batch as a special case of streaming by layering batch-specific optimizations on top of the streaming engine.