Batch is a special case of streaming

September 15, 2015 | by Kostas Tzoumas

Interested in stream processing? Sign up for Flink Forward 2015, the first conference on Apache Flink™.

In recent blog posts, we introduced what we deem as requirements for systems to classify as stream processors, and followed up with a detailed comparison of current approaches to data streaming, including extensive experiments comparing Apache Flink™ and Apache Storm.

We are not the only ones to make the point that streaming systems are reaching a point of maturity which makes older batch systems and architectures look less compelling. But is batch dead already? Not quite, it’s just that streaming is a proper superset of batch, and batch can be served equally well, if not better, with a new breed of modern streaming engines:

streaming-venn

In this post, we dive deeper on how, in principle, batch analytics is just a special case of streaming analytics, we look concretely on how Apache Flink™ implements this vision.

The principles: Bounding unbounded data with windows

We will follow the excellent terminology from Tyler Akidau’s blog post (and largely build on his thoughts). We use the term unbounded data for an infinite, ever-growing data stream, and the term bounded data for a data stream that happens to have a beginning and an end (data ingestion stops after a while). It is clear that the notion of an unbounded data stream includes (is a superset of) the notion of a bounded data set:

unbounded-venn

Streaming applications create bounded data from unbounded data using windows, i.e., creating bounds using some characteristic of the data, most prominently based on timestamps of events. For example, one can choose to create a window of all records that belong to the same session (a session being defined of a period of activity followed by a period of inactivity). The simplest form of a window is (when we know that the input is bounded), to include all the data in one window. Let’s call this a “global window”. This way, we have created a streaming program that does “batch processing”:

windows-venn


The efficiency: Pipelined and batch processing with Flink

Early streaming systems suffered from efficiency problems due to design choices that sacrificed throughput, in particular, record-by-record event processing and acknowledgement. This led to a belief that streaming systems can only “complement” batch systems, or that hybrids of streaming and batching (“micro-batching”) are required for efficiency.

This is, however, no longer true. For example, Flink’s network stack is based on a hybrid runtime that supports pipelined processing, similar to MPP databases, as well as batch processing if needed. This style of processing can support the full spectrum of data exchange, from pure record-by-record shipping to pure batch processing. Flink accumulates records in buffers, and ships these buffers over the network when they are full. This style of processing can emulate both record-by-record processing (regard a buffer as “full” when it has one record), pure batch processing (retain all buffers in memory and disk until the result has been fully materialized), and, interestingly, a sweet spot in the middle, which is the default behavior of Flink and can achieve very high throughput

pipelining


This leads us to establish that pipelining is a proper superset of pure batching:

pipelining-venn


Batch systems are streaming systems in the closet


The little secret of batch processors is that they always include a hidden streaming component. When a batch processor reads a file, it streams the file to the first operator. If the first few operators are record-by-record transformations such as filters, mappers, etc, then the data is still streamed through the operator code (often by “chaining” operators together). However, at the first blocking operator (e.g., sort, aggregation, etc) a batch processor blocks the output until all the input of the operator has been produced.

Of course, blocking operators are blocking by nature, so their input needs indeed to be materialized before they start processing it. For example, a sorting operator needs the whole data to be present before it can return a fully sorted result. However, there is no fundamental reason that blocking should be part of the system and not the operator itself. Flink follows the philosophy of streaming end-to-end, by embedding blocking operators within streaming tasks. The system does not differentiate between a blocking operator and a streaming operator, this is part of the operator logic itself:

blocking-in-streaming


We have seen that it is possible to embed blocking operators into streaming topologies, and thus, streaming topologies strictly subsume classic batch processing DAGs:

dags-venn


Overall, the result of all these observations brings us to the thesis we started from, namely that batch is a special case of streaming:

conclusion-venn


Lambda and Kappa architectures, and batch-specific optimizations


If batch is a special case of streaming, can it be that pure batch applications are better served by the special case? Partly, but this can be fully rectified by implementing batch-specific optimizations on top of a stream processor. Let us look at how Flink in fact implements a hybrid architecture, incorporating optimizations that are specific to batch workloads in the framework.

As older streaming systems lacked support for high throughput and consistent results, the so-called Lambda architecture gained a lot of popularity. The Lambda architecture advocates using a batch system for the “heavy lifting”, augmenting it with a streaming system that “catches up” with data ingestion producing early, but maybe incomplete, results. Then, separate logic tries to “merge” the produced results for serving.

lambda


The Lambda architecture had well-known disadvantages, in particular that the merging process was often painful, as was the fact that two separate codebases that express the same logic need to be maintained.


Later, Jay Kreps
advocated that only one system, the stream processor, should be used for the entirety of data transformations, drastically simplifying the whole architecture:

kappa

Flink is a stream processor that embeds within the streaming topology all kinds of processing: stateful operators, blocking operators, iterations, windowing, as well as the simple stateless operators. Thus, it implements exactly this vision.

However, practice is always a bit more messy. We have seen that batch is a special case of streaming. As batch processors only need to support this special case, they are often in a position to make optimizations and take shortcuts in their internal design that the more general stream processing are not. In particular:

  • Batch programs do not need coordination to recover from failures. Instead, failed partitions can simply be restarted (because the input is finite).

  • Scheduling of batch jobs can be done in stages instead of bringing the whole topology live at once.

  • Query optimization techniques that estimate sizes of intermediate data sets can often be employed to reduce (often drastically) the runtime of the program.

  • Operators (such as joins) that can assume that their input is finite can use more efficient internal data structures

Flink internally implements all of these optimizations, resulting in two different code paths for batch and streaming that meet at the same endpoint, the system’s runtime:

hybrid-stack


The green boxes correspond to the streaming code paths, and the yellow boxes correspond to the batch code path. The foundation of all is a stream processing engine that executes both stream and batch programs. The end result is a system that natively supports stream processing, and treats batch as a special case of streaming by layering batch-specific optimizations on top of the streaming engine.

Topics: Apache Flink

Kostas Tzoumas
Article by:

Kostas Tzoumas

Related articles

Comments

Sign up for Monthly Blog Notifications

Please send me updates about products and services of Ververica via my e-mail address. Ververica will process my personal data in accordance with the Ververica Privacy Policy.

Our Latest Blogs

by Ververica Press Office October 15, 2020

Ververica Stream Alliance: Introducing New Business Partner Category

After the successful introduction of Ververica’s Stream Alliance Program, Ververica today announces an update to the program scheme and the addition of a new partner category, the Ververica...

Read More
by Dongjie Shi & Jiaming Song September 21, 2020

Intel’s distributed Model Inference platform presented at Flink Forward

Flink Forward Global Virtual Conference 2020 is kicking off next month and the Flink community is getting ready to discuss the future of stream processing, and Apache Flink. This time, the...

Read More
by Jark Wu & Qingsheng Ren September 10, 2020

A deep dive on Change Data Capture with Flink SQL during Flink Forward

Can you believe that Flink Forward Global Virtual Conference 2020 is only a few weeks away? 

Read More