How Apache Flink manages Kafka consumer offsets

October 12, 2018 | by Fabian Hueske

In this blog post, we explain how Apache Flink works with Apache Kafka to ensure that records from Kafka topics are processed with exactly-once guarantees, using a step-by-step example.

Checkpointing is Apache Flink’s internal mechanism to recover from failures. A checkpoint is a consistent copy of the state of a Flink application and includes the reading positions of the input. In case of a failure, Flink recovers an application by loading the application state from the checkpoint and continuing from the restored reading positions as if nothing happened. You can think of a checkpoint as saving the current state of a computer game. If something happens to you after you saved your position in the game, you can always go back in time and try again.

Checkpoints make Apache Flink fault-tolerant and ensure that the semantics of your streaming applications are preserved in case of a failure. Checkpoints are triggered at regular intervals that applications can configure.

The Kafka consumer in Apache Flink integrates with Flink’s checkpointing mechanism as a stateful operator whose state are the read offsets in all Kafka partitions. When a checkpoint is triggered, the offsets for each partition are stored in the checkpoint. Flink’s checkpoint mechanism ensures that the stored states of all operator tasks are consistent, i.e., they are based on the same input data. A checkpoint is completed when all operator tasks successfully stored their state. Hence, the system provides exactly-once state update guarantees when restarting from potential system failures.

Below we describe how Apache Flink checkpoints the Kafka consumer offsets in a step-by-step guide. In our example, the data is stored in Flink’s Job Master. It’s worth noting here that under POC or production use cases, the data would usually be stored in an external file storage such as HDFS or S3.

Step 1:

The example below reads from a Kafka topic with two partitions that each contains “A”, “B”, “C”, ”D”, “E” as messages. We set the offset to zero for both partitions.Apache Flink, Apache Kafka, Kafka consumer, stream processing

Step 2:

In the second step, the Kafka consumer starts reading messages from partition 0. Message “A” is processed “in-flight” and the offset of the first consumer is changed to 1.
Apache Flink, Apache Kafka, Kafka consumer, stream processing

Step 3:

In the third step, message “A” arrives at the Flink Map Task. Both consumers read their next records (message “B” for partition 0 and message “A” for partition 1). The offsets are updated to 2 and 1 respectively for both partitions. At the same time, Flink's Job Master decides to trigger a checkpoint at the source.  

Apache Flink, Apache Kafka, Kafka consumer, stream processing

Step 4:

In the following step, the Kafka consumer tasks have already created a snapshot of their states (“offset = 2, 1”) which is now stored in Apache Flink’s Job Master. The sources emit a checkpoint barrier after messages “B” and “A” from partitions 0 and 1 respectively. The checkpoint barriers are used to align the checkpoints of all operator tasks and guarantee the consistency of the overall checkpoint. Message “A” arrives at the Flink Map Task while the top consumer continues reading its next record (message “C”).  

Apache Flink, Apache Kafka, Kafka consumer, stream processing

Step 5:

This step shows that the Flink Map Task receives the checkpoint barriers from both sources and checkpoints its state to the Job Master. Meanwhile, the consumers continue reading more events from the Kafka partitions.
Apache Flink, Apache Kafka, Kafka consumer, stream processing

Step 6:

This step shows that the Flink Map Task communicates to Flink Job Master once it checkpointed its state. When all tasks of a job acknowledge that their state is checkpointed, the Job Master completes the checkpoint. From now on, the checkpoint can be used to recover from a failure. It’s worth mentioning here that Apache Flink does not rely on the Kafka offsets for restoring from potential system failures.
Apache Flink, Apache Kafka, Kafka consumer, stream processing

Recovery in case of a failure

In case of a failure (for instance, a worker failure) all operator tasks are restarted and their state is reset to the last completed checkpoint. As it is depicted in the following illustration.
Apache Flink, Apache Kafka, Kafka consumer, stream processing

The Kafka sources start from offset 2 and 1 respectively as this was the offset of the completed checkpoint. When the job is restarted we can expect a normal system operation as if no failure occurred before.

You can find more information and frequently asked questions regarding how to use best Apache Flink and Apache Kafka in one of our previous blog posts here.

As always, we welcome any feedback or suggestions through our contact forms and through the Apache Flink mailing list.  

Apache Flink website

Ververica Contact





Topics: Apache Flink

Fabian Hueske
Article by:

Fabian Hueske

Find me on:

Related articles


Sign up for Monthly Blog Notifications

Please send me updates about products and services of Ververica via my e-mail address. Ververica will process my personal data in accordance with the Ververica Privacy Policy.

Our Latest Blogs

by Frédérique Mittelstaedt October 19, 2021

Keeping Redditors safe with Stateful Functions - Flink Forward 2021

London-based Frédérique Mittelstaedt leads the real-time safety applications team at Reddit. His team keeps Redditors safe by automating the detection and actioning of harmful user behaviour and...

Read More
by Brent Davis October 06, 2021

Building Apache Flink Streaming at Splunk - Flink Forward 2021

Brent Davis, Principal Performance Engineer at Splunk, will deliver a technical session on  Sources, Sinks, and Operators: A Performance Deep Dive on October 27 at the upcoming Flink Forward...

Read More
by Chen Qin September 21, 2021

The Apache Flink Story at Pinterest - Flink Forward Global 2021

On October 27, at the annual Apache Flink user conference, Flink Forward Global 2021, Pinterest Tech Lead, Chen Qin will deliver a keynote talk on “Sharing what we love: The Apache Flink Story at...

Read More