How Ververica Platform's and Apache Flink's Kubernetes HA Work and Where They Differ

Question

I want to configure Kubernetes High-Availability (HA) for my Flink cluster. Could you tell me how Ververica Platform's Kubernetes HA and Apache Flink's Kubernetes HA work? What are the differences between these two solutions?

Answer

Note: This article applies to Flink 1.8 and later, running in Ververica Platform 2.0+.

Since Ververica Platform 2.8, we have integrated Apache Flink's Kubernetes HA and adapted it to support the LATEST_STATE restore strategy in Ververica Platform. We recommend that all users of Ververica Platform 2.8 or later use Flink Kubernetes HA for their deployments.

TL;DR: see the Summary section below.

What is High Availability in Flink in General

For a Flink cluster without HA, if the JobManager (JM) crashes, no new jobs can be submitted and all running jobs fail. With HA enabled, new jobs can be submitted as soon as a new JM is ready, and running jobs continue to make progress after one more step: restarting from the previous checkpoint. This restart is effectively the job's downtime. In Flink, you should always expect occasional restarts and tune your application so that your SLAs still hold despite them.
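The recovery behavior described above relies on checkpointing and a restart strategy being configured. A minimal sketch of the relevant Flink configuration (the interval, attempt, and delay values are illustrative, not recommendations):

```yaml
# Enable periodic checkpoints so a restarted job has state to resume from
execution.checkpointing.interval: 60s

# Restart a failed job a bounded number of times before failing it completely
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10s
```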

Apache Flink's Kubernetes HA

Apache Flink's Kubernetes HA can be activated by the following Flink configuration:

high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: /path/to/ha/store
kubernetes.cluster-id: cluster-id

Typically, in this approach, there is a single leading JobManager at any time and multiple standby JobManagers ready to take over leadership in case the leader fails.
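In a standalone-on-Kubernetes setup, the standby JMs are typically obtained by running the JobManager as a Kubernetes Deployment with more than one replica; leadership among them is then decided by Flink's Kubernetes HA services. A hedged sketch (names, labels, and the image tag are placeholders for your own setup):

```yaml
apiVersion: apps/v1
kind: Deployment  # the Kubernetes Deployment running the JobManagers
metadata:
  name: flink-jobmanager
spec:
  replicas: 2  # one leader plus one standby JM
  selector:
    matchLabels:
      app: flink
      component: jobmanager
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      containers:
        - name: jobmanager
          image: flink:1.15  # placeholder image/version
          args: ["jobmanager"]
```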


Ververica Platform's Kubernetes HA

Ververica Platform's Kubernetes HA can be activated as follows:

Ververica Platform 2.8 or later

kind: Deployment  # this is Ververica Platform deployment, not the Kubernetes one
spec:
  template:
    spec:
      flinkConfiguration:
        high-availability: kubernetes
        high-availability.storageDir: /path/to/ha/store # optional, see more below
        high-availability.vvp-kubernetes.job-graph-store.enabled: true  # optional, see more below

Ververica Platform 2.7 or earlier

kind: Deployment  # this is Ververica Platform deployment, not the Kubernetes one
spec:
  template:
    spec:
      flinkConfiguration:
        high-availability: vvp-kubernetes
        high-availability.storageDir: /path/to/ha/store # optional, see more below
        high-availability.vvp-kubernetes.job-graph-store.enabled: true  # optional, see more below

Unlike Apache Flink's Kubernetes HA, this approach starts a single JobManager (JM) and manages it with a Kubernetes Job resource with the following specification:

kind: Job
spec:
  completions: 1
  parallelism: 1

Whenever that JM crashes, i.e., exits with a non-zero code, Kubernetes launches another JM as a replacement. With the help of the high-availability storage, the new JM knows where to resume from. The high-availability storage is configured automatically if Ververica Platform's Universal Blob Storage is configured.
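If Universal Blob Storage is not yet set up, it is typically configured through the Ververica Platform Helm values; a minimal sketch (the bucket URI is a placeholder, and the exact key layout may differ across platform versions):

```yaml
# Ververica Platform Helm values fragment (sketch)
vvp:
  blobStorage:
    baseUri: s3://my-bucket/vvp  # placeholder bucket/path
```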

In addition, with the configuration high-availability.kubernetes.job-graph-store.enabled: true or high-availability.vvp-kubernetes.job-graph-store.enabled: true, you can also store the generated job graph in the high-availability storage. Upon JM failover, the stored job graph is used instead of re-executing your job's main() to regenerate it.

Comparison

In all Kubernetes HA approaches described above, when the active JM (the leading JM in the case of Apache Flink's Kubernetes HA, the single JM in the case of Ververica Platform's Kubernetes HA) is lost, the TaskManagers (TMs) that are talking to the failed JM will fail the tasks running on them. This leads to a job restart from the previous checkpoint. Flink's restart strategy controls how these restarts are handled.

In terms of failover and recovery time, Apache Flink's Kubernetes HA can switch over to one of the standby JMs quickly, as they are already running. With Ververica Platform's Kubernetes HA, a new JM pod needs to be launched and started, although the execution of the job's main() can be skipped if you enable the job graph store. The actual recovery time depends on your job and your cluster environment.

Another difference is that Ververica Platform's Kubernetes HA runs a single leader election process for the entire JM process, while Apache Flink's Kubernetes HA runs multiple of them: one for the Dispatcher, one for the ResourceManager, one for the JobManager, and so on. Flink 1.15 followed the same route as Ververica Platform and implemented a multiple-component leader election service. See FLINK-24038 for more details.

The biggest advantage of Ververica Platform's Kubernetes HA is that, if you set the restore strategy to LATEST_STATE in your deployment, the job can restart from the latest state (i.e., the previously completed checkpoint or savepoint) once the underlying issue is fixed, even if the job has failed completely (e.g., after exhausting the configured restart attempts).
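For completeness, a sketch of the corresponding Deployment fragment (assuming the restoreStrategy field of the Ververica Platform Deployment spec; check your platform version's API reference for the exact layout):

```yaml
kind: Deployment  # this is a Ververica Platform deployment, not the Kubernetes one
spec:
  restoreStrategy:
    kind: LATEST_STATE  # restore from the latest checkpoint or savepoint
```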

Summary

This table summarizes the comparison between both Kubernetes HA approaches.

|  | Apache Flink Kubernetes HA | Ververica Platform Kubernetes HA |
|---|---|---|
| The number of JMs | one active, multiple standbys | one |
| Is the job restarted upon failover? | yes | yes |
| Is the job graph regenerated upon failover? | yes, main() is re-executed to generate the job graph | no, the job graph can (optionally) be stored in the HA store for re-use |
| The number of leader election processes in JM | multiple (Flink < 1.15); single (Flink 1.15 or later) | single |
| Can recover from LATEST_STATE | no | yes |

Related Information