How to rescale a running Flink job

Question

What is the process for scaling a running Flink job in or out, and how do the different upgrade and restore strategies of Ververica Platform play together with this?

Answer

Plain Apache Flink

Note: This section applies to Flink 1.5 or later.

In order to re-scale any Flink job:

  1. take a savepoint,
  2. stop the job,
  3. restart from the previously taken savepoint using any parallelism <= maxParallelism.

Since Flink 1.5, flink modify --parallelism <newParallelism> may be used to change the parallelism in one command. It will try to perform these actions in one go. If taking a savepoint fails, the whole operation will fail.

If you are using Flink 1.13 or later, the MVP (“minimum viable product”) feature of reactive mode can also be used to scale jobs up and down automatically by allocating more or fewer TM managers.

Flink version 1.18 also introduces support for "Externalized Declarative Resource Management" which enables true auto-scaling via a new Flink REST-API endpoint paired with the open source Kubernetes Operator.

Please review the latest details and limitations of this feature here:https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/elastic_scaling/#externalized-declarative-resource-management

Apache Flink with Ververica Platform

Note: This section applies to Ververica Platform 1.3 or later.

Ververica Platform exposes rescaling operations with a few more options and works slightly differently: Simply edit an existing deployment, change the parallelism, and apply changes. Ververica Platform will perform the required operations in the background.

Two properties of the deployment influence the scaling behavior: spec.upgradeStrategy and spec.restoreStrategy.

  • The restore strategy defines the state to start with when transitioning the job into Ververica Platform's RUNNING state. Use LATEST_STATE, for example, to start from the latest successful checkpoint or savepoint known to Ververica Platform.
  • The upgrade strategy defines what happens if the deployment specification is updated, for example, by changing the parallelism. STATELESS will terminate the currently running job immediately and start a new job based on the restore strategy above. STATEFUL will first take a savepoint and then continue the same way. If taking the savepoint fails (after retries), the upgrade process stops and the job continues running.

Ververica Platform will retry the operation in case of failures.

Tip: Ververica Platform 2.2 (and later) Enterprise Edition adds a new Autopilot feature which comes with autoscaling support for your Flink 1.11 or newer deployments.

Related Information