Flink Forward 2025 Barcelona: The Future of AI is Real-Time
What is the process for scaling a running Flink job in or out, and how do the different upgrade and restore strategies of Ververica Platform play together with this?
Note: This section applies to Flink 1.5 or later.
In order to re-scale any Flink job:
parallelism <= maxParallelism
.Since Flink 1.5, flink modify --parallelism <newParallelism>
may be used to change the parallelism in one command. It will try to perform these actions in one go. If taking a savepoint fails, the whole operation will fail.
If you are using Flink 1.13 or later, the MVP (“minimum viable product”) feature of reactive mode can also be used to scale jobs up and down automatically by allocating more or fewer TM managers.
Flink version 1.18 also introduces support for "Externalized Declarative Resource Management" which enables true auto-scaling via a new Flink REST-API endpoint paired with the open source Kubernetes Operator.
Please review the latest details and limitations of this feature here:https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/elastic_scaling/#externalized-declarative-resource-management
Note: This section applies to Ververica Platform 1.3 or later.
Ververica Platform exposes rescaling operations with a few more options and works slightly differently: Simply edit an existing deployment, change the parallelism, and apply changes. Ververica Platform will perform the required operations in the background.
Two properties of the deployment influence the scaling behavior: spec.upgradeStrategy
and spec.restoreStrategy
.
RUNNING
state. Use LATEST_STATE
, for example, to start from the latest successful checkpoint or savepoint known to Ververica Platform.STATELESS
will terminate the currently running job immediately and start a new job based on the restore strategy above. STATEFUL
will first take a savepoint and then continue the same way. If taking the savepoint fails (after retries), the upgrade process stops and the job continues running.Ververica Platform will retry the operation in case of failures.
Tip: Ververica Platform 2.2 (and later) Enterprise Edition adds a new Autopilot feature which comes with autoscaling support for your Flink 1.11 or newer deployments.