How do I migrate my workloads (VVP, Flink deployments) to new nodes/nodegroups with minimal downtime?
Below is a simple, low-risk method for migrating your workloads (including Flink deployments and Ververica Platform (VVP)) to a new nodegroup, for example when changing instance types, while minimizing downtime.
Assumptions:
- The cluster has enough capacity to temporarily run two nodegroups in parallel during the migration
- No critical workloads rely on node-local storage that will be lost during cordon/drain
- Your stateful jobs have checkpointing enabled, checkpoints are retained, and the Restore Strategy for these jobs is set to Latest State
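For stateful jobs, the last assumption typically corresponds to configuration along these lines. This is a sketch only; the exact keys depend on your Flink version, and in VVP they are usually set under the Deployment's `flinkConfiguration` rather than in a flink-conf.yaml file. The bucket name is a placeholder.

```yaml
# Hypothetical Flink configuration excerpt (flink-conf.yaml or the
# flinkConfiguration section of a VVP Deployment spec)
execution.checkpointing.interval: 60s
# Keep checkpoints when a job is cancelled so a Latest State restore can use them
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# Durable, node-independent storage for checkpoints
state.checkpoints.dir: s3://my-checkpoint-bucket/checkpoints
```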
Step‑by‑Step Process:
1) Create a new nodegroup (Nodegroup B)
- Copy all settings from the existing nodegroup (Nodegroup A) so that the IAM role, subnets, labels, and all other relevant configuration are identical
- Change only the instance type (if necessary)
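On EKS, creating the new nodegroup could look like the following AWS CLI sketch. Every value is a placeholder; copy the real IAM role, subnets, labels, and scaling settings from Nodegroup A (for example via `aws eks describe-nodegroup`).

```shell
# Sketch: create Nodegroup B mirroring Nodegroup A, changing only the instance type.
# All names, ARNs, subnet IDs, and sizes below are placeholders.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name nodegroup-b \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --subnets subnet-aaa subnet-bbb \
  --instance-types m5.2xlarge \
  --scaling-config minSize=2,maxSize=6,desiredSize=3 \
  --labels env=prod
```

Wait for the nodegroup to reach ACTIVE status before proceeding.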
2) Cordon the nodes in Nodegroup A
- Get the list of nodes currently in Nodegroup A:
kubectl get nodes -L eks.amazonaws.com/nodegroup
- Cordon each old node so that no new pods are scheduled on it:
kubectl cordon <node-id>
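Instead of cordoning nodes one by one, you can cordon the whole nodegroup in one pass by filtering on the EKS-managed nodegroup label. The nodegroup name below is a placeholder.

```shell
# Cordon every node in the old nodegroup in one command.
# -l filters by the label EKS puts on managed nodes; -o name emits "node/<id>" lines.
kubectl get nodes -l eks.amazonaws.com/nodegroup=nodegroup-a -o name \
  | xargs -r kubectl cordon
```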
3) Restart the Flink jobs
- Stop each job (with checkpoints retained and the Restore Strategy set to Latest State, it will resume from its latest state on restart)
- Start the job again; because Nodegroup A is cordoned, its new pods can only be scheduled on Nodegroup B
- Confirm that the pods are scheduled on Nodegroup B
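To confirm the placement, the following sketch shows where each pod landed; the namespace is a placeholder for wherever your Flink jobs run.

```shell
# -o wide adds a NODE column showing which node each pod is running on
kubectl get pods -n vvp-jobs -o wide

# Cross-check which nodegroup a given node belongs to
kubectl get node <node-id> -L eks.amazonaws.com/nodegroup
```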
4) Drain the old nodes
- Safely evict any remaining pods from each old node:
kubectl drain <node-id> --ignore-daemonsets --delete-emptydir-data
Note: If VVP itself was running on Nodegroup A, draining will automatically reschedule the VVP pod onto Nodegroup B
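As with cordoning, the drain can be applied to the whole nodegroup via its label; the nodegroup name is a placeholder.

```shell
# Drain every node in the old nodegroup in one pass.
# --ignore-daemonsets skips DaemonSet-managed pods (they are tied to the node);
# --delete-emptydir-data allows eviction of pods using emptyDir volumes,
# whose contents will be lost (see the node-local storage assumption above).
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=nodegroup-a -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```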
5) Delete Nodegroup A
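Once the old nodes are empty, the deletion itself could look like this sketch (cluster and nodegroup names are placeholders):

```shell
# Delete the old managed nodegroup with the AWS CLI
aws eks delete-nodegroup --cluster-name my-cluster --nodegroup-name nodegroup-a

# Or, if the nodegroup was created with eksctl:
eksctl delete nodegroup --cluster my-cluster --name nodegroup-a
```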