How do I migrate my workloads (VVP, Flink deployments) to new nodes/nodegroups with minimal downtime?
Below is a simple, low-risk method for migrating your workloads (including Flink deployments and Ververica Platform (VVP)) to a new nodegroup, for example when changing instance types, while minimizing downtime.
Assumptions:
- The cluster has enough capacity to temporarily run two nodegroups in parallel during the migration
- No critical workloads rely on node-local storage that will be lost during cordon/drain
- Your stateful jobs have checkpointing enabled, checkpoints are retained, and the Restore Strategy for these jobs is set to Latest State
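For stateful jobs, the last assumption typically corresponds to configuration along these lines. This is a sketch only; the exact keys depend on your Flink version, and in VVP they are usually set under the Deployment's `flinkConfiguration` rather than in a flink-conf.yaml file. The bucket name is a placeholder.

```yaml
# Hypothetical Flink configuration excerpt (flink-conf.yaml or the
# flinkConfiguration section of a VVP Deployment spec)
execution.checkpointing.interval: 60s
# Keep checkpoints when a job is cancelled so a Latest State restore can use them
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION
# Durable, node-independent storage for checkpoints
state.checkpoints.dir: s3://my-checkpoint-bucket/checkpoints
```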
Step‑by‑Step Process:
1) Create a new nodegroup (Nodegroup B)
- Copy all settings from the existing nodegroup (Nodegroup A) so that the IAM role, subnets, labels, and all other relevant configuration are identical
- Change only the instance type (if necessary)
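On EKS, creating the new nodegroup could look like the following AWS CLI sketch. Every value is a placeholder; copy the real IAM role, subnets, labels, and scaling settings from Nodegroup A (for example via `aws eks describe-nodegroup`).

```shell
# Sketch: create Nodegroup B mirroring Nodegroup A, changing only the instance type.
# All names, ARNs, subnet IDs, and sizes below are placeholders.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name nodegroup-b \
  --node-role arn:aws:iam::123456789012:role/eks-node-role \
  --subnets subnet-aaa subnet-bbb \
  --instance-types m5.2xlarge \
  --scaling-config minSize=2,maxSize=6,desiredSize=3 \
  --labels env=prod
```

Wait for the nodegroup to reach ACTIVE status before proceeding.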
2) Cordon the nodes in Nodegroup A
- Get the list of nodes currently in Nodegroup A:
kubectl get nodes -L eks.amazonaws.com/nodegroup
- Cordon each old node so that no new pods are scheduled on it:
kubectl cordon <node-id>
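Instead of cordoning nodes one by one, you can cordon the whole nodegroup in one pass by filtering on the EKS-managed nodegroup label. The nodegroup name below is a placeholder.

```shell
# Cordon every node in the old nodegroup in one command.
# -l filters by the label EKS puts on managed nodes; -o name emits "node/<id>" lines.
kubectl get nodes -l eks.amazonaws.com/nodegroup=nodegroup-a -o name \
  | xargs -r kubectl cordon
```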
3) Restart the Flink jobs
- Stop each job (with checkpoints retained and the Restore Strategy set to Latest State, it will resume from its latest state on restart)
- Start the job again; because Nodegroup A is cordoned, its new pods can only be scheduled on Nodegroup B
- Confirm that the pods are scheduled on Nodegroup B
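To confirm the placement, the following sketch shows where each pod landed; the namespace is a placeholder for wherever your Flink jobs run.

```shell
# -o wide adds a NODE column showing which node each pod is running on
kubectl get pods -n vvp-jobs -o wide

# Cross-check which nodegroup a given node belongs to
kubectl get node <node-id> -L eks.amazonaws.com/nodegroup
```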
4) Drain the old nodes
- Safely evict any remaining pods from each old node:
kubectl drain <node-id> --ignore-daemonsets --delete-emptydir-data
Note: If VVP itself was running on Nodegroup A, draining will automatically reschedule the VVP pod onto Nodegroup B
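As with cordoning, the drain can be applied to the whole nodegroup via its label; the nodegroup name is a placeholder.

```shell
# Drain every node in the old nodegroup in one pass.
# --ignore-daemonsets skips DaemonSet-managed pods (they are tied to the node);
# --delete-emptydir-data allows eviction of pods using emptyDir volumes,
# whose contents will be lost (see the node-local storage assumption above).
for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=nodegroup-a -o name); do
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done
```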
5) Delete Nodegroup A
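Once the old nodes are empty, the deletion itself could look like this sketch (cluster and nodegroup names are placeholders):

```shell
# Delete the old managed nodegroup with the AWS CLI
aws eks delete-nodegroup --cluster-name my-cluster --nodegroup-name nodegroup-a

# Or, if the nodegroup was created with eksctl:
eksctl delete nodegroup --cluster my-cluster --name nodegroup-a
```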