
How do I migrate my workloads (VVP, Flink deployments) to new nodes/nodegroups with minimal downtime?

Below is a simple, practical and low-risk method for migrating your workloads (including Flink deployments and VVP) to a new nodegroup when changing instance types, while minimizing downtime.

Assumptions:

  • The cluster has enough capacity to temporarily run two nodegroups in parallel during the migration process
  • No critical workloads rely on node-local storage that will be lost during cordon/drain
  • Your stateful jobs have checkpointing enabled, checkpoints are retained and the Restore Strategy for these jobs is set to Latest State

Step‑by‑Step Process:

1) Create a new nodegroup (Nodegroup B)
  • Copy all settings from the existing nodegroup (Nodegroup A) so that the IAM role, subnets, labels and all other relevant configuration are identical
  • Change only the instance type (if necessary), as in the sketch below
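  • For reference, a minimal sketch of creating Nodegroup B with eksctl (the cluster name, nodegroup name and instance type below are placeholders; labels, IAM role and subnets still need to match Nodegroup A as described above, and the AWS console or aws eks create-nodegroup works equally well):

    eksctl create nodegroup \
      --cluster <cluster-name> \
      --name nodegroup-b \
      --node-type <new-instance-type> \
      --nodes <desired-count> --nodes-min <min-count> --nodes-max <max-count>
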
2) Cordon Nodegroup A
  • List the nodes currently running in Nodegroup A

    kubectl get nodes -L eks.amazonaws.com/nodegroup

  • For each of the old nodes, run the cordon command so that no new pods can be scheduled on them

    kubectl cordon <node-id>

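  • If Nodegroup A contains many nodes, you can cordon them all in one pass by filtering on the nodegroup label (a sketch; replace <nodegroup-a-name> with the actual name of Nodegroup A):

    for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=<nodegroup-a-name> -o jsonpath='{.items[*].metadata.name}'); do
      kubectl cordon "$node"
    done
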
3) Restart each of the Flink jobs, one at a time
  1. Stop the job
  2. Start the job
  3. Confirm that the pods are scheduled on Nodegroup B, for example with the command shown below
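     To verify, list the pods together with the nodes they are running on and check the NODE column (a sketch assuming the Flink deployments run in the namespace <flink-namespace>; adjust to your setup):

    kubectl get pods -n <flink-namespace> -o wide
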
4) Drain old nodes
  • Safely evict any remaining pods:

    kubectl drain <node-id> --ignore-daemonsets --delete-emptydir-data

Note: If VVP was originally running on Nodegroup A, this step will evict its pod and the replacement VVP pod will automatically be scheduled on Nodegroup B.
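  • As with cordoning, you can drain all remaining Nodegroup A nodes in one pass by filtering on the nodegroup label (a sketch; replace <nodegroup-a-name> with the actual name of Nodegroup A):

    for node in $(kubectl get nodes -l eks.amazonaws.com/nodegroup=<nodegroup-a-name> -o jsonpath='{.items[*].metadata.name}'); do
      kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    done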

5) Delete Nodegroup A
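  • A minimal sketch of deleting Nodegroup A with eksctl, once all pods are confirmed to be running on Nodegroup B (placeholder names; the AWS console or aws eks delete-nodegroup works equally well):

    eksctl delete nodegroup --cluster <cluster-name> --name <nodegroup-a-name>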