
How to Safely Migrate the BYOC vv-agent to a New Node Pool

Answer

BYOC vv-agent Node Pool Migration Procedure

This guide describes how to migrate the BYOC vv-agent (for example version 1.9.4) and all related Pyxis components to a new node pool in a BYOC EKS cluster without disrupting workloads.

When to Migrate a Node Pool

Node pool migrations are typically required for:

  • Upgrading to new instance types
  • Increasing cluster capacity
  • Replacing unhealthy or degraded nodes
  • Improving isolation for sensitive workloads

Following this procedure ensures:

  • Zero workspace disruption
  • No job failures
  • A controlled transition of all Pyxis pods to the new node pool

Prerequisites

Do not continue unless the workspace is in a stable and healthy state.

Workspace readiness checklist

  • All jobs show Running or Finished
  • No jobs are stuck in Deploying or Cancelling
  • Every Pyxis pod is healthy
  • Logs show no recent errors or anomalies (see the quick check below)
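A quick cluster-side check for the last two items (the namespace placeholder is an assumption; use the namespace where the Pyxis components are installed):

kubectl get pods -n <namespace> | grep pyxis
kubectl logs <pyxis-pod-name> --tail=50

Every pod should be Running with all containers ready, and the recent log lines should be free of errors.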

Migration Steps

Step 1 - Create the New Node Pool

If the node pool is not yet available:

  1. Create the node pool in EKS.
  2. Apply the required labels.
  3. Confirm all nodes join the cluster and display Ready.
  4. Verify taints and tolerations are configured as expected (see the checks below).
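Items 3 and 4 can be checked from the CLI; the node-pool label below is a hypothetical placeholder for whatever labels were applied when the pool was created:

kubectl get nodes -l <node-pool-label>=<new-pool-name>
kubectl describe node <new-node-name> | grep -i taints

Every listed node should report Ready, and the Taints line should match the expected configuration.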

Proceed only when the new node pool is fully healthy.

Step 2 - Cordon the Old Node Pool

Prevent Kubernetes from scheduling new pods on the old pool:

kubectl cordon <node-name>

The cordoned nodes should show SchedulingDisabled, which ensures newly created Pyxis pods land on the new pool.
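If the old pool's nodes share a distinguishing label, the whole pool can be cordoned in one pass, since cordon also accepts a label selector (the label key and value are hypothetical placeholders):

kubectl cordon -l <old-pool-label>=<old-pool-name>
kubectl get nodes

The second command should list every old-pool node as Ready,SchedulingDisabled.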

Step 3 - Delete One Pod from Each Pyxis Pair

Pyxis components run in HA mode with two pods per component:

Component          Pod Count
pyxis-controller   2
pyxis-webhook      2
pyxis-manager      2

Goal: shift pods gradually to avoid downtime.

⚠️ Never delete both pods of a component simultaneously, or Pyxis becomes unavailable.

Example commands:

kubectl delete pod <pyxis-controller-xxx>
kubectl delete pod <pyxis-webhook-xxx>
kubectl delete pod <pyxis-manager-xxx>

Because the old pool is cordoned, Kubernetes recreates the deleted pods on the new node pool. After this step, one replica of each pair should already run on the new pool.
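To see which replica of each pair still sits on the old pool, pods can be filtered by node (the namespace placeholder is an assumption):

kubectl get pods -n <namespace> -o wide --field-selector spec.nodeName=<old-node-name>

The -o wide output includes a NODE column, which also confirms where the recreated pods landed.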

Step 4 - Verify Newly Launched Pods

Check the logs of the new pods:

kubectl logs <pyxis-controller-new>
kubectl logs <pyxis-webhook-new>
kubectl logs <pyxis-manager-new>

Confirm the absence of errors, certificate issues, readiness/liveness probe failures, and connectivity problems. Stop and remediate if any problem appears.
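Pod events are a useful complement to the logs, since probe failures and scheduling issues surface there as well:

kubectl describe pod <pyxis-controller-new>
kubectl get pod <pyxis-controller-new> -o wide

The Events section at the end of the describe output should contain no warnings, and the NODE column should point at the new pool.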

Step 5 - Delete the Remaining Pyxis Pods

Once half of each pair is stable, delete the remaining pods that still run on the old pool:

kubectl delete pod <pyxis-remaining-old>

Kubernetes recreates these pods on the new node pool so that every Pyxis pod now resides there.
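A final placement check (the namespace placeholder is an assumption):

kubectl get pods -n <namespace> -o wide | grep pyxis

Every NODE entry in the output should now belong to the new pool.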

Step 6 - Validate All Pyxis Pods

Re-check logs for:

  • Both pyxis-controller pods
  • Both pyxis-webhook pods
  • Both pyxis-manager pods

Investigate before continuing if any pod reports errors or fails health checks.
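If the Pyxis deployments carry a per-component label (the app label below is an assumption; kubectl get pods --show-labels reveals the real ones), both replicas of a component can be checked with one command:

kubectl logs -l app=pyxis-controller --prefix --tail=100

The --prefix flag annotates each line with the pod it came from, making it easy to compare the two replicas.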

Step 7 - Validate the Workspace and Jobs

In the Ververica UI, confirm:

  • Workspace is Online
  • vv-agent reports Healthy
  • Deployments can start and stop normally
  • No jobs show a degraded state
  • Checkpoints continue to succeed

Deploy a simple test job to ensure end-to-end behavior is unchanged.

Step 8 - Handle the Old Node Pool

Depending on operational requirements:

  • If the old node pool remains needed, uncordon the nodes:

    kubectl uncordon <node-name>

  • If retiring the pool, monitor the environment for several days, then delete the old node pool from EKS (see the sketch below).
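Both outcomes can be handled from the CLI. The label is the same hypothetical placeholder used earlier, and the eksctl command applies only if the pool was created as an eksctl-managed nodegroup:

kubectl uncordon -l <old-pool-label>=<old-pool-name>
eksctl delete nodegroup --cluster <cluster-name> --name <old-pool-name>

Run only the command that matches the chosen outcome, and delete the nodegroup only after the monitoring window has passed cleanly.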

Summary

  1. Cordon the old nodes.
  2. Recreate Pyxis pods on the new pool in two phases.
  3. Validate logs, workspace health, and jobs.
  4. Retire or uncordon the old node pool once stability is confirmed.

Following these steps maintains full BYOC stability during node pool migrations.