How to Safely Migrate the BYOC vv-agent to a New Node Pool
Answer
BYOC vv-agent Node Pool Migration Procedure
This guide describes how to migrate the BYOC vv-agent (for example version 1.9.4) and all related Pyxis components to a new node pool in a BYOC EKS cluster without disrupting workloads.
When to Migrate a Node Pool
Node pool migrations are typically required for:
- Upgrading to new instance types
- Increasing cluster capacity
- Replacing unhealthy or degraded nodes
- Improving isolation for sensitive workloads
Following this procedure ensures:
- Zero workspace disruption
- No job failures
- A controlled transition of all Pyxis pods to the new node pool
Prerequisites
Do not continue unless the workspace is in a stable and healthy state.
Workspace readiness checklist
- All jobs show Running or Finished
- No jobs are stuck in Deploying or Cancelling
- Every Pyxis pod is healthy
- Logs show no recent errors or anomalies
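A minimal sketch of how to verify this checklist from the command line, assuming the Pyxis components run in a dedicated namespace (the <pyxis-namespace> placeholder is an assumption, substitute your own):

# List all Pyxis pods and their current states:
kubectl get pods -n <pyxis-namespace>
# Surface any pod that is not Running or Completed:
kubectl get pods -n <pyxis-namespace> --field-selector=status.phase!=Running,status.phase!=Succeeded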
Migration Steps
Step 1 - Create the New Node Pool
If the node pool is not yet available:
- Create the node pool in EKS.
- Apply the required labels.
- Confirm all nodes join the cluster and display Ready.
- Verify taints and tolerations are configured as expected.
Proceed only when the new node pool is fully healthy.
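Node readiness and taints can be spot-checked with standard kubectl commands; the label key and value below are assumptions, use whatever labels you applied above:

# Confirm every node in the new pool joined the cluster and is Ready:
kubectl get nodes -l <pool-label-key>=<new-pool-name>
# Inspect the taints on one of the new nodes:
kubectl describe node <node-name> | grep -A 3 Taints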
Step 2 - Cordon the Old Node Pool
Prevent Kubernetes from scheduling new pods on the old pool:
kubectl cordon <node-name>
The cordoned nodes should show SchedulingDisabled, which ensures that newly created Pyxis pods land on the new pool.
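If the old pool's nodes share a common label, they can all be cordoned in one pass; the selector below is an assumption:

# Cordon every node in the old pool by label:
kubectl cordon -l <pool-label-key>=<old-pool-name>
# Verify the old nodes now report SchedulingDisabled:
kubectl get nodes -l <pool-label-key>=<old-pool-name>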
Step 3 - Delete One Pod from Each Pyxis Pair
Pyxis components run in HA mode with two pods per component:
| Component | Pod Count |
|---|---|
| pyxis-controller | 2 |
| pyxis-webhook | 2 |
| pyxis-manager | 2 |
Goal: shift pods gradually to avoid downtime.
⚠️ Never delete both pods of a component simultaneously or Pyxis becomes unavailable.
Example commands:
kubectl delete pod <pyxis-controller-xxx>
kubectl delete pod <pyxis-webhook-xxx>
kubectl delete pod <pyxis-manager-xxx>
Because the old pool is cordoned, Kubernetes recreates the deleted pods on the new node pool. After this step, one replica of each pair should already run on the new pool.
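To confirm where each replica landed, a wide pod listing shows the node for every pod; the name filter is an assumption based on the component names above:

# Show which node each Pyxis pod is scheduled on:
kubectl get pods -n <pyxis-namespace> -o wide | grep pyxis
# Optionally watch replacement pods come up on the new pool:
kubectl get pods -n <pyxis-namespace> -o wide -w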
Step 4 - Verify Newly Launched Pods
Check the logs of the new pods:
kubectl logs <pyxis-controller-new>
kubectl logs <pyxis-webhook-new>
kubectl logs <pyxis-manager-new>
Confirm the absence of errors, failed health checks, certificate issues, readiness/liveness probe failures, or connectivity problems. Stop and remediate if any problem appears.
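Recent events often surface probe failures and scheduling problems faster than logs; a sketch, reusing the namespace placeholder from earlier:

# Show the most recent events in the namespace:
kubectl get events -n <pyxis-namespace> --sort-by=.lastTimestamp | tail -20
# Check readiness conditions on one of the new pods:
kubectl describe pod <pyxis-controller-new> -n <pyxis-namespace> | grep -A 5 Conditions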
Step 5 - Delete the Remaining Pyxis Pods
Once one replica of each pair is running stably on the new pool, delete the remaining pods that still run on the old pool:
kubectl delete pod <pyxis-remaining-old>
Kubernetes recreates these pods on the new node pool so that every Pyxis pod now resides there.
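To double-check, the NODE column of a wide pod listing should now show only new-pool nodes:

kubectl get pods -n <pyxis-namespace> -o wide | grep pyxis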
Step 6 - Validate All Pyxis Pods
Re-check logs for:
- Both pyxis-controller pods
- Both pyxis-webhook pods
- Both pyxis-manager pods
Investigate before continuing if any pod reports errors or fails health checks.
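A small shell loop can pull the latest log lines from every Pyxis pod in one pass; the name filter and namespace placeholder are assumptions as before:

# Print the last 50 log lines of each Pyxis pod:
for p in $(kubectl get pods -n <pyxis-namespace> -o name | grep pyxis); do
  echo "== $p =="
  kubectl logs "$p" -n <pyxis-namespace> --tail=50
done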
Step 7 - Validate the Workspace and Jobs
In the Ververica UI confirm:
- Workspace is Online
- vv-agent reports Healthy
- Deployments can start and stop normally
- No jobs show a degraded state
- Checkpoints continue to succeed
Deploy a simple test job to ensure end-to-end behavior is unchanged.
Step 8 - Handle the Old Node Pool
Depending on operational requirements:
- If the old node pool remains needed, uncordon the nodes (a label-based variant is sketched after this list):
kubectl uncordon <node-name>
- If retiring the pool, monitor the environment for several days, then delete the old node pool from EKS.
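As with cordoning, a label selector can uncordon the whole pool at once; the label is a placeholder:

kubectl uncordon -l <pool-label-key>=<old-pool-name>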
Summary
- Cordon the old nodes.
- Recreate Pyxis pods on the new pool in two phases.
- Validate logs, workspace health, and jobs.
- Retire or uncordon the old node pool once stability is confirmed.
Following these steps maintains full BYOC stability during node pool migrations.