# How to Configure Kubernetes HTTP Client Timeouts for Ververica Platform Appmanager
Flink deployments managed by Ververica Platform fail to start with the error "Failed to find Kubernetes deployment resource with name ... for the Flink cluster. Please verify that no external party deleted it." How can I fix this?
## Answer
Note: This section applies to Ververica Platform 2.9.1+
When Ververica Platform creates Flink clusters, its appmanager component communicates with the Kubernetes API server to create resources such as JobManager Jobs and TaskManager Deployments. This communication is subject to HTTP client timeouts for both connection establishment and response reading.
By default, both the connect timeout and read timeout are set to 10 seconds. If the Kubernetes API server takes longer than this to respond, the appmanager disconnects before receiving the response and assumes the resource was never created, causing the deployment to fail.
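This failure mode can be illustrated with a small, self-contained sketch (not Ververica code): a toy HTTP server stands in for a Kubernetes API server slowed down by admission processing, and a client with a short read timeout stands in for the appmanager. The server still completes the request after the client has given up, mirroring how the resource can be created even though the appmanager never sees the confirmation. All names and the delay values here are illustrative assumptions.

```python
import http.server
import threading
import time
import urllib.error
import urllib.request

handled = []  # records requests the "API server" actually finished processing


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Simulates an API server delayed by slow admission processing."""

    def do_GET(self):
        time.sleep(2)              # simulated webhook latency > client read timeout
        handled.append(self.path)  # the resource IS created on the server side
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"created")
        except OSError:
            pass  # client already disconnected; ignore the broken pipe

    def log_message(self, *args):
        pass  # keep the example quiet


server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

timed_out = False
try:
    # Read timeout (0.5 s) is shorter than the simulated delay (2 s),
    # like the appmanager's 10 s default against a slower webhook chain.
    urllib.request.urlopen(f"http://127.0.0.1:{port}/deployments", timeout=0.5)
except OSError:  # covers socket timeouts and urllib.error.URLError
    timed_out = True  # client gave up and assumes the resource was never created

time.sleep(2.5)  # let the server finish the request in the background
server.shutdown()

print(timed_out, handled)  # prints: True ['/deployments']
```

The client reports a failure even though `handled` shows the server completed the request, which is exactly the mismatch behind the "Failed to find Kubernetes deployment resource" error.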
### When Would You Need to Increase These Timeouts?
The Kubernetes API server may take longer than 10 seconds to respond in environments where:
- Admission webhooks are configured (e.g., Kyverno, OPA Gatekeeper, Aqua Security, Istio) — these webhooks intercept resource creation requests and perform policy evaluation, mutation, or validation before the API server returns a response. Complex policies or chained webhooks can add significant latency.
- The Kubernetes API server is under heavy load — large clusters with many concurrent operations may experience slower API response times.
- Network latency is high — environments where the Ververica Platform appmanager and the Kubernetes API server are separated by high-latency network paths.
In these cases, the appmanager's HTTP client times out waiting for the Kubernetes API response. The resource may actually be created successfully by the API server (after the webhook completes), but because the appmanager has already disconnected, it never receives the confirmation and reports a failure.
This issue most commonly affects TaskManager pods (created as Kubernetes Deployments under apps/v1) rather than JobManager pods (created as Kubernetes Jobs under batch/v1), because admission webhooks often apply more policies to Deployments than to Jobs, resulting in longer processing times.
### Solution
Increase the Kubernetes HTTP client timeouts by adding the following configuration to your Ververica Platform Helm `values.yaml`:

```yaml
vvp:
  appmanager:
    cluster:
      kubernetes.http-client.read-timeout-millis: "30000"
      kubernetes.http-client.connect-timeout-millis: "30000"
```
Then redeploy Ververica Platform with the updated values:

```shell
helm upgrade <release-name> ververica-platform -f values.yaml -n <namespace>
```
### Configuration Reference
| Configuration Key | Description | Default Value |
|---|---|---|
| `kubernetes.http-client.connect-timeout-millis` | Maximum time (in milliseconds) to wait for a TCP connection to the Kubernetes API server to be established | `10000` (10 seconds) |
| `kubernetes.http-client.read-timeout-millis` | Maximum time (in milliseconds) to wait for a response from the Kubernetes API server after a request has been sent | `10000` (10 seconds) |
These values should be set higher than the longest expected response time from the Kubernetes API server. A value of 30000 (30 seconds) is a reasonable starting point for environments with admission webhooks. If your webhooks have a timeoutSeconds configured (e.g., 30 seconds), ensure the VVP read timeout exceeds that value.
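The sizing rule above can be expressed as a short calculation: take the longest `timeoutSeconds` among your admission webhooks and add some headroom so the appmanager's read timeout always outlasts the webhook chain. The webhook timeout values below are illustrative assumptions, not values from any particular cluster.

```python
# Hypothetical webhook timeouts (seconds), as would be read from a cluster's
# ValidatingWebhookConfiguration / MutatingWebhookConfiguration objects.
webhook_timeouts_s = [10, 15, 30]

HEADROOM_MS = 5_000  # extra margin beyond the slowest webhook

# The appmanager read timeout should exceed the longest webhook timeout.
read_timeout_ms = max(webhook_timeouts_s) * 1000 + HEADROOM_MS

print(read_timeout_ms)  # prints: 35000
```

With a 30-second webhook timeout, this rule yields a read timeout slightly above 30000 ms, consistent with the guidance that the VVP read timeout should exceed the webhook's `timeoutSeconds`.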
### Verifying the Configuration
After redeploying, verify the configuration is applied by inspecting the appmanager ConfigMap:
```shell
kubectl get configmap <release-name>-ververica-platform-config -n <namespace> \
  -o jsonpath='{.data.application-prod-user\.yaml}' | grep -A5 "http-client"
```
You should see:
```yaml
kubernetes.http-client.connect-timeout-millis: "30000"
kubernetes.http-client.read-timeout-millis: "30000"
```
### Symptoms of This Issue
- Flink deployments fail during cluster startup with the error: "Failed to find Kubernetes deployment resource with name ... for the Flink cluster."
- JobManager pod is created successfully, but TaskManager pod is missing.
- The issue is intermittent and correlates with admission webhook processing time.
Written by Naci Simsek · Published 12 Mar 2026 · Last updated 12 Mar 2026