-
Notifications
You must be signed in to change notification settings - Fork 187
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure webhook availability during Konnectivity Agent rolling update #566
Comments
Short answer: the life of a proxied TCP connection from API Server to your Gatekeeper endpoint is not expected to survive across a rolling restart of Agent pods. What is your use case / goal? In GKE, you should be able to use maintenance windows to prevent Agent restarts at sensitive times; note that this should also cover control plane (API Server and Konnectivity Server) restarts / maintenance. Another strategy to reduce Agent churn is to use a dedicated / stable node pool for |
@jkh52 In general we try to configure all cluster workloads to gracefully handle being evicted and rescheduled to a different node without any user impact, just to make it easy for us to change infrastructure configuration without users noticing. Specifically for Konnectivity Agent, we would like to avoid API server requests being unexpectedly rejected (when using fail closed webhooks) or unexpectedly accepted (when using fail open webhooks). The following are the most common scenarios that would cause Konnectivity Agent to be evicted/rescheduled (and therefore could cause a potential webhook request failure):
As you said, we can take steps to reduce the chance of Konnectivty Agent being evicted/rescheduled but we should expect webhook request failures under normal operation, so I think this issue can be closed unless there is any value in keeping it open |
Thanks for providing details. Sounds good, I will close this. |
Re-opening to track potential improvements around agent shutdown. There is a fairly wide time interval between an agent's first SIGTERM and final termination (terminationGracePeriodSeconds The historical decision has been for agents stay connected to servers during this time, and continue routing data packets. The protocol could be extended to allow a given |
@jkh52 This would be amazing! An alternative would be to deploy Konnectivity Agent as a static kubelet Pod and dialing the Konnectivity Agent only on the Node where the target Pod is running so that the Pod can drain connections when the Node is drained but before Konnectivity Agent is terminated (I guess similar to how GKE used to do with the SSH daemon), but I believe this would have scalability issues for large clusters (at least according to this video) |
@dippynark Note that kube-apiserver re-uses established proxy TCP connections up to a certain TTL period. If your webhook calls are frequent enough, we should expect this case. So the proposal is not "airtight", since konnectivity-server agent selection is (initial) dial-time only. The konnectivity-client library gives kube-apiserver a net.Conn object, so it is difficult to support "will Close() soon" semantics there. References: DialContext and conn.go |
@jkh52 speaking to Google support, they closed my ticket related to this issue because a Google engineer on GitHub (which I am assuming is you) wanted to continue on GitHub. Google support said there were workarounds for this issue, which I guess is referring to the your previous comment; I wouldn't really say they are workarounds, they just reduce the chance of the issue occurring, however happy for this ticket to be closed as I agree that supporting the suggested protocol improvements would be complex and even implementing retries at the API server sounds tricky due to the multiple TCP hops from API server to webhook Pod(s). We will likely also do the following to reduce the impact of this issue:
|
Your plan sounds good, especially in the near term, since OSS improvements will take a while to rollout to GKE (likely introduced in a minor version e.g. 1.30 or 1.31). I will still take a look at this low hanging fruit idea. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
I think I maybe ran into a fun example of this where we have a an eviction webhook. kubectl drain would work fine until the konnectivityp-agent it was proxying requests through was evicted (there as another konnecitivy agent up and ready). Would get an EOF that stopped kubectl drain Other drainers will retry and be fine but I thought it was an interesting example of how konnectiivty not draining out its proxies connections can unexpectedly break webhooks. I have a work around but wanted to hopefully give @jkh52 some motivation and thanks for the work he's doing |
We are running Gatekeeper as a validating webhook on GKE (although I don't think the webhook implementation or cloud provider matters) and we have a test where we perform a rolling update of Gatekeeper while continuously making Kubernetes API requests that should be rejected to ensure requests/connections are being drained properly.
However, if we also delete Konnectivity Agent Pods while rolling Gatekeeper (gradually, to ensure that the Konnectivity Agent Pods aren't all down at the same time) or perform a rolling update (
kubectl rollout restart deployment -n kube-system konnectivity-agent
) then a few requests are allowed through (theValidatingWebhookConfiguration
is configured to fail open).My question is whether this is an issue or whether Konnectivity Agent is behaving as expected? I guess this is happening because the long-lived HTTP keepalive connections between the Kubernetes API server and Gatekeeper (via Konnectivity Server and Konnectivity Agent) are being broken when Konnectivity Agent terminates and are not being drained properly (because there is no way for Konnectivity Agent to inspect the encrypted requests and disable keepalive before shutting down).
Should the Kubernetes API server be able to detect such TCP disconnects and retry validation after reconnecting?
The text was updated successfully, but these errors were encountered: