-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use static page for broker liveness probe #265
base: master
Are you sure you want to change the base?
Conversation
This brings the broker liveness probe in sync with the community Helm chart, and should use less resources than the health_check script.
@cdbartholomew PTAL. This is how the Apache community broker is configured and we're doing the same in Astra Streaming. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tradeoff is that when the broker is deadlocked, k8s will no longer restart the pod. The current health check fails when there is deadlock.
In Astra we were previously seeing brokers regularly restarting because the health check was failing, possibly incorrectly (maybe @zzzming has more info) when using the health check. So we switched to use the metrics endpoint, but then we were seeing brokers stuck in a running but not ready state. That issue seems much better in the current 2.10 versions that we're running. We switched to using the static page instead of the metrics endpoint in Astra, and it seems to be fine for the past couple weeks. |
It'd be really helpful to know why the health check was failing. Another side effect of this change is that the pod could fail its readiness probe without failing the liveness probe, which can lead to problems with DNS lookups when deploying the brokers as a statefulset. |
Part of the issue was that the healthcheck topics would build up a very large backlog. Maybe the healthcheck was timing out and not acknowledging messages, and this was causing it to fail? |
Do we have an issue opened in the upstream project? That sounds like a bug. |
@michaeljmarshall I think the issue was fixed in 2.10. At least we haven't seen it in the last couple months. Maybe we need a new endpoint specific to the liveness check? |
A dedicated liveness check could make sense. We'd just need to find the right things to check. I thought about this a few months ago, but I didn't come up with a good solution. Maybe it is worth a discussion on the dev list to ask "when is a broker alive and when is it ready?" |
This brings the broker liveness probe in sync with the community Helm chart, and should use less resources than the health_check script.
In Astra Streaming we switch to this liveness probe instead of using the metrics endpoint because we were getting a lot of errors in the broker logs (https://github.com/riptano/astra-streaming/pull/513).