Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use static page for broker liveness probe #265

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

pgier
Copy link
Collaborator

@pgier pgier commented Nov 11, 2022

This brings the broker liveness probe in sync with the community Helm chart, and should use less resources than the health_check script.
In Astra Streaming we switch to this liveness probe instead of using the metrics endpoint because we were getting a lot of errors in the broker logs (https://github.com/riptano/astra-streaming/pull/513).

This brings the broker liveness probe in sync with the community
Helm chart, and should use less resources than the health_check script.
@pgier
Copy link
Collaborator Author

pgier commented Nov 11, 2022

@cdbartholomew PTAL. This is how the Apache community broker is configured and we're doing the same in Astra Streaming.

Copy link
Contributor

@nicoloboschi nicoloboschi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tradeoff is that when the broker is deadlocked, k8s will no longer restart the pod. The current health check fails when there is deadlock.

@pgier
Copy link
Collaborator Author

pgier commented Nov 14, 2022

In Astra we were previously seeing brokers regularly restarting because the health check was failing, possibly incorrectly (maybe @zzzming has more info) when using the health check. So we switched to use the metrics endpoint, but then we were seeing brokers stuck in a running but not ready state. That issue seems much better in the current 2.10 versions that we're running. We switched to using the static page instead of the metrics endpoint in Astra, and it seems to be fine for the past couple weeks.

@michaeljmarshall
Copy link
Member

In Astra we were previously seeing brokers regularly restarting because the health check was failing, possibly incorrectly (maybe @zzzming has more info) when using the health check. So we switched to use the metrics endpoint, but then we were seeing brokers stuck in a running but not ready state. That issue seems much better in the current 2.10 versions that we're running. We switched to using the static page instead of the metrics endpoint in Astra, and it seems to be fine for the past couple weeks.

It'd be really helpful to know why the health check was failing. Another side effect of this change is that the pod could fail its readiness probe without failing the liveness probe, which can lead to problems with DNS lookups when deploying the brokers as a statefulset.

@pgier
Copy link
Collaborator Author

pgier commented Nov 14, 2022

Part of the issue was that the healthcheck topics would build up a very large backlog. Maybe the healthcheck was timing out and not acknowledging messages, and this was causing it to fail?

@michaeljmarshall
Copy link
Member

Do we have an issue opened in the upstream project? That sounds like a bug.

@pgier
Copy link
Collaborator Author

pgier commented Dec 14, 2022

@michaeljmarshall I think the issue was fixed in 2.10. At least we haven't seen it in the last couple months. Maybe we need a new endpoint specific to the liveness check?

@michaeljmarshall
Copy link
Member

A dedicated liveness check could make sense. We'd just need to find the right things to check. I thought about this a few months ago, but I didn't come up with a good solution. Maybe it is worth a discussion on the dev list to ask "when is a broker alive and when is it ready?"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants