use static page for broker liveness probe #265

pgier · 2022-11-11T15:32:27Z

This brings the broker liveness probe in sync with the community Helm chart, and should use fewer resources than the health_check script.
In Astra Streaming we switched to this liveness probe instead of using the metrics endpoint because we were getting a lot of errors in the broker logs (https://github.com/riptano/astra-streaming/pull/513).
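
For readers unfamiliar with this kind of probe, an HTTP check against a static page boils down to a single GET that passes on a successful status. The following is a minimal Python sketch of that idea, not the chart's actual probe configuration. The function name `static_page_alive` is hypothetical, and the URL in the usage note is an assumption, since this PR does not state the path or port.

```python
import http.client
import urllib.request

# Talk to the target directly and ignore any proxy set in the environment,
# since a liveness probe is aimed straight at the pod.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def static_page_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` ends in a status in [200, 400).

    This follows the Kubernetes rule that an HTTP probe succeeds for any
    status from 200 to 399. Note that urllib follows redirects, so the
    status inspected here is the one from the final response.
    """
    try:
        with _direct.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers connection refused, timeouts, and 4xx/5xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False
```

A call such as `static_page_alive("http://localhost:8080/status.html")` would then report whether the page is being served; both the port and the path in that URL are assumed for illustration.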


pgier · 2022-11-11T15:39:26Z

@cdbartholomew PTAL. This is how the Apache community broker is configured and we're doing the same in Astra Streaming.

nicoloboschi

LGTM

michaeljmarshall

The tradeoff is that when the broker is deadlocked, k8s will no longer restart the pod. The current health check fails when there is a deadlock.
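
The tradeoff can be illustrated with a toy model. The sketch below is a simplified Python simulation, not Pulsar code: `ToyBroker` and all of its methods are hypothetical, and a leaked lock stands in for a real deadlock. One thread serves a static page over HTTP while a separate worker thread handles requests, so the static probe keeps passing after the worker hangs, whereas a probe that exercises the worker times out.

```python
import http.server
import queue
import threading
import urllib.request


class _StaticPage(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"OK"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


class ToyBroker:
    """Hypothetical stand-in for a broker: an HTTP thread plus one worker."""

    def __init__(self):
        self._lock = threading.Lock()
        self._work = queue.Queue()
        self._http = http.server.HTTPServer(("127.0.0.1", 0), _StaticPage)
        threading.Thread(target=self._http.serve_forever, daemon=True).start()
        threading.Thread(target=self._serve_work, daemon=True).start()

    def _serve_work(self):
        while True:
            reply = self._work.get()
            with self._lock:  # blocks forever once the lock has been leaked
                reply.put("done")

    def simulate_deadlock(self):
        self._lock.acquire()  # intentionally never released

    def static_probe(self, timeout=2.0):
        """GET the static page; only the HTTP thread is involved."""
        url = "http://127.0.0.1:%d/" % self._http.server_address[1]
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        try:
            with opener.open(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def functional_probe(self, timeout=0.5):
        """Send a request through the worker and wait for its answer."""
        reply = queue.Queue()
        self._work.put(reply)
        try:
            return reply.get(timeout=timeout) == "done"
        except queue.Empty:
            return False

    def close(self):
        self._http.shutdown()
        self._http.server_close()


if __name__ == "__main__":
    broker = ToyBroker()
    print(broker.static_probe(), broker.functional_probe())  # healthy worker
    broker.simulate_deadlock()
    print(broker.static_probe(), broker.functional_probe())  # hung worker
    broker.close()
```

In this model the static probe cannot tell the two states apart, which is the behaviour described above; whether a real broker's HTTP endpoint stays responsive during a deadlock depends on which threads are affected.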

pgier · 2022-11-14T14:37:39Z

In Astra we were previously seeing brokers restart regularly because the health check was failing, possibly incorrectly (maybe @zzzming has more info). So we switched to the metrics endpoint, but then we saw brokers stuck in a running-but-not-ready state. That issue seems much better in the current 2.10 versions that we're running. We have since switched from the metrics endpoint to the static page in Astra, and it has been fine for the past couple of weeks.

michaeljmarshall · 2022-11-14T16:18:48Z

> In Astra we were previously seeing brokers restart regularly because the health check was failing, possibly incorrectly (maybe @zzzming has more info). So we switched to the metrics endpoint, but then we saw brokers stuck in a running-but-not-ready state. That issue seems much better in the current 2.10 versions that we're running. We have since switched from the metrics endpoint to the static page in Astra, and it has been fine for the past couple of weeks.

It'd be really helpful to know why the health check was failing. Another side effect of this change is that the pod could fail its readiness probe without failing the liveness probe, which can lead to problems with DNS lookups when deploying the brokers as a StatefulSet.

pgier · 2022-11-14T17:13:30Z

Part of the issue was that the healthcheck topics would build up a very large backlog. Maybe the healthcheck was timing out and not acknowledging messages, and this was causing it to fail?

michaeljmarshall · 2022-11-14T17:20:43Z

Do we have an issue opened in the upstream project? That sounds like a bug.

pgier · 2022-12-14T20:58:06Z

@michaeljmarshall I think the issue was fixed in 2.10. At least we haven't seen it in the last couple of months. Maybe we need a new endpoint specific to the liveness check?

michaeljmarshall · 2022-12-16T05:51:35Z

A dedicated liveness check could make sense. We'd just need to find the right things to check. I thought about this a few months ago, but I didn't come up with a good solution. Maybe it is worth a discussion on the dev list to ask "when is a broker alive and when is it ready?"

use static page for broker liveness probe

cb3b0e8


nicoloboschi approved these changes Nov 11, 2022


michaeljmarshall reviewed Nov 11, 2022


michaeljmarshall requested a review from lhotari November 11, 2022 21:18

eolivelli approved these changes Nov 14, 2022
