[Envoy] Envoy proxy healthchecks #922

klapkov · 2024-03-25T06:21:10Z

Envoy proxy healthchecks

Summary

In the past we have observed cases where an application is running but does not accept any connections. When we looked into it, the app healthcheck was passing and the Envoy proxy was running as well, but no requests were reaching the app. This leads to the following loop:

1. Gorouter is unable to open a connection to the Diego cell.
2. Gorouter prunes the endpoint.
3. Since the app healthcheck passes, the endpoint gets re-registered.

This is why we started looking into ways to healthcheck the proxy itself. The best option we currently see is to modify the app healthcheck so that it also checks the proxy. Currently it uses only the app port. We could add a parallel check that does the same through the proxy port: the proxy would then forward the request to the app and we would receive a response. This doubles the number of healthcheck requests reaching the app, but that should not have any significant impact.

This extra check could be enabled with a flag in the executor, so that it is used only when needed.

Please let me know what you think. I believe this topic has been discussed in the past, so maybe someone could give some context on why it was never implemented.

Diego repo

https://github.com/cloudfoundry/executor
https://github.com/cloudfoundry/healthcheck

Viktor-Velkov · 2025-01-29T14:48:47Z

New info: 

We are adding an Envoy proxy liveness check. With this new functionality, when Envoy stops accepting TCP connections the health check fails and the app is restarted.
This is implemented in these two PRs:
cloudfoundry/executor#110
#985

The changes were tested in a test environment, where 3 Envoy TCP liveness healthchecks are visible.

We tested on our environment with the newly implemented Envoy liveness check and an iptables rule on the container side that drops all traffic with destination port 61001 (Envoy), which causes a timeout on the Gorouter side:

iptables -A INPUT -p tcp --dport 61001 -j DROP

After applying the iptables rule on the container to drop traffic to destination port 61001, we received the correct error message and the app was then restarted, which shows that the newly implemented logic works.

Feedback is highly appreciated.

klapkov added the enhancement label Mar 25, 2024

cf-foundation-community-automation bot added this to Application Runtime Platform Working Group Mar 25, 2024

cf-foundation-community-automation bot moved this to Inbox in Application Runtime Platform Working Group Mar 25, 2024

Viktor-Velkov mentioned this issue Jan 15, 2025

Adding 2 properties into the rep config for the liveness healthcheck … #985

Open

1 task

Viktor-Velkov mentioned this issue Jan 28, 2025

Envoy proxy liveness checks cloudfoundry/executor#110

Open

1 task
