feat: minimal health check #2092
Conversation
Force-pushed from 5b41599 to df52a02.
The admin server is implemented now; it's only activated when a port is specified (in `admin-server-port`). The health check responds with a 200 status when postgrest is healthy and with 503 when the connection is down. There's no body sent. Also, I'm not logging the health endpoint hits.
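For illustration, a minimal sketch of spawning the admin server only when a port is configured; the names `startAdminServer` and `adminApp` are assumptions here, not PostgREST's actual code:

```haskell
-- Sketch: only fork a Warp server for the admin/health endpoint when an
-- admin port is configured. Names are illustrative only.
import Control.Concurrent (forkIO)
import Control.Monad (void)
import Network.Wai (Application)
import Network.Wai.Handler.Warp (run)

startAdminServer :: Maybe Int -> Application -> IO ()
startAdminServer Nothing          _        = pure ()  -- no admin-server-port: nothing to do
startAdminServer (Just adminPort) adminApp = void $ forkIO (run adminPort adminApp)
```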
Force-pushed from 1a17372 to c110269.
I thought we could consolidate both the cases for However the above is wrong because as mentioned on automatic connection recovery, when So a
Force-pushed from 927c3ee to 15d8b87.
There are some codecov errors because currently our io tests don't test a down connection. I've added the manual tests here for now.
Force-pushed from 15d8b87 to 6e1892e.
I've named the endpoint as `health`. Also, I decided to not do any logging when hitting it.
@@ -208,6 +209,7 @@ listener appState = do
    handleFinally dbChannel _ = do
      -- if the thread dies, we try to recover
      AppState.logWithZTime appState $ "Retrying listening for notifications on the " <> dbChannel <> " channel.."
      AppState.putIsListenerOn appState False
Codecov warning should be gone when #1766 (comment) is done.
That's not something we can guarantee anyway. So the healthcheck will always be only about proper connection status.
The current logic only tests the database connection, but does not really make sure whether the main port / socket is still open. Does it respond with an unhealthy status for #2042 for sure? How about other cases? This was also mentioned in #1933 (comment):
I think this is basically an issue with the two-ports approach. And also there will be additional effort to implement unix sockets for that. I still think the config option to set an endpoint for those types of checks on the main server would be much better, as it would test the connection both ways (http connection and database connection).
Hm, I thought the max number of open files was defined per process (due to
Yeah, you're probably right here. However, I meant to use this as an example for the more general problem of how to make sure that we are really accepting connections properly on the main port. This should be a core part of the health check - and in my opinion can only be done reliably by connecting to that port...
However, what we can and should do is: keep track of the status of the last schema cache refresh - and if that fails, e.g. because of something like #2024, we should return 503. Currently we are still returning 200 when the schema cache fails. You can test this by adding the SQL code in #2102 (but not the fix in the query) to your fixtures - then run
Here's a better example. Maybe made up a bit, but still valid: run postgrest on a unix socket, then remove the socket file manually. The health endpoint will still return "healthy" / 200 - even though nobody can connect to postgrest anymore. This is not healthy. Overall, the health endpoint only really makes sense if it can reliably tell you whether postgrest is healthy or not. And in the current implementation this is not the case yet.
Aha, I actually saw a similar case of a failed schema cache load. A db had a crazy view (maybe autogenerated) that not even
Perhaps we can catch that error with setOnClose? Then update the health check state accordingly. Edit: Here we could keep it simple and also die.
Maybe Warp functions can help us tie the main app port state to the health check port. Disadvantages of doing the health check on the main app port:
Because of those, I think it's better to keep the health check port separate from the main app port. @wolfgangwalther WDYT?
Ah, can you still get a hold of that view definition? If possible I'd like to see whether we can do anything to improve the view parsing to avoid that.
If we think that further, we will eventually end up dying on everything that would otherwise make the health check respond unhealthy - I don't think that makes too much sense. IMHO, we should only panic once we are in a state from which we can't recover anymore. A failed schema cache reload is not that. Once the view definition that causes the schema cache to fail is removed, we can safely reload the schema cache and resume operation. Or, to put it the other way around: a panic + restart won't help solve the problem either. In most setups it will just lead to an endless restart-panic loop, too. We should not retry the schema cache load in an endless loop once it's failed. We should stick around in an unhealthy state, report that state on the health endpoint, report the failed schema cache reload in the log once and wait for a signal / notification to reload.
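A rough sketch of what "stick around in an unhealthy state and report it" could look like; the `AppState` field and helper names below are assumptions, not PostgREST's actual API:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
-- Sketch: remember whether the last schema cache load succeeded, so the
-- health endpoint can report 503 instead of panicking and restarting.
import Control.Exception (SomeException, try)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

newtype AppState = AppState { stateSCacheOk :: IORef Bool }

newAppState :: IO AppState
newAppState = AppState <$> newIORef False

-- 'runSCacheQuery' stands in for the real schema cache query.
loadSchemaCache :: AppState -> IO () -> IO ()
loadSchemaCache appState runSCacheQuery = do
  result <- try runSCacheQuery
  case result of
    Left (_ :: SomeException) -> writeIORef (stateSCacheOk appState) False  -- stay unhealthy, wait for a reload signal
    Right ()                  -> writeIORef (stateSCacheOk appState) True

-- Healthy only if the connection is up *and* the last schema cache load worked.
isHealthy :: AppState -> IO Bool -> IO Bool
isHealthy appState connectionUp = (&&) <$> readIORef (stateSCacheOk appState) <*> connectionUp
```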
Again, this was an example to point out the benefit of having the healthcheck on the main port. The idea here is that whatever problem comes up with the main server, this will catch it. I don't think we should try to find out about every problem that can happen and fix those manually.
In this specific case it'd probably have made sense to die, because recovering from a lost unix socket is easily done by restarting. However, given that this case is so unlikely, I don't know whether we need to add code for that. Having the health endpoint not respond should be fine, as this will 99% be something that somebody did once and does not need to be automated.
That's why the proposal was to use a config option to set up an endpoint, which is then used as the health endpoint - this can then work the same way as the health endpoint you have here, just on the main port.
I imagine both of those could be solved once we have JWT validation moved to a Middleware in #1988. We could then make the health endpoint just another Middleware that's loaded before anything else. This can then check if the request was made on the health endpoint and respond accordingly. If it's not, it can trigger the main app and all the other Middlewares, including the logger - this way you will have avoided logging health requests automatically, too.
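A hedged sketch of that middleware idea, assuming a WAI `Middleware`; `isHealthy` is a hypothetical action that would consult PostgREST's internal state:

```haskell
-- Sketch of a health-check Middleware loaded before everything else: it
-- short-circuits requests to the health path and passes the rest through,
-- so health hits never reach the logger or the main app.
import Data.Text (Text)
import Network.HTTP.Types (status200, status503)
import Network.Wai (Middleware, pathInfo, responseLBS)

healthMiddleware :: [Text] -> IO Bool -> Middleware
healthMiddleware healthPath isHealthy app req respond
  | pathInfo req == healthPath = do
      healthy <- isHealthy
      respond $ responseLBS (if healthy then status200 else status503) [] mempty
  | otherwise = app req respond
```

It would then wrap the main app, e.g. something like `run 3000 (healthMiddleware ["health"] checkState mainApp)`.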
Yeah, agree. I'll improve the health check for these cases.
A disadvantage If we go this route. When Also, the admin port provides a path forward for doing #1526, which is also meant to be used in an internal network. It'd somehow feel off doing another config path for metrics. Edit: Related SO question.
True, forgot about that PR 🤦♂️, that would be a nice way to do it. That being said, I still see more benefits in doing the health check with the admin port and it's also less invasive to the main app logic.
Want to note that I'm not opposed to a health check on a custom endpoint, I think that could be an additional feature. It would save the need for configuring a proxy for users that want to expose the health check externally.
I think that would be a matter of blocking access to the
We could plan ahead for that and use a config option for an internal endpoint, of which
I think overall, most of the questions are just a matter of design and implementation, but don't really have any effect on the feature or user. However:
So, in the end it's just a matter of "slightly easier to use" with the admin port or "slightly more likely to provide accurate health status" with the main port. Since the health check is mostly useful in more complex setups, adding some additional proxy config seems to be easily doable. Personally, I'd always take the more reliable approach here.
Hm, I really tried to take the simplest approach here (the change needed for the feature is minimal) while maintaining a level of reliability we can improve further. I do assume that EMFILE is the only thing that can happen to Warp, given our historic issues (listed on #2042 (comment)) and also various instances on production I've been debugging. Thus, the health check should be "reliable" (assuming no unknown unknowns later on).
Yeah, I can reliably reproduce this with an
Edit: Scratch that, it's messy to do a second worker for the schema cache load, instead I'll follow your exact suggestion. I reached the same conclusion, that's the best behavior.
True. I wonder if it's worth having a different response for a failed schema cache load. Similar to how Kubernetes has liveness and readiness probes. Liveness(/health) could be the healthy connection
Can confirm this behavior. I tested 1500 requests to a function that selects a
A new
Repeating the health checks will throw:
While new
Hm. I think the article you cited is a pretty valuable read. I might draw a slightly different conclusion, depending on what you meant to say exactly with "Liveness(/health) could be the healthy connection".
This is basically the argument I made all along - with the difference that I did not separate liveness from readiness. The
I think we should stop using the term
It seems the vague term of Overall, I think we should do the following, which should solve the whole discussion we had:
The The This avoids the whole problem of DDoS-ing the
Fully agree!
I see, so this one will not check for any internal state, it will only reply with 200 to check if postgrest is "alive" (or running). What I don't see is the need for it to be only on the main app port - 3000. As Laurence confirmed above, if there's an EMFILE on port 3000, the 3001 port will also fail to respond. The disadvantages of adding a special route (mentioned above) on the main app still apply, it's more complex for the codebase. So, how about adding the
However, that could be done as a later enhancement as well; it would be useful for users that want an external live endpoint (without extra proxy config). WDYT?
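To make the liveness/readiness split discussed above concrete, here's a rough sketch of an admin-port app with two endpoints; the /live and /ready names and the `isReady` check are assumptions for illustration, not the code in this PR:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch: liveness only proves the process answers HTTP; readiness also
-- consults internal state (connection + schema cache).
import Network.HTTP.Types (status200, status404, status503)
import Network.Wai (Application, pathInfo, responseLBS)

adminApp :: IO Bool -> Application
adminApp isReady req respond =
  case pathInfo req of
    ["live"]  -> respond $ responseLBS status200 [] mempty
    ["ready"] -> do
      ready <- isReady
      respond $ responseLBS (if ready then status200 else status503) [] mempty
    _         -> respond $ responseLBS status404 [] mempty
-- e.g.: Warp.run adminPort (adminApp checkReadiness)
```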
Correct.
As I said earlier, this is not only about this one specific problem. I gave another example with deleting the unix socket. Mainly this is about those things that are still unknown to us. Those are the things that - if at all - we can only recover from with a restart. The article you cited makes the same point:
I don't know what else to say, I am convinced by my own argument ;)
Every feature adds complexity, no? Implementing the
I see a few lines of code in both cases. And I see no additional benefit compared to the
Hm, I guess the OpenAPI output would have to consider the
True. I'd still prefer not polluting the API URL paths though, it seems a lot cleaner. So I wonder if there's a way to check for the main app port internally on the admin server port. Perhaps by doing a request to a
I wouldn't consider the
Unfortunately, we can't know for sure whether a path is non-existent without relying on the schema cache or hitting the DB. But we can't do either of the two for liveness. So even if we did it that way, we'd need a reserved path on the main port for that purpose that is known to return 404 in all cases. The only difference would be that the 404 for that would be rewritten to a 200. But that would still pollute the API URL path namespace in a way. So it seems a lot cleaner to just add the config option, which allows people to opt in to this pollution and to change the name of the endpoint to a non-breaking one for their individual setup.
Yeah, agree.
Turns out we can connect to the main app socket with raw `Network.Socket` calls. Got the hint from this SO question and got the sample code from the intro in https://hackage.haskell.org/package/network-2.6.3.6/docs/Network-Socket.html. Will make a PR soon. Edit: Done #2109
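A minimal sketch of that idea, based on the network package's intro example linked above; host/port handling and names are assumptions here, not the #2109 code:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
-- Sketch: consider the main app "live" if we can open a raw TCP connection
-- to its port from within the process serving the admin endpoints.
import Control.Exception (SomeException, bracket, try)
import qualified Network.Socket as S

mainAppAccepts :: String -> String -> IO Bool
mainAppAccepts host port = do
  let hints = S.defaultHints { S.addrSocketType = S.Stream }
  addrs <- S.getAddrInfo (Just hints) (Just host) (Just port)
  case addrs of
    []         -> pure False
    (addr : _) -> do
      result <- try $ bracket
        (S.socket (S.addrFamily addr) (S.addrSocketType addr) (S.addrProtocol addr))
        S.close
        (\sock -> S.connect sock (S.addrAddress addr))
      pure $ either (\(_ :: SomeException) -> False) (const True) result
```

For a main app listening on a unix socket, the same connect-and-close check could presumably be done with a `SockAddrUnix` address instead of resolving host/port.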
I like it. I thought about just using a TCP liveness probe in kubernetes, but that wouldn't work with the main app on a unix socket, I think. Your proposal works around that nicely.
Closes #1933.
When the `admin-server-port` config is set, it enables a `<host>:<admin_server_port>/health` endpoint that replies with 200 OK when postgrest is healthy and with 503 when it's not. In both cases, the response doesn't have a body.

Steps

- `db-channel-enabled=True` (default): do the health check based on the LISTENer state to avoid sending extra queries.
- `db-channel-enabled=False`: resort to doing the health check with a `select 1` query.
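A simplified sketch of the two strategies in the steps above; `getIsListenerOn` and `runSelect1` stand in for the real AppState accessor and pool query and are assumptions here:

```haskell
-- Sketch: with db-channel-enabled the LISTENer state already tells us whether
-- the connection is up (no extra query); without it, fall back to "select 1".
checkIsHealthy :: Bool      -- db-channel-enabled
               -> IO Bool   -- read the LISTENer state (e.g. AppState.getIsListenerOn)
               -> IO Bool   -- run a "select 1" against the pool, True on success
               -> IO Bool
checkIsHealthy dbChannelEnabled getIsListenerOn runSelect1
  | dbChannelEnabled = getIsListenerOn
  | otherwise        = runSelect1
```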