Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revert influx lib upgrade; Influx smoke test before listening on http port. #307

Merged
merged 4 commits into from
Oct 1, 2024

Conversation

alsuren
Copy link
Collaborator

@alsuren alsuren commented Oct 1, 2024

This reverts #303 because it is causing the following in our logs: https://fly.io/apps/cargo-quickinstall-stats-server/monitoring

2024-10-01T15:14:50.843 app[7811ee7a2566e8] iad [info] called `Result::unwrap()` on an `Err` value: NonSuccessResponse(400, "{\"code\":\"invalid\",\"message\":\"no data written, errors encountered on line(s): line 1: timestamp value overflows after adjusting for specified precision\",\"line\":1}")

I'm not in the mood to debug it right now, but I think that if the new docker container fails to come up, the old one should stay in the load balancer and prevent an outage (note: this is not the case with the default strategy, but it is with the bluegreen strategy).

  • test this theory by trying to deploy the broken code

With the healthcheck and a bluegreen deployment strategy, we get a nonzero exit code from flyctl deploy, and this in its output:

Updating existing machines in 'cargo-quickinstall-stats-server' with bluegreen strategy

Verifying if app can be safely deployed 

Creating green machines
  Created machine 080e3e1a021098 [app]

Waiting for all green machines to start
  Machine 080e3e1a021098 [app] - started

Waiting for all green machines to be healthy
  Machine 080e3e1a021098 [app] - unchecked

Rolling back failed deployment
  Deleted machine 080e3e1a021098 [app]
Error: wait timeout
could not get all green machines to be healthy

This is enough to leave production in a healthy state and give us a build failure notification on the merge-to-main github action.

Unfortunately, it leaves main in a broken state, and requires a maintainer with access to https://fly.io/apps/cargo-quickinstall-stats-server/monitoring to debug the issue.

Ideally, we would also have an integration test that spins up an influxdb 3 server in a docker container and tries to report stats to it. Unfortunately there are no influxdb3 docker images yet. In practice we could probably use an influxdb 2 image, because the stats reporting wire format shouldn't have changed.

Let's just get the stats server back on its feet for now.

@alsuren alsuren marked this pull request as ready for review October 1, 2024 15:47
@alsuren alsuren requested a review from NobodyXu October 1, 2024 15:47
@NobodyXu NobodyXu merged commit 557986f into main Oct 1, 2024
33 checks passed
@NobodyXu NobodyXu deleted the influx-healthchecks branch October 1, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants