Exit raft removed checker if raft isn't initialized #29329

Merged
miagilepner merged 2 commits into main from miagilepner/removed-checker-exit
Jan 10, 2025

Conversation

miagilepner
Contributor

Description

Exits the raft removed checker when the raft backend isn't initialized. This prevents log spamming.

TODO only if you're a HashiCorp employee

  • Backport Labels: If this fix needs to be backported, use the appropriate backport/ label that matches the desired release branch. Note that in the CE repo, the latest release branch will look like backport/x.x.x, but older release branches will be backport/ent/x.x.x+ent.
    • LTS: If this fixes a critical security vulnerability or severity 1 bug, it will also need to be backported to the current LTS versions of Vault. To ensure this, use all available enterprise labels.
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@miagilepner miagilepner added this to the 1.19.0-rc milestone Jan 9, 2025
@miagilepner miagilepner requested a review from bosouza January 9, 2025 16:53
@miagilepner miagilepner requested a review from a team as a code owner January 9, 2025 16:53
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Jan 9, 2025

github-actions bot commented Jan 9, 2025

CI Results:
All Go tests succeeded! ✅

github-actions bot commented Jan 9, 2025

Build Results:
All builds succeeded! ✅

bosouza
bosouza previously approved these changes Jan 9, 2025
@bosouza bosouza self-requested a review January 9, 2025 19:15
@@ -1461,6 +1461,9 @@ func (b *RaftBackend) StartRemovedChecker(ctx context.Context) {
	for {
		select {
		case <-ticker.C:
			if !b.Initialized() {
Contributor

@bosouza bosouza Jan 9, 2025

Sorry, right after approving it occurred to me that I hadn't considered how this uninitialized condition should interact with this loop. Please check if my understanding is correct: this new condition !b.Initialized() won't ever be evaluated before the raft backend is initialized, so it only evaluates to true after RaftBackend.TeardownCluster(), which gets called, for example, after force-restoring a snapshot. At that point the only thing that could "reinitialize" the raft backend is another call to RaftBackend.SetupCluster(), but that would also start a new StartRemovedChecker, so we can confidently rely on this !b.Initialized() to stop the removed checker. If that's right, then my one suggestion would be to add a comment explaining that this check is not supposed to prevent the removed checker from running before the raft backend is initialized, but instead to allow it to exit cleanly after teardown of RaftBackend.

That also raises the question of what the point of case <-ctx.Done(): is, if not to exit on teardown. Tracing the context all the way back, it seems to just be the background context, so there doesn't seem to be a teardown mechanism relying on it.

But I do get the feeling that I'm missing something, and maybe a single instance of RaftBackend is supposed to last through multiple seal/unseal cycles, in which case the removed checker would either need a way to be restarted after unseal or remain working throughout the sealed period. I probably have a few incorrect assumptions in my reasoning; if you think it's easier to chat about it, lmk!

Contributor Author

Good call out! I've added a comment that should hopefully provide some clarity. The raft backend will always be set up again in SetupCluster, which will make a new removed checker. The initialized check here is supposed to handle the case where the cluster has been torn down, but the context isn't closed (which, as you mention, is pretty much every case since we're using context.Background())

Contributor

good to know, thanks for the additional details!

@@ -364,6 +364,9 @@ func TestRaftHACluster_Removed_ReAdd(t *testing.T) {
	if !server.Healthy {
		return fmt.Errorf("server %s is unhealthy", serverID)
	}
	if server.NodeType != "voter" {
Contributor Author

This isn't related to the PR, but I wanted to fix the race test flake. I ran it locally 5 times and didn't see it fail, when previously it would fail 50% of the time locally.

@miagilepner miagilepner requested a review from bosouza January 10, 2025 15:05
@miagilepner miagilepner enabled auto-merge (squash) January 10, 2025 15:34
@miagilepner miagilepner merged commit dc0cd5a into main Jan 10, 2025
92 checks passed
@miagilepner miagilepner deleted the miagilepner/removed-checker-exit branch January 10, 2025 17:16
Labels
hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed pr/no-changelog

2 participants