Exit raft removed checker if raft isn't initialized #29329

Merged
merged 2 commits on Jan 10, 2025
13 changes: 13 additions & 0 deletions physical/raft/raft.go
@@ -1461,6 +1461,19 @@ func (b *RaftBackend) StartRemovedChecker(ctx context.Context) {
	for {
		select {
		case <-ticker.C:
			// If the raft cluster has been torn down (which will happen on
			// seal) the raft backend will be uninitialized. We want to exit
			// the loop in that case. If the cluster unseals, we'll get a
			// new backend setup and that will have its own removed checker.

			// There is a ctx.Done() check below that will also exit, but
			// in most (if not all) places we pass in context.Background()
			// to this function. Checking initialization will prevent this
			// loop from continuing to run after the raft backend is stopped
			// regardless of the context.
			if !b.Initialized() {

@bosouza (Contributor) commented on Jan 9, 2025:

Sorry, right after approving it occurred to me that I hadn't considered how this uninitialized condition should interact with this loop, so please check if my understanding is correct: this new condition !b.Initialized() won't ever be evaluated before the raft backend is initialized, so it only returns true after RaftBackend.TeardownCluster(), which gets called, for example, after force-restoring a snapshot. At that point the only thing that could “reinitialize” the raft backend is another call to RaftBackend.SetupCluster(), but that would also start a new StartRemovedChecker, so we can confidently rely on this !b.Initialized() to stop the removed checker. If that's right, then my one suggestion would be to add a comment explaining that this check is not supposed to prevent the removed checker from running before the raft backend is initialized, but instead to allow it to exit cleanly after teardown of the RaftBackend.

That also raises the question of what the point of case <-ctx.Done(): is, if not to exit on teardown. But tracing the context all the way back, it seems to just be the background context, so there doesn't seem to be a teardown mechanism relying on that, indeed.

But I do get the feeling that I'm missing something, and maybe a single instance of RaftBackend is supposed to last through multiple seal/unseal cycles, in which case the removed checker would either need a way to be restarted after unseal or to keep working throughout the sealed period. I probably have a few incorrect assumptions in my reasoning; if you think it's easier to chat about it, lmk!


Contributor Author replied:

Good callout! I've added a comment that should hopefully provide some clarity. The raft backend will always be set up again in SetupCluster, which will make a new removed checker. The initialized check here is supposed to handle the case where the cluster has been torn down but the context isn't closed (which, as you mention, is pretty much every case, since we're using context.Background()).


Contributor replied:

Good to know, thanks for the additional details!

				return
			}
			removed, err := b.IsNodeRemoved(ctx, b.localID)
			if err != nil {
				logger.Error("failed to check if node is removed", "node ID", b.localID, "error", err)
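
To make the thread above easier to follow, here is a small self-contained Go sketch of the loop's shape after this change. Only Initialized and IsNodeRemoved come from the diff; the local interface, the ticker interval, the logging, and the reaction to errors or to removal are assumptions for illustration, not the PR's actual code.

// Illustration only: removedChecker mirrors the two RaftBackend methods the
// diff calls; the exact signatures are inferred, not copied from Vault.
package raftsketch

import (
	"context"
	"log"
	"time"
)

type removedChecker interface {
	Initialized() bool
	IsNodeRemoved(ctx context.Context, nodeID string) (bool, error)
}

// runRemovedCheckerSketch shows the shape of the loop after this PR: the
// initialization guard runs ahead of the removal check on every tick, and
// the ctx.Done() case remains only as a secondary exit.
func runRemovedCheckerSketch(ctx context.Context, b removedChecker, nodeID string) {
	ticker := time.NewTicker(time.Minute) // interval is assumed for the sketch
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// The new guard: once the backend is torn down (e.g. on seal) it
			// is no longer initialized, so exit. SetupCluster starts a fresh
			// checker when the cluster is set up again.
			if !b.Initialized() {
				return
			}
			removed, err := b.IsNodeRemoved(ctx, nodeID)
			if err != nil {
				log.Printf("failed to check if node %s is removed: %v", nodeID, err)
				continue
			}
			if removed {
				// The real checker reacts to removal here; elided in the sketch.
				return
			}
		case <-ctx.Done():
			// Rarely reached in practice, since callers pass context.Background().
			return
		}
	}
}

The point the thread settles on is visible in this shape: because callers pass context.Background(), the ctx.Done() branch rarely fires, so the Initialized() guard is what actually stops the loop after RaftBackend.TeardownCluster().
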
3 changes: 3 additions & 0 deletions vault/external_tests/raftha/raft_ha_test.go
@@ -364,6 +364,9 @@ func TestRaftHACluster_Removed_ReAdd(t *testing.T) {
			if !server.Healthy {
				return fmt.Errorf("server %s is unhealthy", serverID)
			}
			if server.NodeType != "voter" {

Contributor Author commented:

This isn't related to the PR, but I wanted to fix the race test flake. I ran it locally 5 times and didn't see it fail, whereas previously it would fail 50% of the time locally.

				return fmt.Errorf("server %s has type %s", serverID, server.NodeType)
			}
		}
		return nil
	})
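
For reference, a hedged sketch of the polling pattern the test change relies on: keep re-checking until every server is both healthy and a voter, so the test no longer proceeds during the window where a re-added node reports healthy but has not yet been promoted. The waitForVoters helper, the serverState struct, and the timings below are invented for illustration; the real test works against the fields visible in the diff (Healthy, NodeType) inside its own retry callback.

// Illustration only: serverState carries just the two fields the test checks;
// everything else here is an assumption made for the sketch.
package raftsketch

import (
	"fmt"
	"time"
)

type serverState struct {
	Healthy  bool
	NodeType string
}

// checkVoters mirrors the test's per-server checks: every server must be
// healthy and must already have been promoted to a voter.
func checkVoters(servers map[string]serverState) error {
	for serverID, server := range servers {
		if !server.Healthy {
			return fmt.Errorf("server %s is unhealthy", serverID)
		}
		if server.NodeType != "voter" {
			return fmt.Errorf("server %s has type %s", serverID, server.NodeType)
		}
	}
	return nil
}

// waitForVoters polls until checkVoters passes or the (assumed) deadline hits,
// which is the shape of the retry the test wraps its checks in.
func waitForVoters(fetch func() (map[string]serverState, error)) error {
	deadline := time.Now().Add(10 * time.Second) // timeout is assumed
	for {
		servers, err := fetch()
		if err == nil {
			if err = checkVoters(servers); err == nil {
				return nil
			}
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out waiting for voters: %w", err)
		}
		time.Sleep(250 * time.Millisecond) // poll interval is assumed
	}
}

Adding the NodeType condition means the retry keeps waiting through that promotion window, which appears to be what the flaky runs were hitting.
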