Exit raft removed checker if raft isn't initialized #29329
Conversation
@@ -1461,6 +1461,9 @@ func (b *RaftBackend) StartRemovedChecker(ctx context.Context) {
	for {
		select {
		case <-ticker.C:
			if !b.Initialized() {
sorry, right after approving it occurred to me that I hadn't considered how this uninitialized condition should interact with this loop, please check if my understanding is correct: this new condition `!b.Initialized()` won't ever be evaluated before the raft backend is initialized, so it only returns true after `RaftBackend.TeardownCluster()`, which gets called for example after force-restoring a snapshot. At that point the only thing that could "reinitialize" the raft backend is another call to `RaftBackend.SetupCluster()`, but that would also start a new `StartRemovedChecker`, so we can confidently rely on this `!b.Initialized()` to stop the removed checker. If that's right, then my one suggestion would be to add a comment explaining that this check is not supposed to prevent the removed checker from running before the raft backend is initialized, but instead to allow it to exit cleanly after teardown of `RaftBackend`.
That also raises the question of what the point of `case <-ctx.Done():` is, if not to exit on teardown; but tracing the context all the way back, it seems to just be the background context, so there doesn't seem to be a teardown mechanism relying on it.
But I do get the feeling that I'm missing something, and maybe a single instance of `RaftBackend` is supposed to last through multiple seal/unseal cycles, in which case the removed checker would either need a way to be restarted after unseal or remain working throughout the sealed period. I probably have a few incorrect assumptions in my reasoning, so if you think it's easier to chat about it, lmk!
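To make the lifecycle I have in mind concrete, here's a minimal sketch; the type and method bodies are simplified assumptions for illustration, not the real backend code:

```go
package main

import (
	"context"
	"sync/atomic"
)

// Hypothetical stand-in for the real RaftBackend, reduced to the one piece
// relevant here: an "initialized" flag flipped by setup/teardown.
type raftBackend struct {
	initialized atomic.Bool
}

func (b *raftBackend) Initialized() bool { return b.initialized.Load() }

// SetupCluster marks the backend initialized and starts a fresh removed
// checker, so a checker that exited after an earlier teardown is replaced.
func (b *raftBackend) SetupCluster(ctx context.Context) {
	b.initialized.Store(true)
	go b.startRemovedChecker(ctx) // a new checker every time
}

// TeardownCluster (called e.g. after force-restoring a snapshot) marks the
// backend uninitialized; the running checker notices on its next tick.
func (b *raftBackend) TeardownCluster() {
	b.initialized.Store(false)
}

func (b *raftBackend) startRemovedChecker(ctx context.Context) {
	// ticker loop elided: it checks b.Initialized() on every tick and
	// returns once it reports false.
}

func main() {
	b := &raftBackend{}
	b.SetupCluster(context.Background())
	b.TeardownCluster()                  // checker exits via !b.Initialized()
	b.SetupCluster(context.Background()) // and a brand-new checker is started
}
```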
Good callout! I've added a comment that should hopefully provide some clarity. The raft backend will always be set up again in `SetupCluster`, which will make a new removed checker. The initialized check here is supposed to handle the case where the cluster has been torn down but the context isn't closed (which, as you mention, is pretty much every case, since we're using `context.Background()`).
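Roughly, the intent reads like this sketch; it's a hypothetical, simplified version of the loop, not the exact code or comment wording from the PR, and the removal logic itself is elided:

```go
package raftsketch

import (
	"context"
	"time"
)

// startRemovedChecker sketches where the new check sits and what it is for.
func startRemovedChecker(ctx context.Context, initialized func() bool, ticker *time.Ticker, checkRemoved func()) {
	for {
		select {
		case <-ticker.C:
			// This check is not meant to stop the checker from running before
			// the backend is first initialized (the checker is only started
			// from SetupCluster). It lets the goroutine exit cleanly after
			// TeardownCluster, because ctx is context.Background() here and
			// is never cancelled on teardown.
			if !initialized() {
				return
			}
			checkRemoved()
		case <-ctx.Done():
			return
		}
	}
}
```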
good to know, thanks for the additional details!
@@ -364,6 +364,9 @@ func TestRaftHACluster_Removed_ReAdd(t *testing.T) {
			if !server.Healthy {
				return fmt.Errorf("server %s is unhealthy", serverID)
			}
			if server.NodeType != "voter" {
this isn't related to the PR, but I wanted to fix the flake in this race test. I ran it locally 5 times and didn't see it fail, whereas previously it would fail about 50% of the time locally
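For context, the assertion the test polls now looks roughly like this sketch; the type, error wording, and surrounding retry helper are placeholders, and the explanation of the flake is my reading of the change rather than something stated in the PR:

```go
package raftsketch

import "fmt"

// Placeholder for the per-server entries the test inspects.
type serverState struct {
	Healthy  bool
	NodeType string
}

// checkServersReady stands in for the condition the test retries until it
// passes: every server must be healthy *and* already promoted to voter.
// Presumably the voter promotion is what raced before and caused the flake.
func checkServersReady(servers map[string]serverState) error {
	for serverID, server := range servers {
		if !server.Healthy {
			return fmt.Errorf("server %s is unhealthy", serverID)
		}
		if server.NodeType != "voter" {
			return fmt.Errorf("server %s has not been promoted to voter yet", serverID)
		}
	}
	return nil
}
```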
Description
Exits the removed checker when the raft backend isn't initialized, which prevents the checker from spamming the logs after the cluster has been torn down.