Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-32105: Fix race to mark node Joined #823

Merged
merged 2 commits into from
May 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 7 additions & 5 deletions src/installer/installer.go
Original file line number Diff line number Diff line change
Expand Up @@ -776,16 +776,18 @@ func (i *installer) updateReadyMasters(nodes *v1.NodeList, readyMasters *[]strin
ctx := utils.GenerateRequestContext()
log := utils.RequestIDLogger(ctx, i.log)
log.Infof("Found a new ready master node %s with id %s", node.Name, node.Status.NodeInfo.SystemUUID)
*readyMasters = append(*readyMasters, node.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to leave it as is, in case we failed to update status but node is ready we better to exit and continue installation, no? Nothing critical in not setting Joined state it will be handle in controller afterwards

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's not as big a deal as I originally thought, because the controller will handle it anyway. But this does guarantee that everything is in the state we expect before we carry on to other work in here. It's still robust against unexpected nodes joining, and if the controller does win the race then this will still work on the next attempt.


host, ok := common.HostMatchByNameOrIPAddress(node, inventoryHostsMap, knownIpAddresses)
if ok && (host.Host.Status == nil || *host.Host.Status != models.HostStatusInstalled) {
if err := i.inventoryClient.UpdateHostInstallProgress(ctx, host.Host.InfraEnvID.String(), host.Host.ID.String(), models.HostStageJoined, ""); err != nil {
log.Errorf("Failed to update node installation status, %s", err)
continue
}
}
*readyMasters = append(*readyMasters, node.Name)
if !ok {
return fmt.Errorf("node %s is not in inventory hosts", node.Name)
}
ctx = utils.GenerateRequestContext()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to set request id?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already set on line 776. We don't need to generate another one, we haven't even made a request with the first one yet. And this made it really hard to debug, since the request ID that showed up in the assisted-service log never appeared in the logs here.

if err := i.inventoryClient.UpdateHostInstallProgress(ctx, host.Host.InfraEnvID.String(), host.Host.ID.String(), models.HostStageJoined, ""); err != nil {
utils.RequestIDLogger(ctx, i.log).Errorf("Failed to update node installation status, %s", err)
}
}
}

Expand Down
8 changes: 6 additions & 2 deletions src/inventory_client/inventory_client.go
Original file line number Diff line number Diff line change
Expand Up @@ -120,11 +120,15 @@ func CreateInventoryClientWithDelay(clusterId string, inventoryURL string, pullS
rehttp.RetryAny(
rehttp.RetryAll(
rehttp.RetryMaxRetries(minRetries),
rehttp.RetryStatusInterval(400, 404),
rehttp.RetryStatuses(404, 423, 425),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
rehttp.RetryStatusInterval(405, 600),
rehttp.RetryStatuses(408, 429),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
rehttp.RetryStatusInterval(500, 600),
),
rehttp.RetryAll(
rehttp.RetryMaxRetries(maxRetries),
Expand Down