OCPBUGS-32105: Fix race to mark node Joined #823

zaneb · 2024-04-15T10:23:28Z

In the race between assisted-installer on the bootstrap node and assisted-installer-controller on the cluster control plane to mark nodes as Joined, a win for the assisted-installer-controller would cause the bootstrapping process to lock up for 30+ minutes.

Prevent this by not retrying HTTP requests that receive a 409 response at the HTTP transport level. Instead, retry at the logic level and avoid making requests that cannot succeed.

openshift-ci-robot · 2024-04-15T10:23:33Z

@zaneb: This pull request references Jira Issue OCPBUGS-32105, which is invalid:

expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

codecov · 2024-04-15T10:35:46Z

Codecov Report

Attention: Patch coverage is 75.00%, with 3 lines in your changes missing coverage. Please review.

Project coverage is 57.32%. Comparing base (236a7a0) to head (9a809af).
Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #823      +/-   ##
==========================================
+ Coverage   54.74%   57.32%   +2.57%     
==========================================
  Files          16       16              
  Lines        3394     3803     +409     
==========================================
+ Hits         1858     2180     +322     
- Misses       1364     1416      +52     
- Partials      172      207      +35

Files	Coverage Δ
src/inventory_client/inventory_client.go	`27.36% <100.00%> (+1.03%)`	⬆️
src/installer/installer.go	`72.58% <50.00%> (+2.83%)`	⬆️

... and 2 files with indirect coverage changes

If a host is already in the Installed state (which can occur when the assisted-installer-controller sets the progress to Done), don't try to set the progress to Joined: the request can never succeed, and it takes 30+ minutes of unlogged retries inside the client before an error is returned.

This narrows the window in which the problem can occur, but it could still happen if the bootstrap assisted-installer reads the Host before the assisted-installer-controller updates the status. Ensure any failed requests are retried by not adding the Node to the readyMasters list until the Progress has been set to either Joined or Done (the latter triggers a change of Status to Installed).

Improve debugging by not logging different request_ids for messages corresponding to a single request.
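The logic-level retry described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `inventory` interface, `racyInventory`, `pollReadyMaster`, and the status constants are hypothetical stand-ins, and the fake client simulates the controller winning the race between the read and the write.

```go
package main

import (
	"errors"
	"fmt"
)

// hostStatus and its values are hypothetical stand-ins for the real inventory model.
type hostStatus string

const (
	statusInstalling hostStatus = "installing"
	statusInstalled  hostStatus = "installed"
)

// errConflict stands in for a 409 response returned immediately by the transport.
var errConflict = errors.New("409 Conflict")

// inventory is a hypothetical abstraction over the inventory client.
type inventory interface {
	HostStatus(name string) hostStatus
	SetJoined(name string) error
}

// pollReadyMaster runs one polling pass for a node that is reported ready.
// The node is appended to readyMasters only once its progress is known to be
// Joined or Done, so a failed update is retried on the next pass.
func pollReadyMaster(inv inventory, name string, readyMasters *[]string) error {
	if inv.HostStatus(name) != statusInstalled {
		// Only attempt the update when it can still succeed.
		if err := inv.SetJoined(name); err != nil {
			return err
		}
	}
	*readyMasters = append(*readyMasters, name)
	return nil
}

// racyInventory simulates the controller marking the host Done (and hence
// Installed) after our read but before our write.
type racyInventory struct {
	status            hostStatus
	controllerPending bool
}

func (r *racyInventory) HostStatus(string) hostStatus { return r.status }

func (r *racyInventory) SetJoined(string) error {
	if r.controllerPending {
		r.controllerPending = false
		r.status = statusInstalled
	}
	if r.status == statusInstalled {
		return errConflict
	}
	return nil
}

func main() {
	inv := &racyInventory{status: statusInstalling, controllerPending: true}
	var ready []string
	for attempt := 1; attempt <= 2; attempt++ {
		err := pollReadyMaster(inv, "master-0", &ready)
		fmt.Printf("attempt %d: err=%v readyMasters=%v\n", attempt, err, ready)
	}
}
```

In this sketch the first pass loses the race and returns the conflict without recording the node; the second pass sees the Installed status, skips the update, and records the node.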

Since 4xx error codes indicate a problem on the client side, most of them cannot be usefully retried at the HTTP transport level. For example, if a 409 Conflict is returned in response to a PUT request, we need to fetch the resource again with a GET before creating a new PUT request. Blocking for 30+ minutes in the original PUT call (without logging) is not helpful; we want the transport to return immediately so we can try again. Retry only on those 4xx error codes where it is conceivable that trying the same request again might work.
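As a rough illustration of that policy, a transport-level retry predicate might look like the sketch below. The function name `shouldRetryStatus` is hypothetical, and the specific set of retryable 4xx codes (408, 425, 429) is an assumption made for the example; the commit message does not list which codes the PR actually retries.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryStatus reports whether resending the identical request could
// plausibly succeed. The chosen 4xx codes are an assumption for illustration.
func shouldRetryStatus(code int) bool {
	switch {
	case code >= 500:
		// Server-side failures may be transient.
		return true
	case code == http.StatusRequestTimeout,
		code == http.StatusTooEarly,
		code == http.StatusTooManyRequests:
		// Client errors where the same request may work later.
		return true
	default:
		// Other 4xx codes, including 409 Conflict, need the caller to
		// change the request (for example, GET the resource again).
		return false
	}
}

func main() {
	for _, code := range []int{
		http.StatusConflict,
		http.StatusTooManyRequests,
		http.StatusServiceUnavailable,
	} {
		fmt.Printf("%d %s: retry=%t\n", code, http.StatusText(code), shouldRetryStatus(code))
	}
}
```

Returning `false` for 409 hands control back to the caller straight away, which is what allows the retry to happen at the logic level instead.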

zaneb · 2024-04-15T10:38:52Z

/cc @eranco74 @tsorya
/jira refresh

openshift-ci-robot · 2024-04-15T10:38:59Z

@zaneb: This pull request references Jira Issue OCPBUGS-32105, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

tsorya · 2024-04-15T16:02:53Z

src/installer/installer.go

@@ -776,16 +776,18 @@ func (i *installer) updateReadyMasters(nodes *v1.NodeList, readyMasters *[]strin
 			ctx := utils.GenerateRequestContext()
 			log := utils.RequestIDLogger(ctx, i.log)
 			log.Infof("Found a new ready master node %s with id %s", node.Name, node.Status.NodeInfo.SystemUUID)
-			*readyMasters = append(*readyMasters, node.Name)


I think it's better to leave it as is. If we failed to update the status but the node is ready, it's better to exit and continue the installation, no? Not setting the Joined state isn't critical; the controller will handle it afterwards.

Yeah, it's not as big a deal as I originally thought, because the controller will handle it anyway. But this does guarantee that everything is in the state we expect before we carry on to other work in here. It's still robust against unexpected nodes joining, and if the controller does win the race then this will still work on the next attempt.

tsorya · 2024-04-15T16:03:35Z

src/installer/installer.go

 			if !ok {
 				return fmt.Errorf("node %s is not in inventory hosts", node.Name)
 			}
-			ctx = utils.GenerateRequestContext()


Don't we want to set a request ID?

It's already set on line 776. We don't need to generate another one; we haven't even made a request with the first one yet. And this made it really hard to debug, since the request ID that showed up in the assisted-service log never appeared in the logs here.
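The debugging problem can be shown with a small sketch. The helpers `newRequestContext` and `requestID` below are hypothetical stand-ins (the real `utils.GenerateRequestContext` and `utils.RequestIDLogger` may work differently); the point is only that regenerating the context between the log line and the request yields two unrelated IDs.

```go
package main

import (
	"context"
	"fmt"
)

// requestIDKey is an unexported context key type, per the usual Go convention.
type requestIDKey struct{}

// counter feeds the illustrative ID generator; it is not goroutine-safe.
var counter int

// newRequestContext is a hypothetical stand-in that attaches a fresh ID.
func newRequestContext() context.Context {
	counter++
	id := fmt.Sprintf("req-%d", counter)
	return context.WithValue(context.Background(), requestIDKey{}, id)
}

// requestID extracts the ID stored by newRequestContext, or "" if absent.
func requestID(ctx context.Context) string {
	id, _ := ctx.Value(requestIDKey{}).(string)
	return id
}

func main() {
	ctx := newRequestContext()
	logged := requestID(ctx)

	// Before the fix (sketch): a second context is generated for the request,
	// so the ID seen by the server never appears in the local log.
	regenerated := requestID(newRequestContext())

	// After the fix (sketch): the same context is reused for the request.
	reused := requestID(ctx)

	fmt.Println("logged locally:    ", logged)
	fmt.Println("sent (regenerated):", regenerated, "matches log:", regenerated == logged)
	fmt.Println("sent (reused):     ", reused, "matches log:", reused == logged)
}
```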

zaneb · 2024-04-28T23:48:28Z

/retest-required

tsorya · 2024-04-29T09:40:02Z

/lgtm

tsorya · 2024-04-29T09:40:22Z

/approve

openshift-ci · 2024-04-29T09:42:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tsorya, zaneb

Needs approval from an approver in each of these files:

~~OWNERS~~ [tsorya]

openshift-ci-robot · 2024-04-29T11:26:12Z

/retest-required

Remaining retests: 0 against base HEAD 27e1b0d and 2 for PR HEAD 9a809af in total

openshift-ci-robot · 2024-05-02T10:24:02Z

/retest-required

Remaining retests: 0 against base HEAD 504ae08 and 1 for PR HEAD 9a809af in total

openshift-ci · 2024-05-02T15:04:40Z

@zaneb: all tests passed!

openshift-ci-robot · 2024-05-02T15:12:55Z

@zaneb: Jira Issue OCPBUGS-32105: All pull requests linked via external trackers have merged:

openshift/assisted-installer#823

Jira Issue OCPBUGS-32105 has been moved to the MODIFIED state.

openshift-bot · 2024-05-02T20:41:28Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-agent-installer-csr-approver-container-v4.16.0-202405021917.p0.gf548d32.assembly.stream.el9 for distgit ose-agent-installer-csr-approver.
All builds following this will include this PR.

openshift-bot · 2024-05-02T20:41:29Z

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-agent-installer-orchestrator-container-v4.16.0-202405021917.p0.gf548d32.assembly.stream.el9 for distgit ose-agent-installer-orchestrator.
All builds following this will include this PR.

openshift-merge-robot · 2024-05-03T14:49:12Z

Fix included in accepted release 4.16.0-0.nightly-2024-05-03-091818

zaneb · 2024-06-21T02:10:55Z

/cherry-pick release-4.15

openshift-cherrypick-robot · 2024-06-21T02:11:40Z

@zaneb: new pull request created: #859

carbonin · 2024-09-10T15:11:00Z

/cherry-pick release-ocm-2.10

openshift-cherrypick-robot · 2024-09-10T15:11:45Z

@carbonin: new pull request created: #900

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 15, 2024

openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 15, 2024

openshift-ci bot requested review from danielerez and tsorya April 15, 2024 10:28

zaneb added 2 commits April 15, 2024 22:36

zaneb force-pushed the node-ready-race branch from 4a91060 to 9a809af Compare April 15, 2024 10:36

openshift-ci bot requested a review from eranco74 April 15, 2024 10:38

openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 15, 2024

tsorya reviewed Apr 15, 2024

openshift-ci bot assigned tsorya Apr 29, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 29, 2024

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 29, 2024

openshift-merge-bot bot merged commit f548d32 into openshift:master May 2, 2024
10 checks passed

openshift-cherrypick-robot mentioned this pull request Jun 21, 2024

[release-4.15] OCPBUGS-35894: Fix race to mark node Joined #859

Merged

openshift-cherrypick-robot mentioned this pull request Sep 10, 2024

[release-ocm-2.10] MGMT-18868: Fix race to mark node Joined #900

Merged
