Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to skip blocking pod startup if driver is not ready to create a request yet #20

Merged
merged 6 commits into from
Mar 29, 2022

Conversation

munnerz
Copy link
Member

@munnerz munnerz commented Feb 18, 2022

This will help facilitate cert-manager/csi-driver#17 and any other uses where we want to permit the pod to startup even if the CSI driver is not yet ready.

This feature comes with the caveat that user applications/pods MUST be designed to tolerate certificate/private key data not being available when the pod first starts up (as the driver will no longer block pod startup until the volume is ready to request a certificate).

@munnerz munnerz requested a review from JoshVanL February 18, 2022 15:23
@jetstack-bot jetstack-bot added the dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. label Feb 18, 2022
@jetstack-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: munnerz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 18, 2022
manager/interfaces.go Outdated Show resolved Hide resolved
Copy link
Contributor

@7ing 7ing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm, thanks @munnerz for this effort.

@@ -146,5 +146,10 @@ func (m *MemoryFS) ReadFiles(volumeID string) (map[string][]byte, error) {
if !ok {
return nil, ErrNotFound
}
return vol, nil
// make a copy of the map to ensure no races can occur
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly using sync.RWMutex and RLock() to avoid the read race condition ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we return this map, it's not possible to do so as we can't enforce call-sites create a mutex. To push this onus onto the caller creates a fair bit of extra complexity and I'm not convinced that outweighs the performance gains.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got your idea. thanks.

Comment on lines 88 to 96
// Only wait for the volume to be ready if it is in a state of 'ready to request'
// already. This allows implementors to defer actually requesting certificates
// until later in the pod lifecycle (e.g. after CNI has run & an IP address has been
// allocated, if a user wants to embed pod IPs into their requests).
isReadyToRequest := ns.manager.IsVolumeReadyToRequest(req.GetVolumeId())
if isReadyToRequest || !ns.continueOnNotReady {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we simplify the logic here that, wait for volumeReady only if continueOnNotReady is false ?
Here, if a driver use default AlwaysReadyToRequest func, isReadyToRequest may always return true. In this case, continueOnNotReady becomes useless ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct - the logic is that if we are ready to request straight away, we wait/block. This was intentional so that we only skip waiting if we aren't 'ready' to wait (and it's been configured to skip in these cases)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. okay to me.

Copy link
Contributor

@JoshVanL JoshVanL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR makes sense to me and gives users a way to begin implementing these features which are blocked on kube order of operation.

The main thing that sticks out to me is that there is no way for a ReadyToRequest to signal early that it is ready, and instead has to wait for the check at the next backoff step (maximum of 1 min currently). A lack of this signal does potentially increase the issue time significantly, being exponential. Just missing the ~30 sec step for example, and then having to wait for another ~minute is quite painful. Even the fact it starts at 2 seconds could be quite significant when added up over a large number of volumes/pods.

Perhaps this is an non-issue as ReadyToRequestFunc is expected to return true after a brief period anyway, but something to keep in mind.

manager/interfaces.go Outdated Show resolved Hide resolved
test/integration/ready_to_request_test.go Outdated Show resolved Hide resolved
@munnerz munnerz force-pushed the continue-on-error branch 3 times, most recently from 38e3cb2 to 24ecb2c Compare March 24, 2022 14:23
@munnerz munnerz force-pushed the continue-on-error branch from 24ecb2c to e152da4 Compare March 29, 2022 08:49
@JoshVanL
Copy link
Contributor

PR is looking good to me!

/lgtrm

@JoshVanL
Copy link
Contributor

/lgtm

@jetstack-bot jetstack-bot added the lgtm Indicates that a PR is ready to be merged. label Mar 29, 2022
@jetstack-bot jetstack-bot merged commit 69abbbc into cert-manager:main Mar 29, 2022
@munnerz munnerz deleted the continue-on-error branch March 29, 2022 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. dco-signoff: yes Indicates that all commits in the pull request have the valid DCO sign-off message. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants