-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Configurable minimum worker nodecount #238
base: main
Are you sure you want to change the base?
Configurable minimum worker nodecount #238
Conversation
/test 4.15-openshift-e2e |
Hi @novasbc. Thanks for your PR. I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
e14b8aa
to
af2b099
Compare
Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... |
af2b099
to
2beddcb
Compare
@slintes I updated the description, included the issue # as well. Also, fixed the build which was failing with 'make verify-bundle', because the bundle hadn't been updated. |
Thanks! /test 4.16-openshift-e2e |
2beddcb
to
a99eed1
Compare
fixed an issue which was causing a failure with make test, regarding rebooter being nil |
/test 4.15-openshift-e2e |
/test 4.16-openshift-e2e |
/test 4.15-openshift-e2e |
/test 4.13-openshift-e2e |
1 similar comment
/test 4.13-openshift-e2e |
I looked into the e2e failures reported over the past few days and realized that it was due to temporary/environmental issues. When I re-ran they started passing better. We can't run the tests in an openshift environment, so weren't seeing the same things locally. Anyhow, I believe this is ready for review. Thanks! |
pkg/apicheck/check.go
Outdated
if peersToAsk == nil || len(peersToAsk) == 0 { | ||
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy") | ||
// TODO: maybe we need to check if this happens too much and reboot | ||
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit tricky, but as len(peersToAsk)
is zero if peersToAsk
is nil
, I think you can get rid of the first part and just use len(peersToAsk) < c.config.MinPeersForRemediation
.
- if
peersToAsk == nil
(and solen(...) == 0
) andc.config.MinPeersForRemediation != 0
, then alsolen(peersToAsk) < c.config.MinPeersForRemediation
is True. - For all the other combinations, we always need to evaluate the part after
||
anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did update this to try and make it more clear, and added more comments - and removed an unnecessary check.
7ecb865
to
9ac1e00
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, a unit test for this would be nice. I'm wondering if it's possible to add a test case similar to this one, but without peer and minPeers configured?
bundle/manifests/self-node-remediation.clusterserviceversion.yaml
Outdated
Show resolved
Hide resolved
9ac1e00
to
117656c
Compare
/retest |
We have been unable to properly run e2e tests like the one you referenced in our infrastructure (not openshift) - we spent some decent cycles trying to get it to work. Are you actually suggesting a unit test, or e2e? We did update one of the config unit tests, but that was super trivial & minor. |
If it's the case of an e2e test, we can try and write it and submit and run it through your infrastructure and some iterations - I just expect that to take quite some time to get right w/o the ability to run and debug locally. I'd certainly be willing to have our team put some more effort getting it running - but would love to potentially get this set of changes pushed upstream so that we are no longer using our locally forked version - which makes build & test additionally difficult. |
/test 4.12-openshift-e2e |
@novasbc: The specified target(s) for
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/retest |
I wrote unit test, and pointed to a unit test, so this is a yes :)
And we would like to have a test that verifies that at least setting |
Apologies, when I looked at the linked code, I thought it was actually executing as an e2e test, will look more closely. Locally, they were not running with the unit tests with make test |
Confirmed, I can get a unit test running in that context, it was misunderstanding of what was actually going on with the linked context. |
With this one can specify the number of worker peers needed to be able to contact before determining a node is unhealthy. It covers the case in which there are 3 control plane nodes and a single worker node, and yet you still want to be able to perform remediations on that worker node It has a default of 1, which maintains existing behaviors without explicitly altering the value.
117656c
to
627c3bb
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: novasbc The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@novasbc: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
@@ -445,22 +445,88 @@ var _ = Describe("SNR Controller", func() { | |||
remediationStrategy = v1alpha1.ResourceDeletionRemediationStrategy | |||
}) | |||
|
|||
It("Verify that watchdog is not receiving food after some time", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, any particular reason this test was removed ?
}, 10*shared.PeerUpdateInterval, timeout).Should(BeTrue()) | ||
AfterEach(func() { | ||
By("Restore default settings") | ||
apiConnectivityCheckConfig.MinPeersForRemediation = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think this makes more sense in the general After Each block
apiConnectivityCheckConfig.MinPeersForRemediation = 1 | ||
|
||
// sleep so config can update | ||
time.Sleep(time.Second * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed, IIUC config isn't being updated async here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that if I didn't put the sleep often times the config was not updated when the other routines ran. It took me quite some time to track this particular problem down 😭
It's even worse if I disable some tests so things run in different orders
|
||
Context("no peer found, and using default setting for MinPeersForRemediation", func() { | ||
BeforeEach(func() { | ||
apiConnectivityCheckConfig.MinPeersForRemediation = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is redundant because was already updated in the AfterEach block
Context("no peer found, and using default setting for MinPeersForRemediation", func() { | ||
BeforeEach(func() { | ||
apiConnectivityCheckConfig.MinPeersForRemediation = 1 | ||
snrConfig.Spec.MinPeersForRemediation = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is the default value, so I think it would make more sense to add it in GenerateTestConfig
@@ -203,6 +205,134 @@ var _ = BeforeSuite(func() { | |||
|
|||
}) | |||
|
|||
//var _ = BeforeEach(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftovers ? probably can be removed
Why we need this PR
Existing code requires there to be at least one other peer worker node before remediation can occur, precluding SNR from remediating on a configuration with 3 control plane nodes + 1 worker node, which is a scenario that we support for bare minimum deployments.
Changes made
Which issue(s) this PR fixes
Fixes #213
Test plan