Configurable minimum worker nodecount #238

Open
wants to merge 2 commits into base: main

Conversation

@novasbc novasbc commented Oct 2, 2024

Why we need this PR

The existing code requires at least one other peer worker node before remediation can occur. This prevents SNR from remediating in a configuration with 3 control plane nodes and 1 worker node, which is a scenario we support for bare-minimum deployments.

Changes made

  • Add a minPeersForRemediation configuration value. It defaults to 1, which maintains backward compatibility with existing deployments.
  • Update getWorkerPeersResponse to honor the new value, so that it no longer fails when there is no other peer and the user has set the minimum to zero.
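
As a rough illustration of the first bullet, the sketch below shows one way a default of 1 can coexist with an explicit 0. It is not the code from this PR: configSpec, effectiveMinPeers, parseSpec, and the pointer-typed field are hypothetical names and choices made for this example, and the real field's type and defaulting mechanism are not shown in this conversation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// configSpec is a hypothetical, trimmed-down stand-in for the SNR config spec.
// A pointer is used here so an omitted field can be told apart from an explicit 0.
type configSpec struct {
	MinPeersForRemediation *int `json:"minPeersForRemediation,omitempty"`
}

// defaultMinPeers matches the default of 1 stated in the PR description.
const defaultMinPeers = 1

// effectiveMinPeers returns the configured value, or the default when unset.
func effectiveMinPeers(spec configSpec) int {
	if spec.MinPeersForRemediation == nil {
		return defaultMinPeers
	}
	return *spec.MinPeersForRemediation
}

// parseSpec decodes a JSON document into a configSpec.
func parseSpec(raw string) (configSpec, error) {
	var spec configSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

func main() {
	inputs := []string{`{}`, `{"minPeersForRemediation": 0}`, `{"minPeersForRemediation": 2}`}
	for _, raw := range inputs {
		spec, err := parseSpec(raw)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> min peers %d\n", raw, effectiveMinPeers(spec))
	}
}
```

The point of the sketch is only that an unset value and an explicit zero must be distinguishable, otherwise a user could never opt in to the single-worker behavior.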

Which issue(s) this PR fixes

Fixes #213

Test plan

Author

novasbc commented Oct 2, 2024

/test 4.15-openshift-e2e

Contributor

openshift-ci bot commented Oct 2, 2024

Hi @novasbc. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

openshift-ci bot commented Oct 2, 2024

@novasbc: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test 4.15-openshift-e2e

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from e14b8aa to af2b099 Compare October 2, 2024 17:40
Member

slintes commented Oct 8, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes...
Also, please check the failed test.
Thanks

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from af2b099 to 2beddcb Compare October 16, 2024 14:54
Author

novasbc commented Oct 16, 2024

Hi, do you mind extending the description please? What's the issue, how do you fix it, how do you test the changes... Also, please check the failed test. Thanks

@slintes I updated the description, included the issue # as well.

Also, fixed the build which was failing with 'make verify-bundle', because the bundle hadn't been updated.

Member

slintes commented Oct 17, 2024

Thanks!

/test 4.16-openshift-e2e

@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from 2beddcb to a99eed1 Compare October 17, 2024 20:38
Author

novasbc commented Oct 17, 2024

Fixed an issue that caused make test to fail because rebooter was nil.

@novasbc novasbc changed the title from "Configurable minimum worker nodecount 2024 10 02" to "[WIP] Configurable minimum worker nodecount 2024 10 02" on Oct 18, 2024
Author

novasbc commented Oct 18, 2024

/test 4.15-openshift-e2e

Author

novasbc commented Oct 18, 2024

/test 4.16-openshift-e2e

@novasbc novasbc changed the title from "[WIP] Configurable minimum worker nodecount 2024 10 02" to "Configurable minimum worker nodecount" on Oct 22, 2024
@novasbc novasbc marked this pull request as ready for review October 22, 2024 14:50
@openshift-ci openshift-ci bot requested review from mshitrit and razo7 October 22, 2024 14:50
Author

novasbc commented Oct 22, 2024

/test 4.15-openshift-e2e

Author

novasbc commented Oct 22, 2024

/test 4.13-openshift-e2e

Author

novasbc commented Oct 23, 2024

/test 4.13-openshift-e2e

Author

novasbc commented Oct 23, 2024

@razo7 @mshitrit

I looked into the e2e failures reported over the past few days and found that they were caused by temporary, environmental issues. When I re-ran the tests, they started passing more reliably. We can't run the tests in an OpenShift environment ourselves, so we weren't seeing the same failures locally.

Anyhow, I believe this is ready for review.

Thanks!

Makefile (resolved)
pkg/apicheck/check.go (outdated, resolved)
if peersToAsk == nil || len(peersToAsk) == 0 {
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")
// TODO: maybe we need to check if this happens too much and reboot
if peersToAsk == nil && c.config.MinPeersForRemediation != 0 || len(peersToAsk) < c.config.MinPeersForRemediation {
Contributor

It's a bit tricky, but as len(peersToAsk) is zero if peersToAsk is nil, I think you can get rid of the first part and just use len(peersToAsk) < c.config.MinPeersForRemediation.

  • if peersToAsk == nil (and so len(...) == 0) and c.config.MinPeersForRemediation != 0, then also len(peersToAsk) < c.config.MinPeersForRemediation is True.
  • For all the other combinations, we always need to evaluate the part after || anyway
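
The reviewer's argument can be checked mechanically. The sketch below compares the expression under review with the suggested simplification over a small grid of inputs. The function names and peer names are made up for this example, and it assumes the configured minimum is never negative (with a negative minimum and a nil slice the two forms would differ).

```go
package main

import "fmt"

// originalCondition mirrors the expression under review. In Go, && binds
// tighter than ||, so this reads as (nil && min != 0) || (len < min).
func originalCondition(peersToAsk []string, minPeers int) bool {
	return peersToAsk == nil && minPeers != 0 || len(peersToAsk) < minPeers
}

// simplifiedCondition is the reviewer's suggested form. It relies on the
// fact that len of a nil slice is 0.
func simplifiedCondition(peersToAsk []string, minPeers int) bool {
	return len(peersToAsk) < minPeers
}

// conditionsAgree reports whether both forms match for nil, empty, and
// non-empty peer lists combined with non-negative minimums.
func conditionsAgree() bool {
	peerLists := [][]string{nil, {}, {"worker-1"}, {"worker-1", "worker-2"}}
	for _, peers := range peerLists {
		for minPeers := 0; minPeers <= 3; minPeers++ {
			if originalCondition(peers, minPeers) != simplifiedCondition(peers, minPeers) {
				return false
			}
		}
	}
	return true
}

func main() {
	fmt.Println("conditions agree:", conditionsAgree())
}
```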

Author

We updated this to make it clearer, added more comments, and removed an unnecessary check.

pkg/apicheck/check.go (outdated, resolved)
api/v1alpha1/selfnoderemediationconfig_types.go (outdated, resolved)
api/v1alpha1/selfnoderemediationconfig_types.go (outdated, resolved)
pkg/apicheck/check.go (outdated, resolved)
pkg/apicheck/check.go (outdated, resolved)
pkg/apicheck/check.go (outdated, resolved)
@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch 2 times, most recently from 7ecb865 to 9ac1e00 Compare November 6, 2024 20:14
Member

@slintes slintes left a comment

hm, a unit test for this would be nice. I'm wondering if it's possible to add a test case similar to this one, but without peer and minPeers configured?

https://github.com/medik8s/self-node-remediation/blob/main/controllers/tests/controller/selfnoderemediation_controller_test.go#L438-L464

self-node-remediation.iml (outdated, resolved)
@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from 9ac1e00 to 117656c Compare November 18, 2024 15:04
Author

novasbc commented Nov 19, 2024

/retest

Author

novasbc commented Nov 19, 2024

@slintes

hm, a unit test for this would be nice. I'm wondering if it's possible to add a test case similar to this one, but without peer and minPeers configured?

https://github.com/medik8s/self-node-remediation/blob/main/controllers/tests/controller/selfnoderemediation_controller_test.go#L438-L464

We have been unable to properly run e2e tests like the one you referenced on our infrastructure (which is not OpenShift), despite spending a fair amount of time trying to get them to work.

Are you actually suggesting a unit test, or an e2e test? We did update one of the config unit tests, but that change was trivial.

Author

novasbc commented Nov 19, 2024

If it is an e2e test, we can try to write it, submit it, and run it through your infrastructure over a few iterations. I just expect that to take quite some time to get right without the ability to run and debug locally.

I'd certainly be willing to have our team put more effort into getting it running, but we would love to get this set of changes pushed upstream so that we no longer depend on our locally forked version, which makes build and test more difficult.

Author

novasbc commented Nov 19, 2024

/test 4.12-openshift-e2e

Contributor

openshift-ci bot commented Nov 19, 2024

@novasbc: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test 4.14-ci-bundle-self-node-remediation-bundle
  • /test 4.14-images
  • /test 4.14-openshift-e2e
  • /test 4.14-test
  • /test 4.15-ci-bundle-self-node-remediation-bundle
  • /test 4.15-images
  • /test 4.15-openshift-e2e
  • /test 4.15-test
  • /test 4.16-ci-bundle-self-node-remediation-bundle
  • /test 4.16-images
  • /test 4.16-openshift-e2e
  • /test 4.16-test
  • /test 4.17-ci-bundle-self-node-remediation-bundle
  • /test 4.17-images
  • /test 4.17-openshift-e2e
  • /test 4.17-test
  • /test 4.18-ci-bundle-self-node-remediation-bundle
  • /test 4.18-images
  • /test 4.18-openshift-e2e
  • /test 4.18-test

Use /test all to run all jobs.

In response to this:

/test 4.12-openshift-e2e

Author

novasbc commented Nov 19, 2024

/retest

Member

slintes commented Nov 20, 2024

Are you actually suggesting a unit test

I wrote unit test, and pointed to a unit test, so this is a yes :)

but would love to potentially get this set of changes pushed upstream

And we would like to have a test that verifies that at least setting minPeersForRemediation to 0 has the desired effect.
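
To make the requested verification concrete, here is a minimal table-driven sketch of the cases such a test would need to cover. It does not use the project's Ginkgo suite or its real types: tooFewPeers, guardCase, and the peer names are hypothetical, and the guard is reduced to the single comparison discussed in the review above.

```go
package main

import "fmt"

// tooFewPeers is a hypothetical stand-in for the guard discussed in this PR.
// It returns true when fewer peers are available than the configured minimum,
// in which case the node is considered healthy and is not remediated.
func tooFewPeers(peersToAsk []string, minPeersForRemediation int) bool {
	return len(peersToAsk) < minPeersForRemediation
}

// guardCase describes one input combination and the expected guard result.
type guardCase struct {
	name     string
	peers    []string
	minPeers int
	want     bool
}

// guardCases lists the default behavior and the new zero-minimum behavior.
func guardCases() []guardCase {
	return []guardCase{
		{name: "no peers, default minimum", peers: nil, minPeers: 1, want: true},
		{name: "no peers, minimum zero", peers: nil, minPeers: 0, want: false},
		{name: "one peer, default minimum", peers: []string{"worker-1"}, minPeers: 1, want: false},
	}
}

// failedCases returns the names of all cases whose result differs from want.
func failedCases() []string {
	var failed []string
	for _, tc := range guardCases() {
		if got := tooFewPeers(tc.peers, tc.minPeers); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```

The second case is the one that matters for this PR: with no peers and a minimum of zero, the guard no longer short-circuits to "healthy", so remediation of a lone worker node can proceed.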

Author

novasbc commented Nov 20, 2024

Are you actually suggesting a unit test

I wrote unit test, and pointed to a unit test, so this is a yes :)

but would love to potentially get this set of changes pushed upstream

And we would like to have a test that verifies that at least setting minPeersForRemediation to 0 has the desired effect.

Apologies, when I looked at the linked code, I thought it was executing as an e2e test. I will look more closely. Locally, those tests were not running as part of the unit tests with make test.

Author

novasbc commented Nov 20, 2024

Confirmed, I can get a unit test running in that context. I had misunderstood what was going on in the linked code.

With this, one can specify the minimum number of worker peers that must
be contactable before a node is determined to be unhealthy.

It covers the case in which there are 3 control plane nodes and a single
worker node, and yet you still want to be able to perform remediations
on that worker node.

It has a default of 1, which maintains existing behavior unless the
value is explicitly changed.
@novasbc novasbc force-pushed the configurable_minimum_worker_nodecount_2024-10-02 branch from 117656c to 627c3bb Compare January 8, 2025 19:17
Contributor

openshift-ci bot commented Jan 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: novasbc
Once this PR has been reviewed and has the lgtm label, please ask for approval from clobrano. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Jan 8, 2025

@novasbc: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/4.12-openshift-e2e | 9ac1e00 | link | true | /test 4.12-openshift-e2e |
| ci/prow/4.14-openshift-e2e | 627c3bb | link | true | /test 4.14-openshift-e2e |

Full PR test history. Your PR dashboard.

@@ -445,22 +445,88 @@ var _ = Describe("SNR Controller", func() {
remediationStrategy = v1alpha1.ResourceDeletionRemediationStrategy
})

It("Verify that watchdog is not receiving food after some time", func() {
Member

Hi, any particular reason this test was removed?

}, 10*shared.PeerUpdateInterval, timeout).Should(BeTrue())
AfterEach(func() {
By("Restore default settings")
apiConnectivityCheckConfig.MinPeersForRemediation = 1
Member

Nit: I think this makes more sense in the general AfterEach block

apiConnectivityCheckConfig.MinPeersForRemediation = 1

// sleep so config can update
time.Sleep(time.Second * 2)
Member

I don't think this is needed, IIUC config isn't being updated async here

Author

I verified that without the sleep, the config was often not yet updated when the other routines ran. It took me quite some time to track this particular problem down 😭

It's even worse if I disable some tests so things run in different orders


Context("no peer found, and using default setting for MinPeersForRemediation", func() {
BeforeEach(func() {
apiConnectivityCheckConfig.MinPeersForRemediation = 1
Member

I think this is redundant because it was already updated in the AfterEach block

Context("no peer found, and using default setting for MinPeersForRemediation", func() {
BeforeEach(func() {
apiConnectivityCheckConfig.MinPeersForRemediation = 1
snrConfig.Spec.MinPeersForRemediation = 1
Member

IIUC this is the default value, so I think it would make more sense to add it in GenerateTestConfig

@@ -203,6 +205,134 @@ var _ = BeforeSuite(func() {

})

//var _ = BeforeEach(func() {
Member

Leftovers? These can probably be removed.

Successfully merging this pull request may close these issues.

Support for remediation on single worker node configurations
5 participants