Configurable minimum worker nodecount #238

Open · wants to merge 5 commits into main
1 change: 1 addition & 0 deletions .gitignore
@@ -19,6 +19,7 @@ bin

# editor and IDE paraphernalia
.idea
*.iml
*.swp
*.swo
*~
2 changes: 1 addition & 1 deletion Makefile
@@ -160,7 +160,7 @@ manifests: controller-gen ## Generate WebhookConfiguration, ClusterRole and Cust
.PHONY: generate
generate: controller-gen protoc ## Generate code containing DeepCopy, DeepCopyInto, and DeepCopyObject method implementations. Also generate protoc / gRPC code
$(CONTROLLER_GEN) object:headerFile="hack/boilerplate.go.txt" paths="./..."
PATH=$(PATH):$(shell pwd)/bin/proto/bin && $(PROTOC) --go_out=. --go-grpc_out=. pkg/peerhealth/peerhealth.proto
PATH='$(PATH)':$(shell pwd)/bin/proto/bin && $(PROTOC) --go_out=. --go-grpc_out=. pkg/peerhealth/peerhealth.proto
clobrano marked this conversation as resolved.

.PHONY: fmt
fmt: ## Run go fmt against code.
10 changes: 10 additions & 0 deletions api/v1alpha1/selfnoderemediationconfig_types.go
@@ -28,6 +28,7 @@ const (
ConfigCRName = "self-node-remediation-config"
defaultWatchdogPath = "/dev/watchdog"
defaultIsSoftwareRebootEnabled = true
defaultMinPeersForRemediation = 1
)

// SelfNodeRemediationConfigSpec defines the desired state of SelfNodeRemediationConfig
@@ -127,6 +128,15 @@ type SelfNodeRemediationConfigSpec struct {
// CustomDsTolerations allows to add custom tolerations snr agents that are running on the ds in order to support remediation for different types of nodes.
// +optional
CustomDsTolerations []v1.Toleration `json:"customDsTolerations,omitempty"`

// Minimum number of peer workers/control nodes to attempt to contact before deciding whether the node is unhealthy.
// If set to zero, no other peers will be required to be present for remediation action to occur when this
// node has lost API server access. If an insufficient number of peers is found, we will not attempt to ask
// any peer nodes (if present) whether they see that the current node has been marked unhealthy with a
// SelfNodeRemediation CR.
// +kubebuilder:default:=1
// +kubebuilder:validation:Minimum=0
MinPeersForRemediation int `json:"minPeersForRemediation,omitempty"`
}

// SelfNodeRemediationConfigStatus defines the observed state of SelfNodeRemediationConfig
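For context, a minimal sketch of what the new knob looks like when set on the config CR. The API package import path and the namespace are inferred from identifiers visible elsewhere in this diff, not prescribed by it:

```go
package config

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// assumed import path, based on the module and the api/v1alpha1 file touched in this PR
	selfnoderemediationv1alpha1 "github.com/medik8s/self-node-remediation/api/v1alpha1"
)

// exampleConfig returns a config CR with the new knob relaxed to zero, meaning a node
// that has lost API server access may remediate itself even if it cannot find any peers.
func exampleConfig() *selfnoderemediationv1alpha1.SelfNodeRemediationConfig {
	return &selfnoderemediationv1alpha1.SelfNodeRemediationConfig{
		ObjectMeta: metav1.ObjectMeta{
			Name:      selfnoderemediationv1alpha1.ConfigCRName, // "self-node-remediation-config"
			Namespace: "self-node-remediation",                  // assumed deployment namespace
		},
		Spec: selfnoderemediationv1alpha1.SelfNodeRemediationConfigSpec{
			MinPeersForRemediation: 0, // the kubebuilder default is 1
		},
	}
}
```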
@@ -127,6 +127,17 @@ spec:
its peers.
minimum: 1
type: integer
minPeersForRemediation:
default: 1
description: "Minimum number of peer workers/control nodes to attempt
to contact before deciding if node is unhealthy or not\n\tif set
to zero, no other peers will be required to be present for remediation
action to occur when this\n\tnode has lost API server access. If
an insufficient number of peers are found, we will not attempt to
ask\n\tany peer nodes (if present) whether they see that the current
node has been marked unhealthy with a\n\tSelfNodeRemediation CR"
minimum: 0
type: integer
peerApiServerTimeout:
default: 5s
description: |-
@@ -125,6 +125,17 @@ spec:
its peers.
minimum: 1
type: integer
minPeersForRemediation:
default: 1
description: "Minimum number of peer workers/control nodes to attempt
to contact before deciding if node is unhealthy or not\n\tif set
to zero, no other peers will be required to be present for remediation
action to occur when this\n\tnode has lost API server access. If
an insufficient number of peers are found, we will not attempt to
ask\n\tany peer nodes (if present) whether they see that the current
node has been marked unhealthy with a\n\tSelfNodeRemediation CR"
minimum: 0
type: integer
peerApiServerTimeout:
default: 5s
description: |-
1 change: 1 addition & 0 deletions controllers/selfnoderemediationconfig_controller.go
@@ -153,6 +153,7 @@ func (r *SelfNodeRemediationConfigReconciler) syncConfigDaemonSet(ctx context.Co
data.Data["PeerRequestTimeout"] = snrConfig.Spec.PeerRequestTimeout.Nanoseconds()
data.Data["MaxApiErrorThreshold"] = snrConfig.Spec.MaxApiErrorThreshold
data.Data["EndpointHealthCheckUrl"] = snrConfig.Spec.EndpointHealthCheckUrl
data.Data["MinPeersForRemediation"] = snrConfig.Spec.MinPeersForRemediation
data.Data["HostPort"] = snrConfig.Spec.HostPort
data.Data["IsSoftwareRebootEnabled"] = fmt.Sprintf("\"%t\"", snrConfig.Spec.IsSoftwareRebootEnabled)

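The reconciler only has to drop the value into data.Data; the daemonset template (shown further down, in install/self-node-remediation-deamonset.yaml) turns it into an env var. A stand-alone sketch of that templating step, not the operator's actual renderer:

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// snippet mirrors the env entry added to the install daemonset in this PR
	const snippet = `        - name: MIN_PEERS_FOR_REMEDIATION
          value: "{{.MinPeersForRemediation}}"
`
	data := map[string]interface{}{"MinPeersForRemediation": 1}
	tmpl := template.Must(template.New("env").Parse(snippet))
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
}
```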
@@ -238,6 +238,7 @@ var _ = Describe("SNR Config Test", func() {
Expect(createdConfig.Spec.ApiServerTimeout.Seconds()).To(BeEquivalentTo(5))
Expect(createdConfig.Spec.ApiCheckInterval.Seconds()).To(BeEquivalentTo(15))
Expect(createdConfig.Spec.PeerUpdateInterval.Seconds()).To(BeEquivalentTo(15 * 60))
Expect(createdConfig.Spec.MinPeersForRemediation).To(BeEquivalentTo(1))
})
})

16 changes: 9 additions & 7 deletions controllers/tests/config/suite_test.go
@@ -120,14 +120,16 @@ var _ = BeforeSuite(func() {
Expect(err).ToNot(HaveOccurred())

certReader = certificates.NewSecretCertStorage(k8sClient, ctrl.Log.WithName("SecretCertStorage"), shared.Namespace)

apiConnectivityCheckConfig := &apicheck.ApiConnectivityCheckConfig{
Log: ctrl.Log.WithName("api-check"),
MyNodeName: shared.UnhealthyNodeName,
CheckInterval: shared.ApiCheckInterval,
MaxErrorsThreshold: shared.MaxErrorThreshold,
Peers: peers,
Cfg: cfg,
CertReader: certReader,
Log: ctrl.Log.WithName("api-check"),
MyNodeName: shared.UnhealthyNodeName,
CheckInterval: shared.ApiCheckInterval,
MaxErrorsThreshold: shared.MaxErrorThreshold,
MinPeersForRemediation: shared.MinPeersForRemediation,
Peers: peers,
Cfg: cfg,
CertReader: certReader,
}
apiCheck := apicheck.New(apiConnectivityCheckConfig, nil)
err = k8sManager.Add(apiCheck)
64 changes: 37 additions & 27 deletions controllers/tests/controller/selfnoderemediation_controller_test.go
@@ -25,6 +25,7 @@ import (
"github.com/medik8s/self-node-remediation/controllers"
"github.com/medik8s/self-node-remediation/controllers/tests/shared"
"github.com/medik8s/self-node-remediation/pkg/utils"
"github.com/medik8s/self-node-remediation/pkg/watchdog"
)

const (
@@ -44,6 +45,9 @@ var _ = Describe("SNR Controller", func() {
snr.Namespace = snrNamespace
snrConfig = shared.GenerateTestConfig()
time.Sleep(time.Second * 2)

// reset watchdog for each test!
dummyDog.Reset()
})

JustBeforeEach(func() {
@@ -62,6 +66,10 @@
k8sClient.ShouldSimulateFailure = false
k8sClient.ShouldSimulatePodDeleteFailure = false
isAdditionalSetupNeeded = false

By("Restore default settings for api connectivity check")
apiConnectivityCheckConfig.MinPeersForRemediation = 1

deleteRemediations()
deleteSelfNodeRemediationPod()
//clear node's state, this is important to remove taints, label etc.
@@ -183,7 +191,7 @@ var _ = Describe("SNR Controller", func() {

verifyTimeHasBeenRebootedExists(snr)

verifyNoWatchdogFood()
verifyWatchdogTriggered()

verifyEvent("Normal", "NodeReboot", "Remediation process - about to attempt fencing the unhealthy node by rebooting it")

@@ -233,7 +241,7 @@

verifyTimeHasBeenRebootedExists(snr)

verifyNoWatchdogFood()
verifyWatchdogTriggered()

// The kube-api calls for VA fail intentionally. In this case, we expect the snr agent to try
// to delete node resources again. So LastError is set to the error every time Reconcile()
@@ -315,7 +323,7 @@

verifyTimeHasBeenRebootedExists(snr)

verifyNoWatchdogFood()
verifyWatchdogTriggered()

verifyFinalizerExists(snr)

@@ -436,31 +444,29 @@ var _ = Describe("SNR Controller", func() {
})

Context("Unhealthy node without api-server access", func() {

// this is not a controller test anymore... it's testing peers. But keep it here for now...

BeforeEach(func() {
By("Simulate api-server failure")
k8sClient.ShouldSimulateFailure = true
remediationStrategy = v1alpha1.ResourceDeletionRemediationStrategy
})

It("Verify that watchdog is not receiving food after some time", func() {
Member: Hi, any particular reason this test was removed?

Author: it wasn't removed, I changed the text to what I believed possibly described it better. It's now in a separate context under the general context of "unhealthy node without api-server access" since both have similar requirements.

lastFoodTime := dummyDog.LastFoodTime()
timeout := dummyDog.GetTimeout()
Eventually(func() bool {
newTime := dummyDog.LastFoodTime()
// ensure the timeout passed
timeoutPassed := time.Now().After(lastFoodTime.Add(3 * timeout))
// ensure wd wasn't feeded
missedFeed := newTime.Before(lastFoodTime.Add(timeout))
if timeoutPassed && missedFeed {
return true
}
lastFoodTime = newTime
return false
}, 10*shared.PeerUpdateInterval, timeout).Should(BeTrue())
Context("no peer found", func() {
It("Verify that watchdog is not triggered", func() {
verifyWatchdogNotTriggered()
})
})

Context("no peer found and MinPeersForRemediation is configured to 0", func() {
It("Does not receive peer communication and since configured to need zero peers, initiates a reboot",
func() {
verifyWatchdogNotTriggered()
Member: nit: this is duplicating above test

Author: So, I included this because I need to ensure that the watchdog hasn't already been triggered. If the watchdog was already triggered, then the test would succeed as a false positive.

The other test may have tested it individually but that doesn't guarantee that by the time this test executes that it was in that state prior to execution. Test ordering could have a dramatic effect on this.

I can remove it to be sure if you really want that, I just wanted to avoid possible future false positives.

Member: In general the goal is that test ordering should not matter. BeforeEach() and AfterEach() should bring the system into a known initial state for each test. I know though that this isn't easy always, and for these tests ordering did matter for sure. However, with the new reset of the watchdog that should be fixed? There also is a test that Reset() works as expected 🤔

But if test ordering still is an issue for this specific test, as a quick workaround verifyWatchdogNotTriggered() should be added after the reset call, not here, so it's bullet proof for all tests. I would revisit the reset() function in a follow up then.


apiConnectivityCheckConfig.MinPeersForRemediation = 0
Member: nit: having this in a BeforeEach() would be a bit cleaner

Author: thanks - I originally had it in a BeforeEach but due to other context I had thought that it was preferable by the team for it to be moved into the test - since it was only for this one individual test.

will move it back and submit after I'm able to get the tests running again.

Member: it's a bit matter of taste, whether it's worth for a single test to have a BeforeEach() block... I think it's cleaner, but again, a nit only and no blocker 🙂


verifyWatchdogTriggered()
})
})

})

Context("Configuration is missing", func() {
@@ -532,12 +538,16 @@ func verifyNodeIsSchedulable() {
}, 95*time.Second, 250*time.Millisecond).Should(BeFalse())
}

func verifyNoWatchdogFood() {
By("Verify that watchdog is not receiving food")
currentLastFoodTime := dummyDog.LastFoodTime()
ConsistentlyWithOffset(1, func() time.Time {
return dummyDog.LastFoodTime()
}, 5*dummyDog.GetTimeout(), 1*time.Second).Should(Equal(currentLastFoodTime), "watchdog should not receive food anymore")
func verifyWatchdogTriggered() {
EventuallyWithOffset(1, func(g Gomega) {
g.Expect(dummyDog.Status()).To(Equal(watchdog.Triggered))
}, 5*dummyDog.GetTimeout(), 1*time.Second).Should(Succeed(), "watchdog should be triggered")
}

func verifyWatchdogNotTriggered() {
ConsistentlyWithOffset(1, func(g Gomega) {
g.Expect(dummyDog.Status()).To(Equal(watchdog.Armed))
}, 5*dummyDog.GetTimeout(), 1*time.Second).Should(Succeed(), "watchdog should not be triggered")
}

func verifyFinalizerExists(snr *v1alpha1.SelfNodeRemediation) {
32 changes: 17 additions & 15 deletions controllers/tests/controller/suite_test.go
@@ -53,13 +53,14 @@ import (
// http://onsi.github.io/ginkgo/ to learn more about Ginkgo.

var (
testEnv *envtest.Environment
dummyDog watchdog.Watchdog
unhealthyNode, peerNode = &v1.Node{}, &v1.Node{}
cancelFunc context.CancelFunc
k8sClient *shared.K8sClientWrapper
fakeRecorder *record.FakeRecorder
snrConfig *selfnoderemediationv1alpha1.SelfNodeRemediationConfig
testEnv *envtest.Environment
dummyDog watchdog.Watchdog
unhealthyNode, peerNode = &v1.Node{}, &v1.Node{}
cancelFunc context.CancelFunc
k8sClient *shared.K8sClientWrapper
fakeRecorder *record.FakeRecorder
snrConfig *selfnoderemediationv1alpha1.SelfNodeRemediationConfig
apiConnectivityCheckConfig *apicheck.ApiConnectivityCheckConfig
)

var unhealthyNodeNamespacedName = client.ObjectKey{
@@ -152,14 +153,15 @@ var _ = BeforeSuite(func() {
Expect(err).ToNot(HaveOccurred())

rebooter := reboot.NewWatchdogRebooter(dummyDog, ctrl.Log.WithName("rebooter"))
apiConnectivityCheckConfig := &apicheck.ApiConnectivityCheckConfig{
Log: ctrl.Log.WithName("api-check"),
MyNodeName: shared.UnhealthyNodeName,
CheckInterval: shared.ApiCheckInterval,
MaxErrorsThreshold: shared.MaxErrorThreshold,
Peers: peers,
Rebooter: rebooter,
Cfg: cfg,
apiConnectivityCheckConfig = &apicheck.ApiConnectivityCheckConfig{
Log: ctrl.Log.WithName("api-check"),
MyNodeName: shared.UnhealthyNodeName,
CheckInterval: shared.ApiCheckInterval,
MaxErrorsThreshold: shared.MaxErrorThreshold,
Peers: peers,
Rebooter: rebooter,
Cfg: cfg,
MinPeersForRemediation: shared.MinPeersForRemediation,
}
apiCheck := apicheck.New(apiConnectivityCheckConfig, nil)
err = k8sManager.Add(apiCheck)
9 changes: 5 additions & 4 deletions controllers/tests/shared/shared.go
@@ -17,9 +17,10 @@ import (
)

const (
PeerUpdateInterval = 30 * time.Second
ApiCheckInterval = 1 * time.Second
MaxErrorThreshold = 1
PeerUpdateInterval = 30 * time.Second
ApiCheckInterval = 1 * time.Second
MaxErrorThreshold = 1
MinPeersForRemediation = 1
// CalculatedRebootDuration is the mock calculator's result
CalculatedRebootDuration = 3 * time.Second
Namespace = "self-node-remediation"
@@ -57,7 +58,7 @@ func GenerateTestConfig() *selfnoderemediationv1alpha1.SelfNodeRemediationConfig
Name: selfnoderemediationv1alpha1.ConfigCRName,
Namespace: Namespace,
},
Spec: selfnoderemediationv1alpha1.SelfNodeRemediationConfigSpec{},
Spec: selfnoderemediationv1alpha1.SelfNodeRemediationConfigSpec{MinPeersForRemediation: 1},
}
}

2 changes: 2 additions & 0 deletions install/self-node-remediation-deamonset.yaml
@@ -72,6 +72,8 @@ spec:
value: {{.EndpointHealthCheckUrl}}
- name: HOST_PORT
value: "{{.HostPort}}"
- name: MIN_PEERS_FOR_REMEDIATION
value: "{{.MinPeersForRemediation}}"
image: {{.Image}}
imagePullPolicy: Always
volumeMounts:
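How the agent picks up MIN_PEERS_FOR_REMEDIATION is outside this diff; a hypothetical, self-contained reader that falls back to the documented default of 1 could look like this (the helper name and fallback behaviour are assumptions, not the PR's implementation):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// minPeersFromEnv returns the configured minimum peer count, defaulting to 1
// when the variable is unset, unparsable, or negative.
func minPeersFromEnv() int {
	const defaultMinPeers = 1
	raw, ok := os.LookupEnv("MIN_PEERS_FOR_REMEDIATION")
	if !ok {
		return defaultMinPeers
	}
	v, err := strconv.Atoi(raw)
	if err != nil || v < 0 {
		return defaultMinPeers
	}
	return v
}

func main() {
	fmt.Println("MinPeersForRemediation:", minPeersFromEnv())
}
```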
21 changes: 19 additions & 2 deletions pkg/apicheck/check.go
@@ -50,6 +50,7 @@ type ApiConnectivityCheckConfig struct {
PeerRequestTimeout time.Duration
PeerHealthPort int
MaxTimeForNoPeersResponse time.Duration
MinPeersForRemediation int
}

func New(config *ApiConnectivityCheckConfig, controlPlaneManager *controlplane.Manager) *ApiConnectivityCheck {
@@ -129,12 +130,28 @@ func (c *ApiConnectivityCheck) getWorkerPeersResponse() peers.Response {

c.config.Log.Info("Error count exceeds threshold, trying to ask other nodes if I'm healthy")
peersToAsk := c.config.Peers.GetPeersAddresses(peers.Worker)
if peersToAsk == nil || len(peersToAsk) == 0 {
c.config.Log.Info("Peers list is empty and / or couldn't be retrieved from server, nothing we can do, so consider the node being healthy")

// We check to see if we have at least the number of peers that the user has configured as required.
// If we don't have this many peers (for instance there are zero peers, and the default value is set
// which requires at least one peer), we don't want to remediate. In this case we have some confusion
// and don't want to remediate a node when we shouldn't. Note: It would be unusual for MinPeersForRemediation
// to be greater than 1 unless the environment has specific requirements.
if len(peersToAsk) < c.config.MinPeersForRemediation {
c.config.Log.Info("Ignoring api-server error as we have an insufficient number of peers found, "+
"so we aren't going to attempt to contact any to check for a SelfNodeRemediation CR"+
" - we will consider it as if there was no CR present & as healthy.", "minPeersRequired",
c.config.MinPeersForRemediation, "actualNumPeersFound", len(peersToAsk))

// TODO: maybe we need to check if this happens too much and reboot
return peers.Response{IsHealthy: true, Reason: peers.HealthyBecauseNoPeersWereFound}
}

// If we make it here and there are no peers, we can't proceed because we need at least one peer
// to check. So it doesn't make sense to continue on - we'll mark as unhealthy and exit fast
if len(peersToAsk) == 0 {
return peers.Response{IsHealthy: false, Reason: peers.UnHealthyBecauseNodeIsIsolated}
Member: please add a log statement before returning here, similar to the other returns

}

apiErrorsResponsesSum := 0
nrAllPeers := len(peersToAsk)
// peersToAsk is being reduced at every iteration, iterate until no peers left to ask
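To condense the branch order that getWorkerPeersResponse now follows (peer shortfall first, then the zero-peer isolation case, then the normal peer query), here is a self-contained restatement; the decide helper below is illustrative only and does not exist in the codebase:

```go
package main

import "fmt"

type peerDecision string

const (
	skipRemediation peerDecision = "assume healthy: fewer peers than MinPeersForRemediation"
	nodeIsolated    peerDecision = "unhealthy: zero peers found, cannot ask anyone"
	askPeers        peerDecision = "ask peers whether a SelfNodeRemediation CR exists for this node"
)

// decide mirrors the order of the checks added in this PR: the configured minimum
// is evaluated first, then the zero-peer case, otherwise the peers are queried.
func decide(peersFound, minPeersForRemediation int) peerDecision {
	if peersFound < minPeersForRemediation {
		return skipRemediation
	}
	if peersFound == 0 { // only reachable when minPeersForRemediation == 0
		return nodeIsolated
	}
	return askPeers
}

func main() {
	fmt.Println(decide(0, 1)) // default config, no peers: skip remediation
	fmt.Println(decide(0, 0)) // MinPeersForRemediation=0, no peers: treated as isolated
	fmt.Println(decide(3, 1)) // normal cluster: go ask the peers
}
```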