Add KaaS robustness feature tests #714

Open: 7 commits to be merged into base branch main

Conversation

cah-patrickthiem
Contributor

@cah-patrickthiem cah-patrickthiem commented Aug 28, 2024

This PR adds tests for the K8s cluster robustness features defined in the SCS standard scs-0215-v1-robustness-features.
The following is a detailed listing of what is tested:

SCS-0215-v1 Robustness Features Test Coverage
1. API Server Rate Limiting
Test_scs_0215_requestLimits

  • Verifies basic request limit configurations

  • Checks API server configuration for required settings

Test_scs_0215_minRequestTimeout

  • Validates min-request-timeout setting

  • Checks configuration in API server args

Test_scs_0215_eventRateLimit

  • Confirms EventRateLimit admission controller configuration

  • Verifies plugin is enabled in API server

Test_scs_0215_apiPriorityAndFairness

  • Checks APF feature gate enablement

  • Validates API server configuration for priority and fairness

Test_scs_0215_rateLimitValues

  • Verifies specific rate limit values

  • Checks recommended settings:

    • QPS: 5000

    • Burst: 20000

2. etcd Management
Test_scs_0215_etcdCompaction

  • Validates compaction configuration:

    • Mode: periodic

    • Retention: 8h

Test_scs_0215_etcdBackup

  • Verifies backup CronJobs setup

  • Checks backup configuration:

    • Hourly backups

    • Daily backups

    • Proper paths and schedules

3. Certificate Management
Test_scs_0215_certificateRotation

  • Check_Certificate_Rotation_Configuration:

    • Verifies kubelet certificate rotation settings

    • Validates serverTLSBootstrap and rotateCertificates

  • Check_Certificate_Controller:

    • Confirms cert-manager deployment

    • Validates certificate controller functionality
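As an illustration of the kind of check listed above, the etcd compaction validation could be sketched roughly as follows. This is a minimal sketch, not the code from this PR: the helper names `flagValue` and `checkCompaction` are hypothetical, and it assumes the retention is given as a Go-style duration string such as `8h`.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// flagValue returns the value of a "--name=value" argument, if present.
// (Hypothetical helper, not taken from the PR.)
func flagValue(args []string, name string) (string, bool) {
	prefix := "--" + name + "="
	for _, arg := range args {
		if strings.HasPrefix(arg, prefix) {
			return strings.TrimPrefix(arg, prefix), true
		}
	}
	return "", false
}

// checkCompaction verifies that the etcd command line requests periodic
// compaction with the expected retention. Assumption: both retention
// values are duration strings that time.ParseDuration understands.
func checkCompaction(args []string, wantRetention string) error {
	want, err := time.ParseDuration(wantRetention)
	if err != nil {
		return fmt.Errorf("invalid expected retention %q: %w", wantRetention, err)
	}
	mode, ok := flagValue(args, "auto-compaction-mode")
	if !ok || mode != "periodic" {
		return fmt.Errorf("auto-compaction-mode is %q, want \"periodic\"", mode)
	}
	raw, ok := flagValue(args, "auto-compaction-retention")
	if !ok {
		return fmt.Errorf("auto-compaction-retention is not set")
	}
	got, err := time.ParseDuration(raw)
	if err != nil {
		return fmt.Errorf("cannot parse retention %q: %w", raw, err)
	}
	if got != want {
		return fmt.Errorf("retention is %s, want %s", got, want)
	}
	return nil
}

func main() {
	// Example command line with the settings named in the listing.
	args := []string{"etcd", "--auto-compaction-mode=periodic", "--auto-compaction-retention=8h"}
	if err := checkCompaction(args, "8h"); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("compaction settings OK")
}
```

In a real test the argument list would come from the etcd pod spec, and a non-nil error would be reported through `t.Error`.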

@cah-patrickthiem cah-patrickthiem self-assigned this Aug 28, 2024
@cah-patrickthiem cah-patrickthiem force-pushed the 549-testing-kaas-robustness-features branch from 4e8fc4d to 6d98860 Compare October 16, 2024 13:43
@mbuechse mbuechse linked an issue Nov 4, 2024 that may be closed by this pull request
3 tasks
@cah-patrickthiem cah-patrickthiem force-pushed the 549-testing-kaas-robustness-features branch from 5c2f787 to d0c4d95 Compare November 15, 2024 11:43
@cah-patrickthiem cah-patrickthiem force-pushed the 549-testing-kaas-robustness-features branch from d0c4d95 to cbcca65 Compare November 15, 2024 11:44
@cah-patrickthiem
Contributor Author

For reference, here are the successful test logs from Sonobuoy:

cat results/plugins/scs-kaas-conformance/sonobuoy_results.yaml | yq
name: scs-kaas-conformance
status: passed
meta:
  type: summary
items:
- name: out.json
  status: passed
  meta:
    file: results/global/out.json
    type: file
  items:
  - name: Test_scs_0200_smoke
    status: passed
  - name: Test_scs_0215_requestLimits/Check_Request_Limit_Configuration
    status: passed
  - name: Test_scs_0215_requestLimits
    status: passed
  - name: Test_scs_0215_minRequestTimeout/Check_minRequestTimeout_Configuration
    status: passed
  - name: Test_scs_0215_minRequestTimeout
    status: passed
  - name: Test_scs_0215_eventRateLimit/Check_EventRateLimit_Configuration
    status: passed
  - name: Test_scs_0215_eventRateLimit
    status: passed
  - name: Test_scs_0215_apiPriorityAndFairness/Check_APF_Configuration
    status: passed
  - name: Test_scs_0215_apiPriorityAndFairness
    status: passed
  - name: Test_scs_0215_rateLimitValues/Check_Rate_Limit_Values
    status: passed
  - name: Test_scs_0215_rateLimitValues
    status: passed
  - name: Test_scs_0215_etcdCompaction/Check_Etcd_Compaction_Settings
    status: passed
  - name: Test_scs_0215_etcdCompaction
    status: passed
  - name: Test_scs_0215_etcdBackup/Check_Etcd_Backup_Configuration
    status: passed
  - name: Test_scs_0215_etcdBackup
    status: passed
  - name: Test_scs_0215_certificateRotation/Check_Certificate_Controller
    status: passed
  - name: Test_scs_0215_certificateRotation
    status: passed

[Displaying results...]
sonobuoy results *.tar.gz
Plugin: scs-kaas-conformance
Status: passed
Total: 17
Passed: 17
Failed: 0
Skipped: 0

@cah-patrickthiem
Contributor Author

cah-patrickthiem commented Nov 15, 2024

To make the tests pass on your K8s cluster, you need to apply the following configurations:

  1. API Server Configuration
    Location: /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - command:
    - kube-apiserver
    # Admission Control
    - --enable-admission-plugins=NodeRestriction,EventRateLimit
    - --admission-control-config-file=/etc/kubernetes/admission-config.yaml
    # API Priority
    - --feature-gates=APIPriorityAndFairness=true
    - --enable-priority-and-fairness=true
  2. Admission Configuration
    Location: /etc/kubernetes/admission-config.yaml
# event-ratelimit-config.yaml
kind: Configuration
apiVersion: eventratelimit.admission.k8s.io/v1alpha1
limits:
- burst: 20000
  qps: 5000
  type: Server
  3. etcd Configuration
    Location: /etc/kubernetes/manifests/etcd.yaml
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - command:
    - etcd
    - --auto-compaction-mode=periodic
    - --auto-compaction-retention=8h
  4. Kubelet Configuration
    Location: /var/lib/kubelet/config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
serverTLSBootstrap: true
rotateCertificates: true
  5. Certificate Management
    Install cert-manager:
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.12.0/cert-manager.yaml
  6. etcd Backup CronJobs
    File: etcd-cronjobs.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
 name: etcd-backup-hourly
spec:
 schedule: "0 * * * *"
 jobTemplate:
   spec:
     template:
       spec:
         containers:
         - name: etcd-backup
           image: k8s.gcr.io/etcd:3.5.9-0
           command:
           - /bin/sh
           - -c
           - |
             ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
               --cacert=/etc/kubernetes/pki/etcd/ca.crt \
               --cert=/etc/kubernetes/pki/etcd/server.crt \
               --key=/etc/kubernetes/pki/etcd/server.key \
               snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
           volumeMounts:
           - name: etcd-certs
             mountPath: /etc/kubernetes/pki/etcd
             readOnly: true
           - name: backup
             mountPath: /backup
         volumes:
         - name: etcd-certs
           hostPath:
             path: /etc/kubernetes/pki/etcd
             type: Directory
         - name: backup
           hostPath:
             path: /var/lib/etcd/backup/hourly
             type: DirectoryOrCreate
         restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
 name: etcd-backup-daily
spec:
 schedule: "0 0 * * *"
 jobTemplate:
   spec:
     template:
       spec:
         containers:
         - name: etcd-backup
           image: k8s.gcr.io/etcd:3.5.9-0
           command:
           - /bin/sh
           - -c
           - |
             ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
               --cacert=/etc/kubernetes/pki/etcd/ca.crt \
               --cert=/etc/kubernetes/pki/etcd/server.crt \
               --key=/etc/kubernetes/pki/etcd/server.key \
               snapshot save /backup/etcd-snapshot-$(date +%Y%m%d).db
           volumeMounts:
           - name: etcd-certs
             mountPath: /etc/kubernetes/pki/etcd
             readOnly: true
           - name: backup
             mountPath: /backup
         volumes:
         - name: etcd-certs
           hostPath:
             path: /etc/kubernetes/pki/etcd
             type: Directory
         - name: backup
           hostPath:
             path: /var/lib/etcd/backup/daily
             type: DirectoryOrCreate
         restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
 name: etcd-compaction
spec:
 schedule: "0 */8 * * *"
 jobTemplate:
   spec:
     template:
       spec:
         containers:
         - name: etcd-compaction
           image: k8s.gcr.io/etcd:3.5.9-0
           command:
           - /bin/sh
           - -c
           - |
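             # Both etcdctl calls need the same endpoint and TLS flags.
             export ETCDCTL_API=3
             FLAGS="--endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"
             # Read the current revision from the status JSON, then compact up to it.
             REV=$(etcdctl $FLAGS endpoint status --write-out=json | grep -o '"revision":[0-9]*' | grep -o '[0-9][0-9]*$')
             etcdctl $FLAGS compact "$REV"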
           volumeMounts:
           - name: etcd-certs
             mountPath: /etc/kubernetes/pki/etcd
             readOnly: true
         volumes:
         - name: etcd-certs
           hostPath:
             path: /etc/kubernetes/pki/etcd
             type: Directory
         restartPolicy: OnFailure

Location: Apply via kubectl
kubectl apply -f etcd-cronjobs.yaml
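
A check for the hourly and daily backup schedules used in the CronJobs above could look roughly like this. This is a minimal sketch under assumptions: `classifySchedule` and `isNumber` are hypothetical helpers, not functions from this PR, and only plain five-field cron expressions with fixed minute and hour values are recognized.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isNumber reports whether a cron field is a plain non-negative integer.
func isNumber(field string) bool {
	n, err := strconv.Atoi(field)
	return err == nil && n >= 0
}

// classifySchedule returns "hourly", "daily", or "other" for a
// five-field cron expression. Step values such as "*/8" and ranges
// are deliberately treated as "other" in this sketch.
func classifySchedule(schedule string) string {
	f := strings.Fields(schedule)
	if len(f) != 5 || f[2] != "*" || f[3] != "*" || f[4] != "*" {
		return "other"
	}
	switch {
	case isNumber(f[0]) && f[1] == "*":
		return "hourly" // fixed minute, every hour
	case isNumber(f[0]) && isNumber(f[1]):
		return "daily" // fixed minute and hour, every day
	}
	return "other"
}

func main() {
	// Schedules taken from the CronJob manifests above.
	for _, s := range []string{"0 * * * *", "0 0 * * *", "0 */8 * * *"} {
		fmt.Printf("%-12s -> %s\n", s, classifySchedule(s))
	}
}
```

A test could then list the CronJobs in the cluster and require that at least one schedule classifies as hourly and one as daily.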

@cah-patrickthiem cah-patrickthiem marked this pull request as ready for review November 15, 2024 11:56
@cah-patrickthiem
Contributor Author

For reference, I used a self-configured kubeadm cluster to develop these tests.

@mbuechse
Contributor

Impressive! I'm not sure I am competent to review it, but I will give it a shot. About these preconditions, wouldn't it be good to put them into a 'Testing and implementation notes' supplement? This can happen within this same PR.

@mbuechse
Contributor

For reference, I used a self-configured kubeadm cluster to develop these tests.

Impressive again! Just for increased safety, could you please also test on moin once we have the necessary permissions?

@cah-patrickthiem
Contributor Author

Impressive! I'm not sure I am competent to review it, but I will give it a shot. About these preconditions, wouldn't it be good to put them into a 'Testing and implementation notes' supplement? This can happen within this same PR.

I talked about including the configurations with @tonifinger. We came to the same conclusion. Also, my guess is that there will be more configuration snippets from the other tested features in other PRs.

@cah-patrickthiem
Contributor Author

For reference, I used a self-configured kubeadm cluster to develop these tests.

Impressive again! Just for increased safety, could you please also test on moin once we have the necessary permissions?

Sure, I can do that.

		t.Errorf("Required setting %s not found in API server configuration", setting)
	}
}
if !foundSettings["EventRateLimit"] {
Contributor

Can this check also be carried out in the conditional statement in line 67? I think the error can already be logged there.

Contributor Author

Yes, you are right, I changed that.

config, err := clientset.CoreV1().ConfigMaps(loc.namespace).Get(context.Background(), loc.name, metav1.GetOptions{})
if err == nil {
	if data, ok := config.Data[loc.key]; ok {
		if strings.Contains(data, "eventratelimit.admission.k8s.io") {
Contributor

How is this different from the test in Test_scs_0215_requestLimits()?
Isn't this check already handled on line L67?
On second thought, if this does the same check, I think it would be better to handle it only here in Test_scs_0215_minRequestTimeout(), as this is the test function related to "EventRateLimit".

Contributor Author

No, these are different tests. The first test checks if the EventRateLimit is enabled in the API server command line flags, while the second test specifically looks for EventRateLimit configuration in ConfigMaps. The second test is more thorough as it searches multiple locations for the actual configuration details.
The first test only verifies the admission plugin is enabled, while the second test verifies the configuration exists and is properly set up.
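
The distinction between the two checks can be sketched as two separate predicates. This is a simplified illustration, not the code from this PR: `hasAdmissionPlugin` and `hasEventRateLimitConfig` are hypothetical names, and the inputs are plain strings rather than objects fetched from a cluster.

```go
package main

import (
	"fmt"
	"strings"
)

// hasAdmissionPlugin reports whether the named plugin appears in the
// --enable-admission-plugins flag of an API server command line.
func hasAdmissionPlugin(args []string, plugin string) bool {
	const prefix = "--enable-admission-plugins="
	for _, arg := range args {
		if !strings.HasPrefix(arg, prefix) {
			continue
		}
		for _, p := range strings.Split(strings.TrimPrefix(arg, prefix), ",") {
			if strings.TrimSpace(p) == plugin {
				return true
			}
		}
	}
	return false
}

// hasEventRateLimitConfig reports whether a configuration document
// (for example the value of a ConfigMap key) mentions the
// EventRateLimit admission API group.
func hasEventRateLimitConfig(data string) bool {
	return strings.Contains(data, "eventratelimit.admission.k8s.io")
}

func main() {
	args := []string{
		"kube-apiserver",
		"--enable-admission-plugins=NodeRestriction,EventRateLimit",
	}
	config := "kind: Configuration\napiVersion: eventratelimit.admission.k8s.io/v1alpha1\n"

	fmt.Println("plugin enabled:", hasAdmissionPlugin(args, "EventRateLimit"))
	fmt.Println("config present:", hasEventRateLimitConfig(config))
}
```

A cluster can satisfy the first predicate without satisfying the second, which is why both checks are kept as separate tests.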

}

if isKindCluster(clientset) {
	t.Skip("Running on kind cluster - skipping APF test")
Contributor

This must raise an error as well. Otherwise it will go unnoticed by scs-test-runner.py if someone runs the test suite against a kind cluster.

Contributor Author

I thought about the "skip tests if kind cluster" topic again. I lean toward removing those skip statements. The tests should fail if the cluster cannot support the tested features; that is the whole purpose of the tests.

What do you think?

Contributor

I agree. We should not allow any tests to be skipped.

Contributor Author

will change that

}

if isKindCluster(clientset) {
	t.Skip("Running on kind cluster - skipping rate limit values test")
Contributor

Same as above; this applies to all other t.Skip-related conditionals as well.

Contributor Author

changed

Comment on lines 218 to 230
		for k, v := range expectedValues {
			if !strings.Contains(config, fmt.Sprintf("%s: %s", k, v)) {
				allFound = false
				break
			}
		}
		if allFound {
			return
		}
	}
}

t.Error("Recommended rate limit values (qps: 5000, burst: 20000) not found")
Contributor

In the standards, these values are described as RECOMMENDED and furthermore “SHOULD be adapted to the needs of the environment and the expected load”.
We should therefore not regard the values described in the standard as fixed values. Rather, we should check whether we meet them as minimum requirements.

See: ../scs-0215-v1-robustness-features.md#kube-api-rate-limiting-1
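
Treating the recommended values as lower bounds rather than exact matches could be sketched as follows. This is only an illustration of the idea, not code from this PR: `parseLimit` and `meetsMinimums` are hypothetical helpers, and the parser assumes a flat configuration with one `key: value` pair per line.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLimit extracts the integer value of "key: value" from a flat
// YAML-like configuration string. It returns false if the key is
// absent or its value is not an integer.
func parseLimit(config, key string) (int, bool) {
	for _, line := range strings.Split(config, "\n") {
		// Drop indentation and a leading list marker ("- ").
		line = strings.TrimSpace(strings.TrimPrefix(strings.TrimSpace(line), "- "))
		name, value, found := strings.Cut(line, ":")
		if !found || strings.TrimSpace(name) != key {
			continue
		}
		n, err := strconv.Atoi(strings.TrimSpace(value))
		if err != nil {
			return 0, false
		}
		return n, true
	}
	return 0, false
}

// meetsMinimums reports whether every key in minimums is present in
// config with a value greater than or equal to the given minimum.
func meetsMinimums(config string, minimums map[string]int) bool {
	for key, want := range minimums {
		got, ok := parseLimit(config, key)
		if !ok || got < want {
			return false
		}
	}
	return true
}

func main() {
	config := "limits:\n- burst: 20000\n  qps: 5000\n  type: Server\n"
	minimums := map[string]int{"qps": 5000, "burst": 20000}
	fmt.Println("meets minimums:", meetsMinimums(config, minimums))
}
```

Whether the recommended values should be enforced as minimums at all is a question for the standard; the sketch only shows how such a comparison would differ from the exact string match.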

Contributor Author

Yes, you are right. I overlooked that these values are only recommended. For that reason I have removed the test for now. A test for this could be added in the future if needed. The general check for the presence of event rate limits is still there.

}

if isKindCluster(clientset) {
	t.Skip("Running on kind cluster - skipping etcd backup test")
Contributor

Same as above.

Contributor Author

changed

	LabelSelector: "component=etcd",
})
if err != nil || len(pods.Items) == 0 {
	t.Skip("No etcd pods found")
Contributor

Same as above: this must throw an error as well. We currently don't consider anyone using something other than etcd for K8s.
If there is a need to use something other than etcd, the standard itself needs to be updated first.
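
One way to turn the skip into a failure is to convert the listing result into an error and report it from the test. The following is a minimal sketch with hypothetical names (`requireEtcdPods`, `errExample`); in the real test the count and error would come from the pod listing call.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for an error returned by the pod listing call.
var errExample = errors.New("connection refused")

// requireEtcdPods converts the outcome of listing etcd pods into an
// error, so that a missing etcd is reported as a failure, not a skip.
func requireEtcdPods(podCount int, listErr error) error {
	if listErr != nil {
		return fmt.Errorf("listing etcd pods failed: %w", listErr)
	}
	if podCount == 0 {
		return errors.New("no etcd pods found (label selector component=etcd)")
	}
	return nil
}

func main() {
	// In a test, a non-nil result would be passed to t.Fatal instead of t.Skip.
	fmt.Println(requireEtcdPods(0, nil))
	fmt.Println(requireEtcdPods(0, errExample))
	fmt.Println(requireEtcdPods(3, nil))
}
```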

Contributor Author

changed

@cah-patrickthiem cah-patrickthiem removed the request for review from mbuechse December 5, 2024 14:15
Development

Successfully merging this pull request may close these issues.

[Testing] KaaS Robustness features
3 participants