forked from kubeflow/kubeflow
RHOAIENG-14784: tests(odh-notebook-controller): add upgrade test that compares controller does not modify deployed pods during upgrade #430
Open: jiridanek wants to merge 14 commits into opendatahub-io:main from jiridanek:jd_update_test_spike
+22,475
−235
a82e955 add notebook-controller as test dep for odh-notebook-controller upgrade (jiridanek)
029b68d dump of 2.13.0 YAMLs, without suspicious secrets as I don't want to l… (jiridanek)
1986b34 commit README.md about upgrade tests (jiridanek)
c3db827 commit the tests themselves, using ginkgo v1 (jiridanek)
d383c19 add test to makefile, gingko v1 version (jiridanek)
14fb2e7 fixup gomod as I'm now using prometheus and go-cmp directly (jiridanek)
a4dea5b fixup! commit the tests themselves, using ginkgo v1 (jiridanek)
d6bcff1 fixup! commit the tests themselves, using ginkgo v1 (jiridanek)
13abd6d update to latest route crd (jiridanek)
fb92efe NO-JIRA: temporarily comment out proper host name validation in route… (jiridanek)
673c222 fixup! add test to makefile, gingko v1 version (jiridanek)
12fbd31 fixup! dump of 2.13.0 YAMLs, without suspicious secrets as I don't wa… (jiridanek)
fc3b1b9 enable culling controller (jiridanek)
f8e58df extract stability check (jiridanek)
511 changes: 279 additions & 232 deletions
components/odh-notebook-controller/config/crd/external/route.openshift.io_routes.yaml
Large diffs are not rendered by default.
@@ -0,0 +1,90 @@
## Motivation

It is hard to reason about whether a controller change will restart user workbenches, so this test provides an automated check for that.

## Fidelity considerations

Besides this test, we also need to run the [odh-e2e tests](https://github.com/skodjob/odh-e2e), which use the actual controllers built as containers and installed from an OLM bundle on an OpenShift cluster.
That test is much more faithful to a real production deployment, although we only run it on ephemeral, short-lived, single-node clusters.
The advantage of the test here is that it can be run very quickly from the project checkout directory against the current source code.
Running `make envtest` and `go test` is enough; no image builds and no additional setup are required.
This makes it possible to iterate quickly and still have a fairly comprehensive and realistic upgrade test.

# Test data for testing (simulated) operator upgrade

Each directory holds CRs dumped from a particular Red Hat OpenShift AI deployment.
The directories are named after Red Hat OpenShift AI release versions.

| ODH version | RHOAI version | opendatahub-io/kubeflow release |
|-------------|---------------|---------------------------------|
|             | 2.13          |                                 |
|             |               |                                 |

The test deploys the (current) CRDs, then deploys (old) CRs that simulate a lived-in product installation, then starts the (current) controllers and asserts that nothing adverse happened.

The test is meant to catch these failure modes:

1. old CRs cannot be deployed with the current CRDs
   * this means we made an incompatible change to the CRDs, and that change must be reverted
2. the current operator does not run correctly with old CRs
   * the incompatible change in the operator must be reverted and redone correctly
3. the current operator modifies old CRs in an impermissible way, for example by modifying a statefulset spec, which leads to a notebook pod restart
   * the incompatible change in the operator must be reverted and redone correctly

This way we test the upgrade from any old version to the current (in-development) version.
No supported upgrade path should then surprise us, because we have tested a superset of them.
We remove an older version from these tests only if the upgrade from it is both unsupported and actually not working correctly.

We even test upgrades that are not supported according to the various vendor support policies.
If such an upgrade can actually be performed correctly, that's fine; if not, we remove the data directory, delete the test,
and leave a note in the README explaining why the upgrade does not work.

## More on the envtest environment

These tests run in the Kubernetes envtest environment, a minimal Kubernetes setup consisting of just etcd and kube-apiserver.
Notably, there is no kubelet and there are no other controllers; not even the [kube-controller-manager](https://kubernetes.io/docs/reference/command-line-tools-reference/kube-controller-manager) is present.
OpenShift extensions are not present either, so we need to apply our own Route CRD.

Since the kubelet is missing, no actual containers are started when things get deployed to the kube-apiserver.
Furthermore, since kube-controller-manager is missing, even the standard controllers that create pods from deployments or statefulsets are absent.
This means that any deployed statefulset will lie untouched, and no pod resources are created from it.

## Obtaining test data

Use the commands given below to dump an existing namespace on a working, deployed cluster for the purpose of upgrade testing.
Do the setup in the Dashboard UI; we want to simulate a setup that a user might conceivably create.

1. in the DSCI, set certificates to Managed
2. create the default DSC
3. in the dashboard, work as the `developer` user and create a `developer` namespace
4. create a Data Connection, using the [AWS example credentials](https://docs.aws.amazon.com/STS/latest/APIReference/API_GetAccessKeyInfo.html)
   * Id: `AKIAIOSFODNN7EXAMPLE`
   * Key: `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY`
5. still as `developer`, spawn a Workbench

## Dumping command

Here are the commands that dump the contents of your `developer` namespace into YAML files for use in these tests.
When adding a new RHOAI version, place the dump into `data/rhoai-version` and commit the files.

```shell
# dumps all resources in the `developer` namespace to YAML files
kubectl api-resources --verbs=list --namespaced -o name > api-resources.txt
# skip events and package manifests, then list the names of all remaining resources
cat api-resources.txt | \
  grep -v '^events$' | grep -v '^events.events.k8s.io$' | grep -v '^packagemanifests.packages.operators.coreos.com$' | \
  xargs -n 1 kubectl get -o=name -n developer > resource-names.txt
# write each resource to <resource-type>/<name>.yaml
cat resource-names.txt | while read -r line; do
  mkdir -p "$(dirname "${line}")"
  kubectl get -o=yaml -n developer "${line}" > "${line}.yaml"
done
```

## Audit logging

In `data/configs` there is an audit policy for enabling audit logging in kube-apiserver.

One useful `jq` filter drops read and create events from the audit log, leaving only the remaining verbs (such as update, patch, and delete):

```jq
select(
    .verb != "get" and .verb != "watch" and .verb != "list" and .verb != "create"
)
```

if we don't want to do this, I can put the test directly under components/ as a separate component; but I'd still like to do it this way, so if there's no opposition to the general direction, I'd like to go test it out in cpaas