[Other] Yaook SCS cluster debugging #556

Closed
cah-hbaum opened this issue Apr 8, 2024 · 8 comments
Labels
SCS-VP10 Related to tender lot SCS-VP10

Comments

@cah-hbaum
Contributor

This issue collects information, problems, and data from debugging and working with the Yaook SCS cluster. It will be closed when the parent issue is closed.

See #426

@anjastrunk
Contributor

I suggest logging and fixing each bug or problem in a separate issue, as done in #557, and listing these issues in #426 in the "bug fixing" section. @cah-hbaum What do you think?

@cah-hbaum
Contributor Author

No, I think that would be too much overhead for no gain.
I would just log everything here in separate comments and link issues or similar things if they're created externally.

I could have also done this in the separate issues already available for each standard, but most (or rather all) bugs and problems are cluster-related and not specific to a standard.

@cah-hbaum
Contributor Author

cah-hbaum commented Apr 9, 2024

08-04-2024
The virtualized Yaook cluster broke over the weekend. The exact reason isn't known, but the symptom was that multiple OpenStack volumes managed by the OpenStack Cinder CSI driver were not detached correctly. They would just hang around indefinitely. Since our OpenStack policy doesn't allow users to reset volume states, I would have needed to involve our Operations team.
The problem may have stemmed from one of the worker nodes not being in a ready state, so the Ceph instance couldn't run on it, which probably prevented the volumes from being detached.
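
To spot volumes like these before handing them to someone who is allowed to reset their state, a small filter over volume records can help. This is only a sketch: it assumes each record is a dict with `id`, `status`, and `updated_at` (naive ISO timestamp) keys, and the function name, the 30-minute threshold, and the sample data are all made up for illustration.

```python
from datetime import datetime, timedelta


def find_stuck_volumes(volumes, now, max_age=timedelta(minutes=30)):
    """Return IDs of volumes that have been in 'detaching' longer than max_age."""
    stuck = []
    for vol in volumes:
        if vol["status"] != "detaching":
            continue
        # Assumption: updated_at is a naive ISO 8601 timestamp in the same zone as `now`.
        updated = datetime.fromisoformat(vol["updated_at"])
        if now - updated > max_age:
            stuck.append(vol["id"])
    return stuck


# Made-up sample records, not taken from the actual cluster.
sample_volumes = [
    {"id": "vol-a", "status": "detaching", "updated_at": "2024-04-06T10:00:00"},
    {"id": "vol-b", "status": "in-use", "updated_at": "2024-04-06T10:00:00"},
    {"id": "vol-c", "status": "detaching", "updated_at": "2024-04-08T08:55:00"},
]

print(find_stuck_volumes(sample_volumes, now=datetime(2024, 4, 8, 9, 0, 0)))
```

The function only reports candidates; resetting the state would still need someone with the required permissions.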

I tried to reset the Kubernetes cluster with yaook/k8s. This ejected the worker node: the process failed because of problems with two of the master nodes and couldn't finish rejoining the previously bad worker.
The master nodes had problems connecting to different Debian repositories, probably because of high resource usage on the nodes.

After losing a second master node, I decided to reset the cluster completely, meaning deletion of all resources and a new cluster setup.

@cah-hbaum
Contributor Author

cah-hbaum commented Apr 10, 2024

Had some problems with the new cluster: images seemingly couldn't be uploaded, neither from local files nor from a linked location.
This turned out to be a problem with Glance and its secret containing the connection information for Ceph. The secret wasn't copied correctly into the other namespace, so an incorrect key was distributed to Glance, which then couldn't access Ceph.

@cah-hbaum
Contributor Author

Problems are fixed for now (I already applied the fixes on Friday). The problem initially seemed to come from incorrectly created roles for the neutron-ovn-operator. After I fixed them manually, the ovnagents turned out to be the problem. They were created without the status key because it wasn't available in the CRD. I needed to manually update the CRD and fix the ovnagents. After that was done, the cluster was running correctly.
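
Whether a CRD declares a `status` property can be checked mechanically. This is a rough sketch that assumes the CRD has been exported as JSON and loaded into a dict; it follows the standard `spec.versions[].schema.openAPIV3Schema` layout, and the function name and sample CRD below are invented for illustration, not the real OVN agent definition.

```python
def versions_missing_status(crd):
    """Return names of CRD versions whose schema has no top-level 'status' property."""
    missing = []
    for version in crd.get("spec", {}).get("versions", []):
        schema = version.get("schema", {}).get("openAPIV3Schema", {})
        if "status" not in schema.get("properties", {}):
            missing.append(version["name"])
    return missing


# Invented sample CRD with one complete and one incomplete version.
sample_crd = {
    "spec": {
        "versions": [
            {
                "name": "v1",
                "schema": {
                    "openAPIV3Schema": {
                        "properties": {"spec": {}, "status": {}},
                    }
                },
            },
            {
                "name": "v1beta1",
                "schema": {"openAPIV3Schema": {"properties": {"spec": {}}}},
            },
        ]
    }
}

print(versions_missing_status(sample_crd))
```

The check is deliberately simplistic: it ignores schemas that preserve unknown fields and does not look at the `subresources.status` setting.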

@cah-hbaum cah-hbaum moved this from Backlog to Doing in Sovereign Cloud Stack May 15, 2024
@cah-hbaum
Contributor Author

cah-hbaum commented May 21, 2024

Addendum from last week (~15.05.2024):

I tried to set up yaook/k8s in order to test the Kubernetes standards on an independent cluster that isn't used by an overlying setup like yaook/operator.

To do this, I updated my existing yaook/k8s git repository and pulled the latest version available.
This version was released after the so-called core-split, which essentially reworked the structure of the repository as well as the cluster build processes.

With this new version, everything went smoothly until the calico-api-server was supposed to come up. This wasn't possible because the NoSchedule taints had not been removed from the worker nodes. I couldn't find a reason why this was the case, so I removed them manually, which helped finish the setup process.
This setup was then tested via the test script, which went through without problems.
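
The manual taint removal can be prepared with a small helper that turns node objects into `kubectl taint` commands for review. This is a sketch only: it assumes the worker nodes have been exported as JSON and loaded into dicts, and the function name, node name, and taint key in the sample are hypothetical (the taint key that was actually present is not recorded here).

```python
def untaint_commands(nodes, effect="NoSchedule"):
    """Build one 'kubectl taint' removal command per matching taint."""
    commands = []
    for node in nodes:
        name = node["metadata"]["name"]
        # spec.taints may be absent or null on untainted nodes.
        for taint in node.get("spec", {}).get("taints") or []:
            if taint.get("effect") == effect:
                # A trailing '-' tells kubectl to remove the taint.
                commands.append(
                    f"kubectl taint nodes {name} {taint['key']}:{effect}-"
                )
    return commands


# Hypothetical worker nodes; the taint key is a placeholder.
sample_nodes = [
    {
        "metadata": {"name": "worker-0"},
        "spec": {"taints": [{"key": "example.invalid/pending", "effect": "NoSchedule"}]},
    },
    {"metadata": {"name": "worker-1"}, "spec": {}},
]

for cmd in untaint_commands(sample_nodes):
    print(cmd)
```

The commands are printed rather than executed so they can be checked first. Only worker nodes should be passed in, since control-plane nodes usually carry a NoSchedule taint on purpose.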

@cah-hbaum
Contributor Author

Tried to set up the yaook/k8s cluster again last week, since the nodes wouldn't come out of the NotInitialized state. With the help of a colleague, I found out that this is probably a problem with the OpenstackCloudController component. Still, it didn't throw any obvious errors. I'm going to investigate further this week.
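
One way to narrow this down is to list which nodes are still waiting for the cloud controller. The sketch below assumes that the "NotInitialized" state corresponds to the `node.cloudprovider.kubernetes.io/uninitialized` taint, which the kubelet sets when it runs with an external cloud provider and which the cloud controller manager removes once it has initialized the node. That mapping is an assumption, and the function name and sample nodes are hypothetical.

```python
# Taint that an external cloud controller manager removes after initializing a node.
UNINITIALIZED_TAINT = "node.cloudprovider.kubernetes.io/uninitialized"


def uninitialized_nodes(nodes):
    """Return names of nodes that still carry the cloud-provider 'uninitialized' taint."""
    return [
        node["metadata"]["name"]
        for node in nodes
        if any(
            taint.get("key") == UNINITIALIZED_TAINT
            for taint in node.get("spec", {}).get("taints") or []
        )
    ]


# Hypothetical node objects, shaped like 'kubectl get nodes -o json' items.
sample_cluster = [
    {
        "metadata": {"name": "worker-0"},
        "spec": {"taints": [{"key": UNINITIALIZED_TAINT, "value": "true", "effect": "NoSchedule"}]},
    },
    {"metadata": {"name": "worker-1"}, "spec": {}},
]

print(uninitialized_nodes(sample_cluster))
```

If every node shows up in the result, the cloud controller is probably not processing nodes at all; if only some do, the affected nodes are the place to start looking.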

@cah-hbaum
Contributor Author

I would close this issue; the clusters have seemed stable for quite some time now, and other problems are reported in other issues.

@github-project-automation github-project-automation bot moved this from Doing to Done in Sovereign Cloud Stack Jun 25, 2024