Enable CPU power management by default for libvirt compute #597

@gibizer gibizer commented Nov 16, 2023

Depends-On: openstack-k8s-operators/openstack-operator#591 (to pick up a new tcib) (merged)
Implements: https://issues.redhat.com/browse/OSPRH-83

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/8b6fd928792e4e1da85e7ee48b3f8e78

✔️ nova-operator-content-provider SUCCESS in 1h 41m 48s
✔️ nova-operator-kuttl SUCCESS in 39m 07s
nova-operator-tempest-multinode FAILURE in 1h 26m 15s

gibizer commented Nov 16, 2023

The nova-compute service fails to start :/

2023-11-16 10:42:42.444 2 ERROR oslo_service.service [None req-56dbf76c-524c-455d-9c64-d3474509e8d0 - - - - - -] Error starting thread.: nova.exception.InvalidConfiguration: '[compute]/cpu_dedicated_set' is mandatory to be set if '[libvirt]/cpu_power_management' is set.Please provide the CPUs that can be pinned or don't use the power management if you only use shared CPUs.
2023-11-16 10:42:42.444 2 ERROR oslo_service.service Traceback (most recent call last):
2023-11-16 10:42:42.444 2 ERROR oslo_service.service   File "/usr/lib/python3.9/site-packages/oslo_service/service.py", line 806, in run_service
2023-11-16 10:42:42.444 2 ERROR oslo_service.service     service.start()
2023-11-16 10:42:42.444 2 ERROR oslo_service.service   File "/usr/lib/python3.9/site-packages/nova/service.py", line 162, in start
2023-11-16 10:42:42.444 2 ERROR oslo_service.service     self.manager.init_host(self.service_ref)
2023-11-16 10:42:42.444 2 ERROR oslo_service.service   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 1608, in init_host
2023-11-16 10:42:42.444 2 ERROR oslo_service.service     self.driver.init_host(host=self.host)
2023-11-16 10:42:42.444 2 ERROR oslo_service.service   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 831, in init_host
2023-11-16 10:42:42.444 2 ERROR oslo_service.service     libvirt_cpu.power_down_all_dedicated_cpus()
2023-11-16 10:42:42.444 2 ERROR oslo_service.service   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/cpu/api.py", line 122, in power_down_all_dedicated_cpus
2023-11-16 10:42:42.444 2 ERROR oslo_service.service     raise exception.InvalidConfiguration(msg)
2023-11-16 10:42:42.444 2 ERROR oslo_service.service nova.exception.InvalidConfiguration: '[compute]/cpu_dedicated_set' is mandatory to be set if '[libvirt]/cpu_power_management' is set.Please provide the CPUs that can be pinned or don't use the power management if you only use shared CPUs.

@SeanMooney

This is a bug in nova which we should fix.
It should be a no-op when cpu_dedicated_set is not defined.
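
The no-op behaviour asked for here can be sketched roughly as follows. This is only an illustration, not nova's actual code: the function signature, the `power_down` callback and the argument names are hypothetical, and the real `power_down_all_dedicated_cpus` reads its settings from the service configuration instead of taking arguments.

```python
def power_down_all_dedicated_cpus(power_management_enabled, dedicated_cpus, power_down):
    """Power down every dedicated CPU and return the set that was powered down.

    Hypothetical sketch: an empty or unset dedicated CPU set is treated as
    "nothing to do" instead of raising a configuration error.
    """
    if not power_management_enabled:
        return set()

    powered_down = set()
    # With no dedicated CPUs the loop body never runs, so startup continues.
    for cpu in sorted(dedicated_cpus or ()):
        power_down(cpu)
        powered_down.add(cpu)
    return powered_down
```

With this shape, enabling power management on a host that only has shared CPUs returns an empty set instead of aborting service startup.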

gibizer commented Nov 16, 2023

Filed an upstream nova bug to relax the startup config check https://bugs.launchpad.net/nova/+bug/2043707

gibizer commented Nov 22, 2023

Waiting for the upstream fix to land in Antelope:
Allow enabling cpu_power_management with 0 dedicated CPUs

gibizer commented Dec 5, 2023

The upstream nova fix merged in 2023.1 [1]; we just need to wait for the tcib image to be built (we need a tcib newer than 03a7b1adc9a02f73bcd7593e55c1d943).

[1] https://review.opendev.org/c/openstack/nova/+/901660

gibizer commented Dec 7, 2023

We have a fresh nova-compute image to try https://quay.io/repository/podified-antelope-centos9/openstack-nova-compute/manifest/sha256:53728912e768f56b124c39955e322e61fa54fc5aa2f701535ae13108a11ade2b

gibizer commented Dec 7, 2023

We need a dataplane-operator bump to pick up the new container image as the default. Bumping it in openstack-k8s-operators/openstack-operator#591

gibizer commented Dec 8, 2023

[root@edpm-compute-0 ~]# grep powered /var/log/containers/nova/nova-compute.log
2023-12-08 09:43:19.314 2 DEBUG nova.virt.libvirt.cpu.api [None req-2d49bef8-ecde-4bf8-8083-3e5584395458 - - - - - -] Cores powered down : set() power_down_all_dedicated_cpus /usr/lib/python3.9/site-packages/nova/virt/libvirt/cpu/api.py:123

Also visible in CI https://review.rdoproject.org/zuul/build/41b8ad4d2bdf42c2a2fc35361ea4d092/log/controller/ci-framework-data/logs/192.168.122.100/log/containers/nova/nova-compute.log#2164
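
A check like the grep above could be scripted along these lines. This is a minimal sketch: the helper name is made up for illustration, and it assumes a non-empty core set is logged in Python's set notation (for example `{2, 3}`), which this thread does not show.

```python
import ast
import re

# Matches the debug message emitted from power_down_all_dedicated_cpus.
POWERED_DOWN = re.compile(
    r"Cores powered down : (?P<cores>.+?) power_down_all_dedicated_cpus"
)


def powered_down_cores(log_line):
    """Return the set of powered-down cores in a log line, or None if absent."""
    match = POWERED_DOWN.search(log_line)
    if match is None:
        return None
    text = match.group("cores")
    if text == "set()":
        # Empty set: power management is on but there are no dedicated CPUs.
        return set()
    # Assumption: non-empty sets are logged as a Python set literal.
    return set(ast.literal_eval(text))
```

An empty set returned for the line quoted above means the service started with power management enabled and had no dedicated CPUs to power down.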

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/e0e8b8f3eff241da966c7aecc9d61380

✔️ nova-operator-content-provider SUCCESS in 2h 26m 24s
✔️ nova-operator-kuttl SUCCESS in 39m 05s
nova-operator-tempest-multinode FAILURE in 2h 03m 58s

gibizer commented Dec 8, 2023

recheck
The job timed out with this test still in progress:

{1} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic [] ... inprogress

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/65da5c3479a94c44b59b85ae4f11d5a9

✔️ nova-operator-content-provider SUCCESS in 2h 15m 01s
✔️ nova-operator-kuttl SUCCESS in 37m 40s
nova-operator-tempest-multinode FAILURE in 1h 56m 55s

gibizer commented Dec 8, 2023

recheck

Details: Host list [{'zone': 'nova', 'host_name': 'compute-1.ci-rdo.local'}, {'zone': 'nova', 'host_name': 'compute-0.ci-rdo.local'}] is shorter than min_compute_nodes. Did a compute worker not boot correctly?

Spot on, tempest. The third compute cannot connect to the message bus:

2023-12-08 13:11:42.521 2 ERROR oslo.messaging._drivers.impl_rabbit [None req-eba3ba3c-6dd7-400c-bcb1-67df26e778ff - - - - - -] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 1.0 seconds): OSError: [Errno 113] EHOSTUNREACH

https://review.rdoproject.org/zuul/build/0854d0debc304970b42a17dfc108eb47/log/controller/ci-framework-data/logs/192.168.122.102/log/containers/nova/nova-compute.log

This is a new type of instability: two of the three computes can connect, but the third cannot.
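
The comparison behind the tempest message can be illustrated like this. It is a sketch only, not tempest's implementation: the helper name is hypothetical, and `min_compute_nodes` is taken to be 3 because this job deploys three computes.

```python
def missing_compute_count(hosts, min_compute_nodes):
    """Return how many compute hosts are missing (0 when enough are listed)."""
    return max(0, min_compute_nodes - len(hosts))


# Host list copied from the failure details above.
hosts = [
    {"zone": "nova", "host_name": "compute-1.ci-rdo.local"},
    {"zone": "nova", "host_name": "compute-0.ci-rdo.local"},
]

# Assumed value for this job: three computes are expected.
missing = missing_compute_count(hosts, min_compute_nodes=3)
```

Here `missing` is 1, which matches the single compute that could not reach the message bus.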

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/5c3fcbea01df4af58b452198ed442ba0

✔️ nova-operator-content-provider SUCCESS in 2h 12m 50s
✔️ nova-operator-kuttl SUCCESS in 38m 58s
nova-operator-tempest-multinode FAILURE in 1h 54m 36s

gibizer commented Dec 9, 2023

recheck
See whether tempest runs faster than 49 minutes over the weekend, or whether the overall deployment is faster then.

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/e940d80a99854fc0afea73235c8ba237

✔️ nova-operator-content-provider SUCCESS in 2h 10m 52s
✔️ nova-operator-kuttl SUCCESS in 35m 51s
nova-operator-tempest-multinode FAILURE in 1h 53m 05s

@gibizer gibizer force-pushed the enable-power-management branch from d526ba1 to 3786c38 Compare December 11, 2023 08:44

@mrkisaolamb mrkisaolamb left a comment

lgtm

openshift-ci bot commented Dec 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gibizer, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [gibizer,mrkisaolamb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 3d19fcc into openstack-k8s-operators:main Dec 12, 2023
2 checks passed