Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve CI stability #1843

Merged
merged 1 commit into from
Oct 31, 2023
Merged

Conversation

davidcassany
Copy link
Contributor

@davidcassany davidcassany commented Oct 11, 2023

This is brute force attempt to check if adding this simple no_timer_check kernel parameter is solving the tests flakiness.

[EDIT]

Adds several improvements:

  • Check VM pid on 'EventuallyConnects'. So it does not wait for a
    command to succeed if the underlaying VM crashed and it is not
    running.

  • Does not use '-daemonize' flag of qemu, now it simply runs on the
    background and the stdout and stderr are redirected to vmstdout file.

  • Does not install QEMU, the runner already has a recent QEMU version
    installed. This saves several minutes on each macos job.

  • Fixes some of the stability issues on macOS by disabling hugepages on
    the kernel. This is not supported on macOS.

@davidcassany davidcassany requested a review from a team as a code owner October 11, 2023 11:26
@davidcassany davidcassany marked this pull request as draft October 11, 2023 11:26
@kkaempf
Copy link
Contributor

kkaempf commented Oct 11, 2023

What is the "timer sync issue" ? Can you add a reference to the original bug report please ?

@kkaempf
Copy link
Contributor

kkaempf commented Oct 11, 2023

Ah, #1834 has some background

@davidcassany
Copy link
Contributor Author

Ah, #1834 has some background

Yes sorry, this was just an attempt to verify if this parameter was solving the flaky CI tests. The thing is that this seams to be an issue only appearing in MacOS runners, hence likely to be a problem with hvf and qemu. It is pretty annoying as pretty often we get a kernel panic a boot... 😞

We haven't ever seen it in any other env beyond MacOS github runners. So we actually need to make a PR to check if the issue is solved.

@davidcassany davidcassany force-pushed the fix_apic_timer_errors branch 2 times, most recently from a49ba5b to 73b619d Compare October 30, 2023 07:39
@codecov-commenter
Copy link

codecov-commenter commented Oct 30, 2023

Codecov Report

Attention: 5 lines in your changes are missing coverage. Please review.

Comparison is base (fd92278) 75.33% compared to head (f973c2b) 75.34%.
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1843      +/-   ##
==========================================
+ Coverage   75.33%   75.34%   +0.01%     
==========================================
  Files          67       67              
  Lines        6814     6818       +4     
==========================================
+ Hits         5133     5137       +4     
  Misses       1311     1311              
  Partials      370      370              
Files Coverage Δ
cmd/config/config.go 73.77% <100.00%> (ø)
pkg/config/config.go 96.34% <100.00%> (ø)
pkg/types/v1/config.go 87.02% <100.00%> (ø)
pkg/types/v1/grub.go 84.48% <100.00%> (+3.63%) ⬆️
cmd/build-disk.go 28.78% <66.66%> (ø)
pkg/action/build-disk.go 68.91% <71.42%> (-0.38%) ⬇️

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@davidcassany davidcassany force-pushed the fix_apic_timer_errors branch 22 times, most recently from bb42655 to c0f8fc2 Compare October 31, 2023 00:25
@davidcassany davidcassany force-pushed the fix_apic_timer_errors branch 4 times, most recently from 24a8842 to 9b441ee Compare October 31, 2023 01:13
@davidcassany davidcassany changed the title WIP, checking the included kernel parameter solves the timer sync issues Improve CI stability Oct 31, 2023
@davidcassany
Copy link
Contributor Author

Still a draft as it builds on top of rancher-sandbox/ele-testhelpers#38, hence this should be merged first and then readapt go.mod on this project again.

@davidcassany davidcassany requested a review from frelon October 31, 2023 01:25
@@ -157,7 +157,6 @@ jobs:
- if: ${{ env.ARCH == 'x86_64' }}
name: Run VM script dependencies
run: |
brew update; brew upgrade qemu
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this was added since there was an update to qemu which was not picked up by our runner, which broke the build... Probably not needed now!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I guess we were suffering MacOS runner upgrades. I remember there used to be a very old qemu version when we first started using it, now I see it comes with Qemu v8 by default.

Copy link
Contributor

@frelon frelon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! 🚀

@davidcassany davidcassany force-pushed the fix_apic_timer_errors branch from 9b441ee to aeffeec Compare October 31, 2023 08:35
@davidcassany davidcassany marked this pull request as ready for review October 31, 2023 08:36
@@ -157,7 +157,6 @@ jobs:
- if: ${{ env.ARCH == 'x86_64' }}
name: Run VM script dependencies
run: |
brew update; brew upgrade qemu
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I guess we were suffering MacOS runner upgrades. I remember there used to be a very old qemu version when we first started using it, now I see it comes with Qemu v8 by default.

@@ -6,19 +6,20 @@ SCRIPT=$(realpath -s "${0}")
SCRIPTS_PATH=$(dirname "${SCRIPT}")
TESTS_PATH=$(realpath -s "${SCRIPTS_PATH}/../tests")

: "${ELMNTL_PREFIX:=}"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a prefix to the files so we could consider at some point running in parallel multiple tests on the same machine without overlapping files.

@@ -67,12 +68,12 @@ function start {
;;
esac

[ "hvf" == "${ELMNTL_ACCEL}" ] && accel_arg="-accel ${ELMNTL_ACCEL}" && firmware_arg="-bios ${ELMNTL_FIRMWARE} ${firmware_arg}"
[ "hvf" == "${ELMNTL_ACCEL}" ] && accel_arg="-accel ${ELMNTL_ACCEL}" && firmware_arg="-bios ${ELMNTL_FIRMWARE} ${firmware_arg}" && cpu_arg="-cpu max,-pdpe1gb"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this -pdpe1gb flag surfing the net... Apparently this disables hugepages and for some reason this is not supported in MacOS 🤷🏽‍♂️ I do believe this is the real change that fixes flakytests.

$(GINKGO) run $(GINKGO_ARGS) ./tests/installer
$(GINKGO) run $(GINKGO_ARGS) ./tests/smoke
VM_PID=$$(scripts/run_vm.sh vmpid) go run $(GINKGO) $(GINKGO_ARGS) ./tests/installer
VM_PID=$$(scripts/run_vm.sh vmpid) go run $(GINKGO) $(GINKGO_ARGS) ./tests/smoke
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably there are more elegant ways pass the VM pid to the test code, but so far I opted for this simple solution. This is not strictly required and can be omitted on local runs.

Adds several improvements:

* Check VM pid on 'EventuallyConnects'. So it does not wait for a
  command to succeed if the underlaying VM crashed and it is not
  running.

* Does not use '-daemonize' flag of qemu, now it simply runs on the
  background and the stdout and stderr are redirected to vmstdout file.

* Does not install QEMU, the runner already has a recent QEMU version
  installed. This saves several minutes on each macos job.

* Fixes some of the stability issues on macOS by disabling hugepages on
  the kernel. This is not supported on macOS.

Signed-off-by: David Cassany <[email protected]>
@davidcassany davidcassany force-pushed the fix_apic_timer_errors branch from aeffeec to f973c2b Compare October 31, 2023 08:46
@davidcassany davidcassany merged commit 6290bf3 into rancher:main Oct 31, 2023
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants