Large const allocations no longer lead to graceful errors on aarch64 #135952
Comments
The difference in behavior here is of course that the PR in question also changed the compiler to use jemalloc. I thought we were already using jemalloc on x86_64; my x86_64 desktop has the same number of logical cores and the same amount of memory as my c6g.metal instance, as well as the same overcommit settings. So it's quite odd that there is a difference in behavior here. I wonder if this is somehow related to the variable page size on aarch64?
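For anyone unfamiliar with why the overcommit settings matter here, the sketch below (not from the issue; the 1 TiB size is an arbitrary assumption) illustrates the failure mode under discussion: on an overcommitting Linux system a huge allocation can appear to succeed, and the process only dies once the pages are actually written, so no clean failure ever reaches the caller. Running it on such a machine will likely get the process OOM-killed.

```rust
use std::alloc::{alloc, Layout};

fn main() {
    // 1 TiB: chosen to exceed the RAM of typical machines (an assumption).
    let size: usize = 1 << 40;
    let layout = Layout::from_size_align(size, 4096).unwrap();

    // SAFETY: `layout` has a non-zero size.
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        // The "graceful" path: the allocator reports failure and the caller
        // can turn it into a diagnostic.
        eprintln!("allocation failed up front");
        return;
    }

    eprintln!("allocation nominally succeeded; touching pages...");
    // If the kernel allowed the reservation, memory is only committed as the
    // pages are written, so this loop is where the OOM killer typically ends
    // the process with SIGKILL.
    for offset in (0..size).step_by(4096) {
        // SAFETY: `offset` is within the allocation obtained above.
        unsafe { ptr.add(offset).write(1) };
    }
    eprintln!("survived writing {size} bytes");
}
```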
The difference-maker here is definitely enabling jemalloc.
cc @mrkajetanp
We were never able to reproduce this outside of Docker; when running either natively on AArch64 Linux or in QEMU-emulated AArch64 Linux, those tests pass just fine. That's why we filed this as a Docker issue and skipped them in CI. More investigation is needed indeed, thanks for pointing it out!
That seems like a valid reason to reject a PR.
To clarify, whatever the original problem is, it was not introduced by the CI changes in question. We were already skipping large_const_alloc in our internal CI runs ~6 months ago, and those runs weren't even using jemalloc. I'm a little surprised that the CI changes modify the behaviour there. It'd be nice to confirm that it's definitely switching on jemalloc that makes it behave differently, and not something else. What I don't get is: how do the CI changes make the diagnostic not pop up? Does that specific diagnostic behave differently depending on whether jemalloc is switched on or not?
The diagnostic originates in the compiler's const-eval allocation path; the error returned there is later turned into the diagnostic.
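Roughly, the pattern in question looks like the self-contained sketch below. The names (`InterpError`, `try_new_allocation`) are stand-ins rather than the compiler's actual internals; the relevant shape is that the constant's backing memory is allocated fallibly, and a failed allocation is mapped to an error that later becomes the user-facing diagnostic.

```rust
use std::alloc::{alloc_zeroed, Layout};

// Stand-in for the interpreter error that eventually becomes the
// "more memory than available to the compiler" diagnostic.
#[derive(Debug)]
enum InterpError {
    MemoryExhausted,
}

// Fallibly allocate zeroed backing memory for a would-be const allocation.
// The important property: a failed allocation is reported as an error the
// caller can surface, instead of aborting the compiler.
fn try_new_allocation(size: usize, align: usize) -> Result<*mut u8, InterpError> {
    let layout =
        Layout::from_size_align(size, align).map_err(|_| InterpError::MemoryExhausted)?;
    // SAFETY: `size` is non-zero at the call site below.
    let ptr = unsafe { alloc_zeroed(layout) };
    if ptr.is_null() {
        Err(InterpError::MemoryExhausted) // -> graceful compile error
    } else {
        Ok(ptr)
    }
}

fn main() {
    // Same order of magnitude as the failing UI test: (1 << 47) - 1 bytes.
    match try_new_allocation((1 << 47) - 1, 8) {
        Err(e) => println!("clean failure: {e:?}"),
        Ok(_) => println!("allocator claimed success; failure can only surface later"),
    }
}
```

Whether the zeroed allocation actually returns null for an absurd request is exactly where the choice of allocator (system malloc vs. jemalloc) and the kernel's overcommit behavior come into play, which is what the rest of this thread is chasing.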
I can confirm that it is definitely switching on jemalloc that makes it behave differently. This is quite easy to check in an aarch64 dev environment: just flip the jemalloc option in the build config. I've also validated that I can reproduce this regression on my Pinebook as well as on an EC2 metal instance, though I won't be setting up a rustc dev environment on my Pinebook to check that the jemalloc setting specifically can toggle between the behaviors. Do we know exactly what you were seeing from the tests ~6 months ago? Does it line up with the "use all the system's memory then die with SIGKILL" that I am reporting here?
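For reference, a minimal sketch of that toggle, assuming the standard bootstrap layout where `config.toml` has a `jemalloc` switch under `[rust]` (the exact key and default may differ between toolchain versions):

```toml
# config.toml for building rustc: controls whether the compiler links jemalloc.
# Building once with `true` and once with `false` is the assumed way to
# compare the two behaviors described above.
[rust]
jemalloc = false
```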
This was just from doing an ordinary test run. Not to be confused with skipping the tests here, which is a separate potential issue.
I might be wrong, but it seems to me like there's an issue with how said diagnostic gets triggered? I.e., in some environments, like aarch64 + jemalloc or the Docker environment we had, the compiler incorrectly thinks that this is an okay thing to do. Those might just be two different ways to trigger the same issue.
The diagnostic is a direct response to the allocation failing inside const-eval. What I hope comes out of this investigation is an understanding of what is actually going on here. And, hopefully as an outcome of that, a determination of whether this test should be enabled anywhere, and whether we can give the compiler the desired behavior (a quick and clean exit) when given an impossible task, without a lot of implementation complexity.
I just spun up an aarch64 (c6g.metal) instance to look at codegen, and I noticed that two of our UI tests do not pass; large_const_alloc is one of them.
I can reduce one of the failing tests to this:
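A reproduction along these lines, modeled on the `large_const_alloc` UI test, exhibits the behavior described below (the exact constant size is an assumption, chosen to be far beyond any plausible amount of RAM on a 64-bit host):

```rust
// Const-evaluating this requires materializing roughly 2^47 bytes, far more
// memory than the host has. A well-behaved compiler should reject it with a
// "more memory than available to the compiler" style error instead of trying
// to actually allocate it. 64-bit targets only: the array length would not
// even fit in usize on 32-bit.
const FOO: () = {
    let _x = [0u8; (1 << 47) - 1];
};

fn main() {
    let _ = FOO;
}
```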
If I am on an x86_64 host targeting aarch64-unknown-linux-gnu, using stable or nightly, this will immediately fail to compile with:
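For context, the graceful error referred to throughout this issue is rustc's const-eval memory-exhaustion diagnostic, approximately (exact wording varies between releases):

```
error[E0080]: evaluation of constant value failed
  ... tried to allocate more memory than available to compiler
```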
If my host is aarch64-unknown-linux-gnu using stable, I also get that diagnostic. But nightly toolchains on an aarch64 host allocate more and more memory, eventually dying with SIGKILL.
Bisection points to #135262 as the cause, which suggests that this is a miscompile.
searched nightlies: from nightly-2024-01-01 to nightly-2025-01-23
regressed nightly: nightly-2025-01-13
searched commit range: eb54a50...48a426e
regressed commit: c0f6a1c
bisected with cargo-bisect-rustc v0.6.9
Host triple: aarch64-unknown-linux-gnu
Reproduce with:
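A typical cargo-bisect-rustc invocation for that nightly range would look something like the following; `check.sh` here is a hypothetical script that builds the reduced test above and treats "compiler consumes all memory / gets SIGKILLed" as the regression signal:

```sh
cargo bisect-rustc --start=2024-01-01 --end=2025-01-23 --script=./check.sh
```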
This is worth further investigation. Of course it is plausible that a change to the CI configuration would alter the build artifacts produced by CI, but bisecting a test failure that also reproduces with locally built stage1 and stage2 toolchains down to a CI change is odd.