
Large const allocations no longer lead to graceful errors on aarch64 #135952

Open
saethlin opened this issue Jan 23, 2025 · 11 comments
Labels
C-bug - Category: This is a bug.
I-prioritize - Issue: Indicates that prioritization has been requested for this issue.
regression-from-stable-to-nightly - Performance or correctness regression from stable to nightly.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@saethlin
Member

I just spun up an aarch64 (c6g.metal) instance to look at codegen, and I noticed that two of our UI tests do not pass:

    [ui] tests/ui/consts/large_const_alloc.rs
    [ui] tests/ui/consts/promoted_running_out_of_memory_issue-130687.rs

I can reduce one of the failing tests to this:

#![crate_type = "lib"]
pub const FOO: &[u8] = &[0_u8; (1 << 47) - 1];
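(For scale: (1 << 47) - 1 bytes is just under 2^47 bytes, i.e. roughly 128 TiB of zeroed memory.)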

If I am on an x86_64 host targeting aarch64-unknown-linux-gnu, using stable or nightly, this will immediately fail to compile with:

error[E0080]: evaluation of constant value failed
 --> demo.rs:2:25
  |
2 | pub const FOO: &[u8] = &[0_u8; (1 << 47) - 1];
  |                         ^^^^^^^^^^^^^^^^^^^^^ tried to allocate more memory than available to compiler

error: aborting due to 1 previous error

If my host is aarch64-unknown-linux-gnu and I use stable, I also get that diagnostic. But nightly toolchains on an aarch64 host allocate more and more memory and eventually die with SIGKILL.

Bisection points to #135262 as the cause, which suggests that this is a miscompile.

searched nightlies: from nightly-2024-01-01 to nightly-2025-01-23
regressed nightly: nightly-2025-01-13
searched commit range: eb54a50...48a426e
regressed commit: c0f6a1c

bisected with cargo-bisect-rustc v0.6.9

Host triple: aarch64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script script.sh --start 2024-01-01 

This is worth further investigation. Of course it is plausible that a change to the CI configuration would alter the build artifacts from CI, but bisecting a test failure from a stage1 and stage2 toolchain to a CI change is odd.

@saethlin added the C-bug label on Jan 23, 2025
@rustbot added the needs-triage label on Jan 23, 2025
@saethlin added the regression-from-stable-to-nightly and T-compiler labels, and removed needs-triage, on Jan 23, 2025
@rustbot added the I-prioritize label on Jan 23, 2025
@saethlin
Member Author

The difference in behavior here is of course that the PR in question also changed the compiler to use jemalloc, which is now passing MAP_NORESERVE to the big mmap call.

I thought we were already using jemalloc on x86_64; my x86_64 desktop has the same number of logical cores and the same amount of memory as my c6g.metal instance, as well as the same overcommit settings. So it's quite odd that there is a difference in behavior here.

I wonder if this is somehow related to the variable page size on aarch64?
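As a quick probe of that hunch, something like the following (a hypothetical check, not from the issue; it assumes the libc crate as a dependency) prints the runtime page size, which on aarch64 kernels can be 4 KiB, 16 KiB, or 64 KiB rather than the 4 KiB typical of x86_64:

fn main() {
    // SAFETY: sysconf with a valid name constant has no safety requirements.
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) };
    println!("page size: {page_size} bytes");
}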

@saethlin
Member Author

but bisecting a test failure from a stage1 and stage2 toolchain to a CI change is odd.

The difference-maker here is definitely enabling jemalloc = true in config.toml, which is one of the things that linked PR does in our dist aarch64 builds.

@workingjubilee
Member

cc @mrkajetanp

@mrkajetanp
Contributor

two of our UI tests do not pass

We were never able to reproduce this outside of Docker; when running either natively on AArch64 Linux or under qemu AArch64 Linux, those tests pass just fine. That's why we filed this as a Docker issue and skipped the tests in CI. More investigation is indeed needed, thanks for pointing it out!

@workingjubilee
Member

That seems like a valid reason to reject a PR.

@mrkajetanp
Contributor

mrkajetanp commented Jan 24, 2025

To clarify, whatever the original problem is, it was not introduced by the CI changes in question. We were already skipping large_const_alloc in our internal CI runs ~6 months ago, and those runs weren't even using jemalloc.
The test suite simply was not run on AArch64 here prior to the CI changes that moved the dist build.

I'm a little surprised that the CI changes modify the behaviour there. It'd be nice to confirm that it's definitely switching on jemalloc that makes it behave differently, and not something else. What I don't get is: how come the CI changes make the diagnostic not pop up? Does that specific diagnostic behave differently depending on whether jemalloc is switched on or not?

@WaffleLapkin
Member

The diagnostic originates here (this error is later turned into the diagnostic):

let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

Since Bytes::zeroed allocates memory, it can be affected by the allocator.
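In other words, the graceful error depends on the allocator reporting failure up front. A minimal sketch of that pattern (not the actual Bytes::zeroed implementation) looks like this; if the allocator instead hands back an overcommitted mapping, the failure only surfaces later, when the pages are touched:

use std::alloc::{alloc_zeroed, Layout};

fn zeroed(size: usize, align: usize) -> Option<*mut u8> {
    if size == 0 {
        return None; // the interesting case is a huge, non-zero size
    }
    let layout = Layout::from_size_align(size, align).ok()?;
    // SAFETY: `layout` has a non-zero size, as `alloc_zeroed` requires.
    let ptr = unsafe { alloc_zeroed(layout) };
    // A null return is the "graceful" path that becomes the E0080 error.
    if ptr.is_null() { None } else { Some(ptr) }
}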

@saethlin
Member Author

It'd be nice to confirm that it's definitely switching on jemalloc that's making it behave differently and not something else.

I can confirm that it is definitely switching on jemalloc that makes it behave differently. This is quite easy to check in an aarch64 dev environment: just set jemalloc = true or jemalloc = false in your config.toml, then run x test tests/ui/consts/ --force-rerun (changing the jemalloc setting does not invalidate test runs, which seems like yet another bug to me).
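For reference, the toggle lives under the [rust] section of config.toml (a sketch of just the relevant line; the rest of the file is unchanged):

[rust]
jemalloc = true  # set to false to compare against the system allocator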

I've also validated that I can reproduce this regression on my pinebook as well as an EC2 metal instance. I won't be setting up a rustc dev environment on my pinebook to check that the jemalloc setting specifically can toggle between the behaviors.

Do we know exactly what you were seeing from the tests ~6 months ago? Does it line up with the "use all the system's memory then die with SIGKILL" that I am reporting here?

@adamgemmell
Contributor

Do we know exactly what you were seeing from the tests ~6 months ago? Does it line up with the "use all the system's memory then die with SIGKILL" that I am reporting here?

This was just doing an x.py test --stage=1 in GitLab CI as an internal sanity check, not using CI's Docker images. I chalked it up to Docker behaviour, perhaps from not having a memory limit set in the run command, and since there's not much control over how we can run Docker there, I just skipped the tests and moved on with life. I doubt it's related, given your reproduction using jemalloc, and it greatly predates the dist PR. From what I could tell (restricted to running something in the background to print memory usage every second), yes, it basically filled up the available resources and was then killed by the host. I was never able to reproduce it using upstream CI at the time on my own machine.

Not to be confused with skipping the tests here, which is a separate potential issue.

@mrkajetanp
Contributor

I might be wrong, but it seems to me like there's an issue with how said diagnostic gets triggered. That is, in some environments, like aarch64 + jemalloc or the Docker environment we had, the compiler incorrectly thinks that this is an okay thing to do. Those might just be two different ways to trigger the same issue.

@saethlin
Member Author

The diagnostic is a direct response to mmap of a particular size with MAP_NORESERVE failing with ENOMEM. The thing to be debugged here is why that same mmap call with the same overcommit settings and same system memory fails on x86_64 but succeeds on aarch64.

And, hopefully as an outcome of that, a determination of whether this test should be enabled anywhere, and whether we can give the compiler the desired behavior (quick and clean exit) when given an impossible task, without a lot of implementation complexity.
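As a starting point for that debugging, here is a hedged, standalone sketch (assuming the libc crate; not code from rustc) of the kind of request in question. Running it on both hosts should show directly whether the kernel accepts or rejects the reservation:

fn main() {
    // Roughly the size rustc asks for when evaluating the reduced test case.
    let size: usize = (1 << 47) - 1;
    // SAFETY: requesting an anonymous private mapping; we only inspect the
    // return value and never touch the memory, so nothing is committed.
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        // This is the path that feeds the graceful E0080 diagnostic.
        println!("mmap failed: {}", std::io::Error::last_os_error());
    } else {
        // The kernel overcommitted; a compiler in this situation only dies
        // later, once it starts touching the pages.
        println!("mmap succeeded at {ptr:p}");
        // SAFETY: unmapping the region we just mapped.
        let _ = unsafe { libc::munmap(ptr, size) };
    }
}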
