
Large const allocations no longer lead to graceful errors on aarch64 #135952

Open
saethlin opened this issue Jan 23, 2025 · 11 comments
Labels
C-bug - Category: This is a bug.
I-prioritize - Issue: Indicates that prioritization has been requested for this issue.
regression-from-stable-to-nightly - Performance or correctness regression from stable to nightly.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@saethlin
Member

I just spun up an aarch64 (c6g.metal) instance to look at codegen, and I noticed that two of our UI tests do not pass:

    [ui] tests/ui/consts/large_const_alloc.rs
    [ui] tests/ui/consts/promoted_running_out_of_memory_issue-130687.rs

I can reduce one of the failing tests to this:

#![crate_type = "lib"]
pub const FOO: &[u8] = &[0_u8; (1 << 47) - 1];
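(For scale: (1 << 47) - 1 bytes is just under 2^47 bytes, i.e. roughly 128 TiB of zeroed memory.)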

If I am on an x86_64 host targeting aarch64-unknown-linux-gnu, using stable or nightly, this will immediately fail to compile with:

error[E0080]: evaluation of constant value failed
 --> demo.rs:2:25
  |
2 | pub const FOO: &[u8] = &[0_u8; (1 << 47) - 1];
  |                         ^^^^^^^^^^^^^^^^^^^^^ tried to allocate more memory than available to compiler

error: aborting due to 1 previous error

If my host is aarch64-unknown-linux-gnu and I use stable, I also get that diagnostic. But nightly toolchains on an aarch64 host allocate more and more memory and eventually die with SIGKILL.

Bisection points to #135262 as the cause, which suggests that this is a miscompile.

searched nightlies: from nightly-2024-01-01 to nightly-2025-01-23
regressed nightly: nightly-2025-01-13
searched commit range: eb54a50...48a426e
regressed commit: c0f6a1c

bisected with cargo-bisect-rustc v0.6.9

Host triple: aarch64-unknown-linux-gnu
Reproduce with:

cargo bisect-rustc --script script.sh --start 2024-01-01 

This is worth further investigation. Of course it is plausible that a change to the CI configuration would alter the build artifacts from CI, but bisecting a test failure from a stage1 and stage2 toolchain to a CI change is odd.

@saethlin added the C-bug label on Jan 23, 2025
@rustbot added the needs-triage label on Jan 23, 2025
@saethlin added the regression-from-stable-to-nightly and T-compiler labels, and removed needs-triage, on Jan 23, 2025
@rustbot added the I-prioritize label on Jan 23, 2025
@saethlin
Member Author

The difference in behavior here is of course that the PR in question also changed the compiler to use jemalloc, which is now passing MAP_NORESERVE to the big mmap call.

I thought we were already using jemalloc on x86_64; my x86_64 desktop has the same number of logical cores and the same amount of memory as my c6g.metal instance, as well as the same overcommit settings. So it's quite odd that there is a difference in behavior here.

I wonder if this is somehow related to the variable page size on aarch64?
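As a quick probe of that hunch, something like the following (a hypothetical check, not from the issue; it assumes the libc crate as a dependency) prints the runtime page size, which on aarch64 kernels can be 4 KiB, 16 KiB, or 64 KiB rather than the 4 KiB typical of x86_64:

fn main() {
    // SAFETY: sysconf with a valid name constant has no safety requirements.
    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) };
    println!("page size: {page_size} bytes");
}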

@saethlin
Member Author

but bisecting a test failure from a stage1 and stage2 toolchain to a CI change is odd.

The difference-maker here is definitely enabling jemalloc = true in config.toml, which is one of the things that linked PR does in our dist aarch64 builds.

@workingjubilee
Member

cc @mrkajetanp

@mrkajetanp
Contributor

two of our UI tests do not pass

We were never able to reproduce this outside of Docker; when running either natively on AArch64 Linux or under qemu AArch64 Linux, those tests pass just fine. That's why we filed this as a Docker issue and skipped the tests in CI. More investigation is indeed needed, thanks for pointing it out!

@workingjubilee
Member

That seems like a valid reason to reject a PR.

@mrkajetanp
Contributor

mrkajetanp commented Jan 24, 2025

To clarify, whatever the original problem is, it was not introduced by the CI changes in question. We were already skipping large_const_alloc in our internal CI runs ~6 months ago, and those runs weren't even using jemalloc.
The test suite simply was not run on AArch64 here prior to the CI changes that moved the dist build.

I'm a little surprised that the CI changes modify the behaviour there. It'd be nice to confirm that it's definitely switching on jemalloc that makes it behave differently, and not something else. What I don't get is: how come the CI changes make the diagnostic not pop up? Does that specific diagnostic behave differently depending on whether jemalloc is switched on or not?

@WaffleLapkin
Member

The diagnostic originates here (this error is later turned into the diagnostic):

let bytes = Bytes::zeroed(size, align).ok_or_else(fail)?;

Since Bytes::zeroed allocates memory, it can be affected by the allocator.
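In other words, the graceful error depends on the allocator reporting failure up front. A minimal sketch of that pattern (not the actual Bytes::zeroed implementation) looks like this; if the allocator instead hands back an overcommitted mapping, the failure only surfaces later, when the pages are touched:

use std::alloc::{alloc_zeroed, Layout};

fn zeroed(size: usize, align: usize) -> Option<*mut u8> {
    if size == 0 {
        return None; // the interesting case is a huge, non-zero size
    }
    let layout = Layout::from_size_align(size, align).ok()?;
    // SAFETY: `layout` has a non-zero size, as `alloc_zeroed` requires.
    let ptr = unsafe { alloc_zeroed(layout) };
    // A null return is the "graceful" path that becomes the E0080 error.
    if ptr.is_null() { None } else { Some(ptr) }
}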

@saethlin
Member Author

It'd be nice to confirm that it's definitely switching on jemalloc that's making it behave differently and not something else.

I can confirm that it is definitely switching on jemalloc that makes it behave differently. This is quite easy to check in an aarch64 dev environment: just set jemalloc = true or jemalloc = false in your config.toml, then run x test tests/ui/consts/ --force-rerun (changing the jemalloc setting does not invalidate test runs, which seems like yet another bug to me).
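For reference, the toggle lives under the [rust] section of config.toml (a sketch of just the relevant line; the rest of the file is unchanged):

[rust]
jemalloc = true  # set to false to compare against the system allocator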

I've also validated that I can reproduce this regression on my pinebook as well as an EC2 metal instance. I won't be setting up a rustc dev environment on my pinebook to check that the jemalloc setting specifically can toggle between the behaviors.

Do we know exactly what you were seeing from the tests ~6 months ago? Does it line up with the "use all the system's memory then die with SIGKILL" that I am reporting here?

@adamgemmell
Contributor

Do we know exactly what you were seeing from the tests ~6 months ago? Does it line up with the "use all the system's memory then die with SIGKILL" that I am reporting here?

This was just doing an x.py test --stage=1 in GitLab CI as an internal sanity check, not using CI's Docker images. I chalked it up to Docker behaviour, perhaps from not having a memory limit set in the run command, and since there's not much control over how we can run Docker there, I just skipped the tests and moved on with life. I doubt it's related, given your reproduction using jemalloc, and it greatly predates the dist PR. From what I could tell (restricted to running something in the background to print memory usage every second), yes, it basically filled up the available resources and was then killed by the host. I was never able to reproduce it using upstream CI at the time on my own machine.

Not to be confused with skipping the tests here, which is a separate potential issue.

@mrkajetanp
Contributor

I might be wrong, but it seems to me like there's an issue with how said diagnostic gets triggered. That is, in some environments, like aarch64 + jemalloc or the Docker environment we had, the compiler incorrectly thinks that this is an okay thing to do. Those might just be two different ways to trigger the same issue.

@saethlin
Member Author

The diagnostic is a direct response to mmap of a particular size with MAP_NORESERVE failing with ENOMEM. The thing to be debugged here is why that same mmap call with the same overcommit settings and same system memory fails on x86_64 but succeeds on aarch64.

And, hopefully as an outcome of that, a determination of whether this test should be enabled anywhere, and whether we can give the compiler the desired behavior (quick and clean exit) when given an impossible task, without a lot of implementation complexity.
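As a starting point for that debugging, here is a hedged, standalone sketch (assuming the libc crate; not code from rustc) of the kind of request in question. Running it on both hosts should show directly whether the kernel accepts or rejects the reservation:

fn main() {
    // Roughly the size rustc asks for when evaluating the reduced test case.
    let size: usize = (1 << 47) - 1;
    // SAFETY: requesting an anonymous private mapping; we only inspect the
    // return value and never touch the memory, so nothing is committed.
    let ptr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        // This is the path that feeds the graceful E0080 diagnostic.
        println!("mmap failed: {}", std::io::Error::last_os_error());
    } else {
        // The kernel overcommitted; a compiler in this situation only dies
        // later, once it starts touching the pages.
        println!("mmap succeeded at {ptr:p}");
        // SAFETY: unmapping the region we just mapped.
        let _ = unsafe { libc::munmap(ptr, size) };
    }
}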
