Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic panic in async-io thread with esp-sys-svc async TLS example #84

Open
jothan opened this issue Mar 21, 2024 · 41 comments
Open

Sporadic panic in async-io thread with esp-sys-svc async TLS example #84

jothan opened this issue Mar 21, 2024 · 41 comments

Comments

@jothan
Copy link

jothan commented Mar 21, 2024

Code: https://github.com/esp-rs/esp-idf-svc/blob/4fff46bba1be66ae3c6945d3b8bda30a589f6f6b/examples/tls_async.rs

Backtrace:
backtrace.txt

Crate versions:
Cargo.lock

I am running the example on a ESPcam board, with an esp32 (revision v3.0) Xtensa module (Tinker AI ESP32-S, to be exact).

The example usually crashes once or twice after reset before working.

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

[profile.de[v]
debug = true
opt-level = "s"
codegen-units = 1
@Vollbrecht
Copy link

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

@jothan
Copy link
Author

jothan commented Apr 9, 2024

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

To be honest, it seems to depend on code layout. It might trigger in other conditions, I have not managed to narrow it down. It smells like a race condition of some sort.

@notgull
Copy link
Member

notgull commented Apr 9, 2024

cc @ivmarkov Any idea why this would be happening?

@notgull notgull transferred this issue from smol-rs/async-io Apr 10, 2024
@notgull
Copy link
Member

notgull commented Apr 10, 2024

The issue actually appears to be in our locking implementation, specifically OnceCell.

@ivmarkov
Copy link

cc @ivmarkov Any idea why this would be happening?

No idea. Will try to reproduce once I'm back.

@jothan
Copy link
Author

jothan commented Apr 24, 2024

@ivmarkov have you had any luck reproducing this ? Let me know if you need further input and testing from my part.

@ivmarkov
Copy link

Couldn't get to there yet. Sorry about that! Very likely will try to do something related to async-io in the next couple of days.

@npmenard
Copy link

npmenard commented Apr 24, 2024

I am running in the same issue as described by @jothan

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters) but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address?
Interestingly if I add a dummy read before pinning like :

let event_listener = EventListener::new();
let _ = event_listener.is_listening();
pin!(event_listener);

The pinned address will be 32bit aligned pointing to a stack area where all bytes are 0 and the bug doesn't happen.

For reference here are the two assembly dump of the function
not working
working

Hope that helps @ivmarkov

@ivmarkov
Copy link

ivmarkov commented May 1, 2024

@npmenard Thank you for the detailed analysis!

I can reproduce the issue (though I haven't connected with GDB to examine the address), and I confirm that the dummy listener.is_listening(); is fixing the issue for me too - but only in a more trivial case I have. In a more complex setup, I have another crash - something related to thread locals - but let's put that aside for a moment.

(Also, not sure if that matters, but I'm consistently reproducing the issue with opt-levels "s" and "z" (haven't tried with other opt levels yet, will do)).

As for:

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters)

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address?

Even more suspicious. This sounds a bit like a bug in the LLVM xtensa backend.

@MabezDev - sorry for pulling you here, but do you think you and/or Andrei can help? There are assembly dumps attached to the previous comment.
I can also try to simplify and provide a small(er) test case, though with async-io in the picture, it would be an ESP IDF IO example in the end...

@npmenard
Copy link

npmenard commented May 3, 2024

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

It's the address that would be printed when formatting event_listener with {:p} after it was pinned for example :
log::info!( "pin_ev_ptr {:p}", event_listener);
In my case the cpu doesn't generate a trap when loading at an unaligned address. To illustrate see is_listening assembly dump and the registers status when we check self.listener.listener.is_some() (l32i.n a2, a2, 0). Register a2 contains the unaligned address and once the load is executed it will contain 3f when it should be 0.

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by
log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32); are different.

Lastly I am attaching a backtrace showing the code path that writes at the location in stack where ultimatley the Pin will point to. It happens during WindowOverflow8 exception is triggered when setting a thread local variable.

Hopefully this helps a bit. Let me know if there is anything else I can do to help.

@jothan
Copy link
Author

jothan commented May 13, 2024

With the new Rust 1.78 for ESP32 (in pre-release in the rust-build project), debug assertions for std are now enabled and seem to catch a similar alignment problem with an assertion in core::ptr::read::precondition_check.

Full backtrace: backtrace.txt

Partial backtrace:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).                                                                                          
Debug exception reason: BREAK instr                                                                                                                           
Core  0 register dump:                                                                                                                                        
PC      : 0x400f0d06  PS      : 0x00060c36  A0      : 0x800f0ab0  A1      : 0x3ffcdfc0                                                                        
0x400f0d06 - core::panicking::panic_nounwind_fmt::runtime                                                                                                     
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/panicking.rs:91                                                                
A2      : 0x3ffce060  A3      : 0x00000040  A4      : 0x000001c0  A5      : 0x3ffce220                                                                        
A6      : 0x00060b23  A7      : 0x00060b20  A8      : 0x800f0d01  A9      : 0x3ffcdfa0                                                                        
A10     : 0x00000000  A11     : 0x00000040  A12     : 0x000000f8  A13     : 0x3ffce4e0                                                                        
A14     : 0x3fff3fc8  A15     : 0x400f0cf4  SAR     : 0x00000018  EXCCAUSE: 0x00000001                                                                        
0x400f0cf4 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/intrinsics.rs:2797                                                             
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000                                                                        
                                                                                                                                                              
                                                                                                                                                              
Backtrace: 0x400f0d03:0x3ffcdfc0 0x400f0aad:0x3ffcdfe0 0x400f0beb:0x3ffce5c0 0x400ee62d:0x3ffce860 0x400f14bb:0x3ffce880 0x400f148d:0x3ffce8b0 0x400d6e7e:0x3\
ffce8e0 0x400e3afe:0x3ffce930 0x400decc1:0x3ffced20 0x401a1419:0x3ffcf150 0x400d6a52:0x3ffcf170 0x400d479b:0x3ffcf230 0x400d6450:0x3ffcf350 0x400e82f1:0x3ffc\
f5a0 0x400da0c2:0x3ffcf880 0x400da994:0x3ffcf8a0 0x4011b067:0x3ffcf8d0 0x4011b04e:0x3ffcf8f0 0x40120ff3:0x3ffcf910 0x4012cf40:0x3ffcf930                      
0x400f0d03 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1282                                                                
0x400f0aad - core::ptr::read                                                                                                                                  
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1276                                                                
0x400f0beb - async_lock::once_cell::OnceCell<T>::get_or_try_init_blocking                                                                                     
    at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-lock-3.3.0/src/once_cell.rs:441                                                  
0x400ee62d - async_lock::once_cell::OnceCell<T>::get_or_init_blocking         

@jothan
Copy link
Author

jothan commented May 13, 2024

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

@ivmarkov
Copy link

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

Why would OnceCell be doing something weird and moreover, this something only happens with esp idf + xtensa and not on x86 and (likely) arm? I still find it more likely that the root cause is a miscompilation bug in the llvm xtensa backend.

Will open a bug at the xtensa llvm fork shortly btw.

And will test (again) with riscv32imc on the esp idf, though I'm relatively sure the issue did not happen there.

@jothan
Copy link
Author

jothan commented May 13, 2024

@ivmarkov Doing something weird in this case is triggering the miscompilation. I don't doubt your analysis.

@ivmarkov
Copy link

Other than dereferencing a raw pointer, I don't see anything else weird, unfortunately...

@ivmarkov
Copy link

I know there is little to no progress here, for which I apologize. In the meantime, what I found to work for me is setting the version of async-io back to "2.0.0" (as in async-io = "=2.0.0").

We'll resume the work on finding the root cause shortly.

@notgull
Copy link
Member

notgull commented Jun 1, 2024

Does this issue still occur with async-lock v3.4.0?

@jothan
Copy link
Author

jothan commented Jun 1, 2024

@notgull It still panics in the same spot with debug assertions on (Rust 1.78), however, it does not crash when they are off.

@jothan
Copy link
Author

jothan commented Aug 1, 2024

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@ivmarkov
Copy link

ivmarkov commented Aug 1, 2024

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@MabezDev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms (UPDATE: on a second thought, this is likely not true).

@jothan
Copy link
Author

jothan commented Aug 1, 2024

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@MabezDev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms (UPDATE: on a second thought, this is likely not true).

I might give async-io-mini a try, my blocker is mostly my use of smol-hyper (that depends on regular async-io). I am also using some async-io timers, but I can always use some other timers.

@bugadani
Copy link

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

@ivmarkov
Copy link

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

I didn't.
I really feel uncomfortable opening it and providing details myself, as my knowledge in xtensa is exactly 0%.
I keep pushing though.

Would you like to take it upon yourself? Would really appreciate that and you might be the right person, given your xtensa mis-compilations' history, so to say.

@bugadani
Copy link

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}
thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

@ivmarkov
Copy link

ivmarkov commented Sep 11, 2024

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}
thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

Is it?
To me, it sounds suspiciously close to this and the following comments.

It is as if the code generated by the pin! macro results in a mis-aligned pointer address, due to mis-compilation?

pin! generates simply:

$crate::pin::Pin::<&mut _> { __pointer: &mut { $value } }

According to the report I quoted, the above code results in:
a) __pointer containing misaligned address UPDATE: or more likely calling Deref on the Pin type
b) This address seems to be wrong (off by 1 byte)

@ivmarkov
Copy link

Actually @npmenard says something even more interesting:

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by
log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32); are different.

Sounds like Deref on Pin produces bogus code.

@bugadani
Copy link

Well I guess we can blame pin, but that might also be independent of our issue. What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

Whether this is related to an architecture-specific crash... I don't know.

@ivmarkov
Copy link

Well I guess we can blame pin, but that might also be independent of our issue.

I'm not blaming pin. But so far, it seems we can only reproduce the mis-compilation when pin is used. I'm sure the mis-compilation has a bigger blast radius, but so far we have seen it happening when pin is somehow involved.

What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

I would not expect anything else given that Rust arrays are Copy? self-referencing generators that you get from async are certainly not, but I think this point is (hopefully) irrelevant for the issue at hand.

@bugadani
Copy link

Why do you expect &mut array to duplicate the array on stack?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

@ivmarkov
Copy link

ivmarkov commented Sep 11, 2024

Why do you expect &mut array to duplicate the array on stack?

Because the pin! macro does not seem to simply expand to a &mut array?
I find this comment relevant (line 1959). It talks about moving, but since arrays are Copy, what happens is copying instead of moving, I guess. Would you agree?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

Good news - if you can minimize it, you would be our hero! :)

@bugadani
Copy link

So, in my case, the root cause of the crash is:

  • Reactor is a 448 byte struct, with alignment 64
  • OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes
  • The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Now let's see if I can break a non-idf application with this knowledge :)

@ivmarkov
Copy link

So, in my case, the root cause of the crash is:

  • Reactor is a 448 byte struct, with alignment 64

Alignment of 64 bytes? :o How is that even possible? What might require such an alignment?
I think on xtensa, the maximum alignment is 64 bits which is only 8 bytes?

  • OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes
  • The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Yes this is aligned to 64bits, but obviously not to 64 bYtes.

@bugadani
Copy link

bugadani commented Sep 12, 2024

https://doc.rust-lang.org/std/mem/fn.align_of_val.html

Returns the ABI-required minimum alignment of the type of the value that val points to in bytes.

The culprit is ConcurrentQueue, it raises the alignment requirement (with a contained CachePadded type, which is repr(align(64))).

@ivmarkov
Copy link

ivmarkov commented Sep 12, 2024

I think you are getting there!
It might well be, that the LLVM xtensa backend simply does not support such an alignment. Or would that be the linker? Not sure...

@ivmarkov
Copy link

@bugadani Did you already open an issue at https://github.com/espressif/llvm-project ?
Can't see any.

@bugadani
Copy link

I've reported the issue internally

@ivmarkov
Copy link

Not sure what is that internal repo, but... I hope you'll inform us how it is developing :)

@jothan
Copy link
Author

jothan commented Sep 30, 2024

I just read about the atomics-in-psram issue. I am using both cores with PSRAM on an ESP32-cam board. I am boxing some futures on the heap that might end up in PSRAM. I think I'm going to revert to a single threaded executor or disable the second core entirely for stability's sake.

@ivmarkov
Copy link

@bugadani Is there any development in the internal repo that you could share with us all?

@bugadani
Copy link

I don't have any updates.

@bugadani
Copy link

We have identified an issue with LLVM not upholding alignment requirements, and the LLVM team is working on a fix. I don't know if this is the root cause for this issue, but I'll update you with the results after the fix is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants