Validation error & crash on wgpu Vulkan + Windows #6832

Open
ArthurBrussee opened this issue Dec 27, 2024 · 3 comments

@ArthurBrussee

Description
When running my app (https://github.com/ArthurBrussee/brush), training proceeds steadily for a while until the app crashes. The symptoms are hard to pin down, and the crash happens fairly randomly. Just before the crash, the Vulkan validation layer emits a series of errors about semaphores. Most tellingly, some semaphore value appears to be u64::MAX, which Vulkan trips over.

This possibly causes a device loss, after which wgpu crashes, I think because of #6378.

I have not been able to reproduce this on Metal, and I am not sure about Vulkan + Linux.

Extra materials

Log with validation errors
log.txt

Platform
wgpu (reproduces on trunk, 23.0, and 23.1), Windows 11, Vulkan, 4070 on driver 566.36.

@ArthurBrussee
Author

Another issue I reported for an early wgpu 23 version might or might not be related: #6279. If nothing else the bisection there also pointed to some locking behaviour.

It also looks similar to #6323: a compute-heavy workload, and I am getting validation errors of the form

VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c846d320, name = (wgpu internal) Pre Pass, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c846d320[(wgpu internal) Pre Pass]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c846d320, type: 6, name: (wgpu internal) Pre Pass
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8473760, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8473760[]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c8473760, type: 6, name: NULL
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8471e50, name = (wgpu internal) Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8471e50[(wgpu internal) Transit]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
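
To see at a glance which command buffers the validation layer flags, the errors above can be grouped by VUID and debug label. The following is a minimal sketch that assumes the log keeps the line shape quoted above. The names `InUseBuffer`, `parse_line`, and `tally` are hypothetical helpers written for this illustration and are not part of wgpu or the Vulkan SDK.

```rust
use std::collections::BTreeMap;

/// One command buffer reported as "in use" by the validation layer.
/// (Hypothetical helper type, not part of wgpu.)
#[derive(Debug, PartialEq)]
struct InUseBuffer {
    vuid: String,
    handle: String,
    /// Debug label, e.g. "(wgpu internal) Pre Pass"; None when the log shows "[]".
    label: Option<String>,
}

/// Parse a single "Validation Error" line of the shape quoted above.
/// Returns None for lines that do not match (spec quotes, object lists, ...).
fn parse_line(line: &str) -> Option<InUseBuffer> {
    // The VUID sits between "Validation Error: [ " and " ]".
    let vuid = line
        .split("Validation Error: [ ")
        .nth(1)?
        .split(" ]")
        .next()?
        .to_string();
    // The offending buffer is printed as "(VkCommandBuffer 0x...[label])".
    let rest = line.split("(VkCommandBuffer ").nth(1)?;
    let (handle, tail) = rest.split_once('[')?;
    let (label, _) = tail.split_once("])")?;
    Some(InUseBuffer {
        vuid,
        handle: handle.to_string(),
        label: if label.is_empty() { None } else { Some(label.to_string()) },
    })
}

/// Group flagged handles by (VUID, label); unlabeled buffers go under "<unnamed>".
fn tally(log: &str) -> BTreeMap<(String, String), Vec<String>> {
    let mut groups: BTreeMap<(String, String), Vec<String>> = BTreeMap::new();
    for hit in log.lines().filter_map(parse_line) {
        let label = hit.label.unwrap_or_else(|| "<unnamed>".to_string());
        groups.entry((hit.vuid, label)).or_default().push(hit.handle);
    }
    groups
}
```

Feeding the contents of the attached log to `tally` (for example via `std::fs::read_to_string("log.txt")`) would show whether the same few internal command buffers are flagged repeatedly or whether the handles change on every reset.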

@cwfitzgerald
Member

If a single submission runs longer than 60s, you might see that. If that's not the case, I'm not sure what the issue is off the top of my head.

@ArthurBrussee
Author

ArthurBrussee commented Jan 2, 2025

It's definitely not going over 60s. The amount of GPU work is on the order of ~100ms, and putting a submit() after every submit() call still crashes.

I've tried downgrading to 22.1.0, but it still seems to crash. I've also tried adding

wgpu-hal = { version = "22.0.0", features = [
    "device_lost_panic",
    "internal_error_panic",
    "oom_panic",
] }

But the stack trace is still

thread 'tokio-runtime-worker' panicked at C:\Users\A-Bru\.cargo\registry\src\index.crates.io-6f17d22bba15001f\wgpu-22.1.0\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
  Parent device is lost

The stack trace points to wherever the last submit was, or to other similar locations.

Any tips on what to try or how to investigate this would be much appreciated!
