Validation error & crash on wgpu Vulkan + Windows #6832

ArthurBrussee · 2024-12-27T13:47:28Z

Description
When running my app (https://github.com/ArthurBrussee/brush), training proceeds steadily for a while until the app crashes. The symptoms are hard to pin down, as it happens fairly randomly. Just before the crash, the Vulkan validation layer spits out a bunch of errors about semaphores. Most tellingly, some semaphore value seems to be u64::MAX, which Vulkan trips over.

This causes a device loss (possibly?) after which wgpu crashes because of #6378, I think.

I have not been able to reproduce this on Metal, not sure about Vulkan + Linux.

Extra materials

Log with validation errors
log.txt

Platform
wgpu (reproduces on trunk, 23.0, and 23.1), Windows 11, Vulkan, 4070 on 566.36.

ArthurBrussee · 2024-12-29T17:52:21Z

Another issue I reported for an early wgpu 23 version might or might not be related: #6279. If nothing else the bisection there also pointed to some locking behaviour.

It also looks similar to #6323 (compute-heavy workload), and I am getting validation errors of the form:

VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c846d320, name = (wgpu internal) Pre Pass, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c846d320[(wgpu internal) Pre Pass]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c846d320, type: 6, name: (wgpu internal) Pre Pass
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8473760, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8473760[]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
    Objects: 2
        [0] 0x282c8473760, type: 6, name: NULL
        [1] 0x282c8100750, type: 25, name: NULL
VUID-vkResetCommandPool-commandPool-00040(ERROR / SPEC): msgNum: -1254218959 - Validation Error: [ VUID-vkResetCommandPool-commandPool-00040 ] Object 0: handle = 0x282c8471e50, name = (wgpu internal) Transit, type = VK_OBJECT_TYPE_COMMAND_BUFFER; Object 1: handle = 0x282c8100750, type = VK_OBJECT_TYPE_COMMAND_POOL; | MessageID = 0xb53e2331 | vkResetCommandPool():  (VkCommandBuffer 0x282c8471e50[(wgpu internal) Transit]) is in use.
The Vulkan spec states: All VkCommandBuffer objects allocated from commandPool must not be in the pending state (https://vulkan.lunarg.com/doc/view/1.3.296.0/windows/1.3-extensions/vkspec.html#VUID-vkResetCommandPool-commandPool-00040)
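
For anyone triaging a long log like this, it can help to tally which object handles the validation messages keep naming. In the excerpt above, all three messages reference the same command pool (0x282c8100750). Below is a minimal sketch using only the Rust standard library; the function and type names are hypothetical, and the parsing assumes the message layout shown in this excerpt (a first line starting with "VUID-", objects listed as "handle = 0x…" before the "| MessageID" part). It is not part of wgpu or the validation layers.

```rust
use std::collections::BTreeMap;

/// One parsed validation message: its VUID and the object handles it lists.
/// (Hypothetical helper type for this sketch.)
#[derive(Debug, PartialEq)]
pub struct ValidationMessage {
    pub vuid: String,
    pub handles: Vec<String>,
}

/// Parse one log line. Returns `None` unless the line is the first line of a
/// validation message, which in the excerpt above always starts with "VUID-".
pub fn parse_validation_line(line: &str) -> Option<ValidationMessage> {
    let line = line.trim_start();
    if !line.starts_with("VUID-") {
        return None;
    }
    // The VUID runs up to the first '(' as in "VUID-...-00040(ERROR / SPEC)".
    let vuid = line.split('(').next()?.trim().to_string();
    // Only scan the part before " | MessageID": the trailing text repeats the
    // command buffer handle, and each object should be counted once.
    let head = line.split(" | ").next().unwrap_or(line);
    let handles = head
        .split("handle = ")
        .skip(1)
        .filter_map(|rest| rest.split(|c: char| c == ',' || c.is_whitespace()).next())
        .map(str::to_string)
        .collect();
    Some(ValidationMessage { vuid, handles })
}

/// Count how many validation messages in `log` name each object handle.
pub fn count_handles(log: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for msg in log.lines().filter_map(parse_validation_line) {
        for handle in msg.handles {
            *counts.entry(handle).or_insert(0) += 1;
        }
    }
    counts
}
```

Feeding the contents of log.txt to `count_handles` and sorting by count would show whether a single command pool accounts for most of the errors, as the excerpt suggests.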

cwfitzgerald · 2025-01-02T20:56:31Z

If a single submission goes longer than 60s, you might see that. If that's not the case, I'm not sure what the issue is off the top of my head.

ArthurBrussee · 2025-01-02T23:50:34Z

It's definitely not going over 60s; the amount of GPU work is on the order of ~100ms, and putting a submit() after every submit() call still crashes.

I've tried downgrading to 22.1.0, but it still seems to crash. I've also tried adding

wgpu-hal = { version = "22.0.0", features = [
    "device_lost_panic",
    "internal_error_panic",
    "oom_panic",
] }

But the stack trace is still

thread 'tokio-runtime-worker' panicked at C:\Users\A-Bru\.cargo\registry\src\index.crates.io-6f17d22bba15001f\wgpu-22.1.0\src\backend\wgpu_core.rs:2314:30:
Error in Queue::submit: Validation Error

Caused by:
  Parent device is lost

With a stack trace pointing to wherever the last submit was, or other similar traces.

If you have any tips on what to try or how to investigate, that would be much appreciated!

github-project-automation bot added this to WebGPU for Firefox Dec 27, 2024

github-project-automation bot moved this to Todo in WebGPU for Firefox Dec 27, 2024
