Use `multi_draw_indirect_count` where available, in preparation for two-phase occlusion culling. #17211

pcwalton · 2025-01-07T06:24:19Z

This commit allows Bevy to use multi_draw_indirect_count for drawing meshes. The multi_draw_indirect_count feature works just like multi_draw_indirect, but it takes the number of indirect parameters from a GPU buffer rather than specifying it on the CPU.

Currently, the CPU constructs the list of indirect draw parameters with the instance count for each batch set to zero, uploads the resulting buffer to the GPU, and dispatches a compute shader that bumps the instance count for each mesh that survives culling. Unfortunately, this is inefficient when we support multi_draw_indirect_count. Draw commands corresponding to meshes for which all instances were culled will remain present in the list when calling multi_draw_indirect_count, causing overhead. Proper use of multi_draw_indirect_count requires eliminating these empty draw commands.

To address this inefficiency, this PR makes Bevy fully construct the indirect draw commands on the GPU instead of on the CPU. Instead of writing instance counts to the draw command buffer, the mesh preprocessing shader now writes them to a separate indirect metadata buffer. A second compute dispatch, known as the build indirect parameters shader, runs after mesh preprocessing and converts the indirect draw metadata into actual indirect draw commands for the GPU. The build indirect parameters shader operates on a batch at a time, rather than an instance at a time, so each thread writes either 0 or 1 indirect draw commands. This simplifies the existing logic in mesh_preprocessing, which has to special-case the first mesh in each batch. The build indirect parameters shader emits draw commands in a tightly packed manner, enabling maximally efficient use of multi_draw_indirect_count.
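
As a rough illustration of the compaction described above, here is a CPU-side model in Rust. It is only a sketch: the struct and field names (`IndirectBatchMetadata`, `surviving_instances`) are hypothetical and are not the names used in the patch, and the real work happens in a compute shader where each thread would presumably reserve its output slot atomically rather than pushing to a vector.

```rust
/// Hypothetical per-batch record, standing in for one entry of the
/// indirect metadata buffer that mesh preprocessing writes to.
#[derive(Clone, Copy, Debug)]
struct IndirectBatchMetadata {
    index_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
    /// Number of instances in this batch that survived culling.
    surviving_instances: u32,
}

/// Indexed indirect draw arguments, laid out as graphics APIs expect them.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct DrawIndexedIndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

/// Models the second pass: one loop iteration per batch (one "thread"),
/// each emitting either zero or one draw command. Returns the tightly
/// packed command list plus the draw count that would live in a GPU buffer.
fn build_indirect_parameters(
    metadata: &[IndirectBatchMetadata],
) -> (Vec<DrawIndexedIndirectArgs>, u32) {
    let mut commands = Vec::with_capacity(metadata.len());
    for batch in metadata {
        // Fully culled batches produce no command at all.
        if batch.surviving_instances == 0 {
            continue;
        }
        commands.push(DrawIndexedIndirectArgs {
            index_count: batch.index_count,
            instance_count: batch.surviving_instances,
            first_index: batch.first_index,
            base_vertex: batch.base_vertex,
            first_instance: batch.first_instance,
        });
    }
    let draw_count = commands.len() as u32;
    (commands, draw_count)
}
```

With three batches of which the middle one is fully culled, this model yields two commands and a draw count of 2, so the draw call never sees the empty batch.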

Along the way, this patch switches mesh preprocessing to dispatch one compute invocation per render phase per view, instead of dispatching one compute invocation per view. This is preparation for two-phase occlusion culling, in which we will have two mesh preprocessing stages. In that scenario, the first mesh preprocessing stage must only process opaque and alpha tested objects, so the work items must be separated into those that are opaque or alpha tested and those that aren't. Thus this PR splits out the work items into a separate buffer for each phase. As this patch rewrites so much of the mesh preprocessing infrastructure, it was simpler to just fold the change into this patch instead of deferring it to the forthcoming occlusion culling PR.
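
To make the per-phase split concrete, here is a small Rust sketch of work items kept in a separate buffer per (view, phase) pair, with one dispatch per pair. All names here (`Phase`, `WorkItem`, `WorkItemBuffers`, the `u32` view id) are made up for illustration and do not mirror the types in the patch.

```rust
use std::collections::BTreeMap;

/// Hypothetical phase tags; the real set of render phases may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Phase {
    Opaque,
    AlphaMask,
    Transparent,
}

/// Hypothetical work item: which mesh input to read and where to write.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct WorkItem {
    input_index: u32,
    output_index: u32,
}

/// One work item buffer per (view, phase) pair instead of one per view.
#[derive(Default)]
struct WorkItemBuffers {
    buffers: BTreeMap<(u32, Phase), Vec<WorkItem>>,
}

impl WorkItemBuffers {
    fn push(&mut self, view: u32, phase: Phase, item: WorkItem) {
        self.buffers.entry((view, phase)).or_default().push(item);
    }

    /// Lists the dispatches that would be issued: one per (view, phase),
    /// together with the number of work items each one processes.
    fn dispatches(&self) -> Vec<(u32, Phase, usize)> {
        self.buffers
            .iter()
            .map(|(&(view, phase), items)| (view, phase, items.len()))
            .collect()
    }
}
```

Keeping the buffers separate means a later stage that only cares about opaque and alpha tested objects can simply skip the other buffers rather than filtering a mixed list.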

Finally, this patch changes mesh preprocessing so that it runs separately for indexed and non-indexed meshes. This is because draw commands for indexed and non-indexed meshes have different sizes and layouts. The existing code is actually broken for non-indexed meshes, as it attempts to overlay the indirect parameters for non-indexed meshes on top of those for indexed meshes. Consequently, right now the parameters will be read incorrectly when multiple non-indexed meshes are multi-drawn together. This is a bug fix and, as with the change to dispatch phases separately noted above, was easiest to include in this patch as opposed to separately.
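
For reference, the size difference can be seen with plain Rust structs that follow the standard indirect argument layouts used by the graphics APIs (four 32-bit fields for non-indexed draws, five for indexed draws). The structs below are local stand-ins written for this sketch, not the types used in the patch.

```rust
use std::mem::size_of;

/// Non-indexed indirect draw arguments: 4 x 32 bits = 16 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct DrawIndirectArgs {
    vertex_count: u32,
    instance_count: u32,
    first_vertex: u32,
    first_instance: u32,
}

/// Indexed indirect draw arguments: 5 x 32 bits = 20 bytes.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
struct DrawIndexedIndirectArgs {
    index_count: u32,
    instance_count: u32,
    first_index: u32,
    base_vertex: i32,
    first_instance: u32,
}

/// Byte offset of the `i`-th command in a tightly packed array of `T`.
fn command_offset<T>(i: usize) -> usize {
    i * size_of::<T>()
}
```

Because the strides differ (16 versus 20 bytes), the second non-indexed command starts at byte 16, while code that assumes the indexed layout would look for it at byte 20. That mismatch is why only the first command in a mixed-layout buffer lines up.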

Migration Guide

Systems that add custom phase items now need to populate the indirect drawing-related buffers. See the specialized_mesh_pipeline example for an example of how this is done.


atlv24

this is good stuff! nice!

crates/bevy_pbr/src/render/mesh_preprocess_types.wgsl

atlv24 · 2025-01-12T22:36:01Z

crates/bevy_pbr/src/render/mesh.rs

+                    mesh_index_slice.range.start,
+                    mesh_index_slice.range.end - mesh_index_slice.range.start,
+                ),
+                None => (false, !0, !0),


why !0 here but 0 above?

I switched it to 0, 0. In an earlier version of this patch, I used an index buffer range starting with !0 to indicate that a mesh was non-indexed, but now there's no need to do that as indexed and non-indexed meshes are kept fully separated throughout the pipeline.

atlv24 · 2025-01-12T22:39:22Z

crates/bevy_pbr/src/render/mesh.rs

+            base_output_index,
+            batch_set_index: match batch_set_index {
+                Some(batch_set_index) => u32::from(batch_set_index),
+                None => !0,


Line 79 of build_indirect_params.wgsl checks this value to see whether the batch belongs to a batch set.

atlv24 · 2025-01-12T22:39:40Z

crates/bevy_sprite/src/mesh2d/mesh.rs

+            mesh_index: input_index,
+            base_output_index,
+            batch_set_index: match batch_set_index {
+                None => !0,


Line 79 of build_indirect_params.wgsl checks this value to see whether the batch belongs to a batch set.

atlv24 · 2025-01-12T22:40:43Z

crates/bevy_pbr/src/render/build_indirect_params.wgsl

+
+    // If this batch belongs to a batch set, then allocate space for the
+    // indirect commands in that batch set.
+    if (batch_set_index != 0xffffffffu) {


I guess this is one of the !0s

Yep, it's the batch_set_index value.
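
For readers following the thread, the sentinel convention can be sketched in Rust as below. The helper names and the `NO_BATCH_SET` constant are hypothetical; the only facts taken from the diff are that the CPU side writes `!0` when there is no batch set and that the shader compares against `0xffffffffu`.

```rust
/// `!0` on a `u32` is all bits set, i.e. 0xffff_ffff, which is the value
/// the shader compares against.
const NO_BATCH_SET: u32 = !0;

/// CPU side: turn an optional batch set index into the raw value
/// written to the GPU buffer.
fn encode_batch_set_index(index: Option<u32>) -> u32 {
    index.unwrap_or(NO_BATCH_SET)
}

/// Mirror of the shader-side check: the batch belongs to a batch set
/// only if the raw value is not the sentinel.
fn decode_batch_set_index(raw: u32) -> Option<u32> {
    (raw != NO_BATCH_SET).then_some(raw)
}
```

One consequence of this encoding is that `u32::MAX` itself can never be used as a real batch set index.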

crates/bevy_pbr/src/lib.rs

pcwalton · 2025-01-14T03:39:14Z

Many test failures: https://pixel-eagle.com/project/B25A040A-A980-4602-B90C-D480AB84076D/run/6860/compare/6846

github-actions · 2025-01-14T12:50:33Z

It looks like your PR is a breaking change, but you didn't provide a migration guide.

Could you add some context on what users should update when this change gets released in a new version of Bevy?
It will be used to help write the migration guide for that version. Putting it after a ## Migration Guide heading will help it get automatically picked up by our tooling.

* The retained view key from bevyengine#16942 was insufficient to uniquely identify a shadow cascade when multiple cameras were present. In such cases, the stable ID for a shadow cascade is actually (light entity, camera entity, cascade index), not (light entity, cascade index) as the PR in bevyengine#16942 assumed. This caused failures in the `camera_sub_view` example.
* Sorted phase items didn't push batch sets as they were supposed to. I updated `batch_and_prepare_sorted_render_phase` to do so. This fixes the examples with transparency.
* Unbatchable binned entities didn't push batch sets as they were supposed to. This fixes the `morph_targets` example.
* As the `GpuPreprocessNode` now runs per camera (a necessary change for occlusion culling), it should only run on views associated with the current camera (the camera itself plus the shadow maps). It was running again for every view, causing failures in tests with multiple views like `split_screen`.
* 3D meshes need to be re-extracted if their assets change, so that the first vertex and first index in `MeshInputUniform` are updated. I added a system to do so. Note that this system is somewhat inefficient when meshes change; once `cold-specialization` lands it can be updated to use the asset change infrastructure in that patch to fix the issue. This fixes the `query_gltf_primitives` example.
* `specialized_mesh_pipeline` wasn't allocating indirect work items. I changed the example to do so.
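
The shadow cascade key problem in the first bullet can be illustrated with a short Rust sketch. The `CascadeView` struct and the plain `u32` entity ids are stand-ins invented for this example, not Bevy types.

```rust
use std::collections::HashSet;

/// Stand-in for an entity id; the real type is not a bare integer.
type EntityId = u32;

/// One shadow cascade view, identified by its light, the camera it was
/// generated for, and the cascade index.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct CascadeView {
    light: EntityId,
    camera: EntityId,
    cascade_index: u32,
}

/// Number of distinct keys when the camera is left out of the key.
fn distinct_keys_without_camera(views: &[CascadeView]) -> usize {
    views
        .iter()
        .map(|v| (v.light, v.cascade_index))
        .collect::<HashSet<_>>()
        .len()
}

/// Number of distinct keys when the camera is part of the key.
fn distinct_keys_with_camera(views: &[CascadeView]) -> usize {
    views
        .iter()
        .map(|v| (v.light, v.camera, v.cascade_index))
        .collect::<HashSet<_>>()
        .len()
}
```

With one light, two cameras and cascade index 0 for both, the two-part key collapses the two views into a single entry, while the three-part key keeps them apart.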

pcwalton · 2025-01-14T20:50:29Z

I believe all the regressions are fixed. The ones worth calling out are:

wireframe_2d -- I believe that my PR actually fixes the rendering from main.
texture_atlas -- Locally I can't reproduce the bots' rendering on main. My "before" and "after" are identical to the bot's "after". So I think it's some kind of pre-existing race.
2d_gizmos -- Seems to be a bug fix from main, possibly related to the indexed/non-indexed fix.

pcwalton requested review from IceSentry, tychedelia and Elabajaba January 7, 2025 06:24

pcwalton added the A-Rendering (Drawing game state to the screen), C-Performance (A change motivated by improving speed, memory usage or compile times), and S-Needs-Review (Needs reviewer attention (from anyone!) to move forward) labels Jan 7, 2025

pcwalton added this to the 0.16 milestone Jan 7, 2025

pcwalton force-pushed the multidraw-indirect-count branch from 22c95b3 to bbebd62 Compare January 7, 2025 06:25

pcwalton added 2 commits January 6, 2025 22:34

Clippy police

42cec3c

Doc check police

5021597

BenjaminBrienen added the D-Complex (Quite challenging from either a design or technical perspective. Ask for help!) label Jan 7, 2025

pcwalton requested a review from JMS55 January 12, 2025 21:15

pcwalton added 2 commits January 12, 2025 13:27

Merge remote-tracking branch 'origin/main' into multidraw-indirect-count

18cd789

Rustfmt police

b9a34fc

atlv24 approved these changes Jan 12, 2025

pcwalton added 2 commits January 12, 2025 16:16

Address review comments

a2c64a9

Lint police

6123157

IceSentry approved these changes Jan 14, 2025

IceSentry added the S-Ready-For-Final-Review label (This PR has been approved by the community. It's ready for a maintainer to consider merging it) and removed the S-Needs-Review label (Needs reviewer attention (from anyone!) to move forward) Jan 14, 2025

github-actions bot mentioned this pull request Jan 14, 2025

17211 TheBevyFlock/bevy-example-runner#83

Closed

pcwalton added 2 commits January 13, 2025 18:01

Address review comment

a2aec5a

Merge remote-tracking branch 'origin/main' into multidraw-indirect-count

e614842

pcwalton added the S-Waiting-on-Author label (The author needs to make changes or address concerns before this can be merged) and removed the S-Ready-For-Final-Review label (This PR has been approved by the community. It's ready for a maintainer to consider merging it) Jan 14, 2025

rparrett added the M-Needs-Migration-Guide (A breaking change to Bevy's public API that needs to be noted in a migration guide) label Jan 14, 2025

Merge remote-tracking branch 'origin/main' into multidraw-indirect-count

e3acc61

github-actions bot mentioned this pull request Jan 14, 2025

17211 TheBevyFlock/bevy-example-runner#85

Closed

Ambiguity police

20f6267

alice-i-cecile added the S-Ready-For-Final-Review label (This PR has been approved by the community. It's ready for a maintainer to consider merging it) and removed the S-Waiting-on-Author label (The author needs to make changes or address concerns before this can be merged) Jan 14, 2025

alice-i-cecile added this pull request to the merge queue Jan 14, 2025

Merged via the queue into bevyengine:main with commit 35101f3 Jan 14, 2025
29 checks passed
