Add API to create queue in device memory #284
base: amd-staging
Conversation
FYI @benvanik this is still WIP, but let me know if this works for you and if the cacheable flag works/has benefits.
/AzurePipelines run
Azure Pipelines successfully started running 1 pipeline(s).
Force-pushed from ea66d58 to 9971e7b
Force-pushed from 76a3db2 to fe3307c
@@ -1272,7 +1272,7 @@ void BlitKernel::PopulateQueue(uint64_t index, uint64_t code_handle, void* args,
   std::atomic_thread_fence(std::memory_order_acquire);
   queue_buffer[index & queue_bitmask_] = packet;
   std::atomic_thread_fence(std::memory_order_release);
-  if (core::Runtime::runtime_singleton_->flag().dev_mem_queue() && !queue_->needsPcieOrdering()) {
+  if (queue_->IsDeviceMem() && !queue_->needsPcieOrdering()) {
@saleelk I'm curious if the logic is correct for the original needsPcieOrdering() method you added. Shouldn't this really be:
queue_->needsPcieOrdering()
Meaning we need to change the logic of that call internally?
Force-pushed from fe3307c to 6bbccb0
Force-pushed from 6bbccb0 to 436b64a
Looks good to me. Thank you!
Force-pushed from 1825907 to 51dc622
This builds on a prior change that allowed a user-mode queue's packet buffer to be allocated in device memory: the queue struct is now allocated in device memory as well. This yields additional latency benefits, particularly when dispatches are performed from the GPU itself. An AMD extension API for queue creation is added so callers can specify whether the queue should be created in device memory and whether that memory should be cacheable. It is intended for use by higher-level libraries when creating queues, and has the added benefit of allowing device memory to be selected on a per-queue basis.
Force-pushed from 51dc622 to 4533c8d
 * The queue packet buffer and the queue struct should be allocated in
 * the agent's device memory.
 */
HSA_AMD_QUEUE_FLAG_DEVICE_MEM = (1 << 0),
We should explicitly mention whether it's cached or uncached, e.g. HSA_AMD_QUEUE_HOST = 0, HSA_AMD_QUEUE_DEV_UNCACHED = 1 << 0.
The change looks good except for the flags I mentioned and the question of whether we should expose them. Thanks for fixing the typo I had.
 * Used to indicate if the queue created in device memory should be
 * cacheable.
 */
HSA_AMD_QUEUE_FLAG_CACHEABLE = (1 << 1),
We should probably discuss whether this is even doable at the moment: flushing L2 is costly, and since we are writing one packet at a time (i.e. 64 bytes), we could also run into false-sharing issues.
Adds an API to create a user-mode queue in device memory.
Now the queue struct is also allocated in device memory.
A flag is added to the API to specify whether or not the memory should be cacheable.