
cl_ext_buffer_device_address #1159

Open
wants to merge 5 commits into main

Conversation

pjaaskel
Contributor

@pjaaskel pjaaskel commented May 7, 2024

The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in device memory, preventing its use in host-side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities.

Version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything is still up for discussion. chipStar is the first client that uses the API.
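
For context, a minimal host-side sketch of the intended usage, based on the names in this proposal as of this revision. A single-device context and the by-value clSetKernelArgDevicePointerEXT form are assumptions here, error handling is omitted, and in a real application the entry point would typically be fetched with clGetExtensionFunctionAddressForPlatform:

#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Sketch only: assumes the device reports cl_ext_buffer_device_address. */
static cl_mem_device_address_ext
make_bda_buffer(cl_context ctx, cl_kernel kernel, size_t size)
{
    cl_int err;

    /* Request a fixed, queryable device-side address for this allocation. */
    cl_mem_properties props[] = {CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT, CL_TRUE, 0};
    cl_mem buf = clCreateBufferWithProperties(ctx, props, CL_MEM_READ_WRITE,
                                              size, NULL, &err);

    /* Query the raw device address; with a single device in the context the
     * returned list has exactly one entry. */
    cl_mem_device_address_ext addr = 0;
    clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT, sizeof(addr), &addr, NULL);

    /* The address can be stored in host-side pointer-based data structures or
     * bound directly to a kernel argument declared as a global pointer. */
    clSetKernelArgDevicePointerEXT(kernel, 0, addr);

    return addr;
}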

@karolherbst
Contributor

Any motivation to get this merged? Or is anything else needed to be discussed before merging? I could also try to bring it up at the CL WG if needed.

@pjaaskel
Contributor Author

Any motivation to get this merged? Or is anything else needed to be discussed before merging? I could also try to bring it up at the CL WG if needed.

Yep, this is still being discussed in the WG. I personally think it's useful as is and shouldn't harm anything if merged, as it even has two implementations now.

@pjaaskel
Contributor Author

Thanks @SunSerega

@SunSerega
Contributor

Alright, and now the problem I found in #1171 is visible here because the Presubmit workflow has been properly rerun.


@pjaaskel
Contributor Author

pjaaskel commented Sep 4, 2024

One thing I'm wondering about is how clCreateSubBuffer should behave when executed on a BDA memory object. Should it fail or succeed? I'm fine with either; I'm just wondering whether some of the behavior needs to be formalized.

CL_MEM_DEVICE_PTR_EXT and CL_MEM_DEVICE_PTRS_EXT come to my mind; those should probably return the address + the sub-buffer offset. At least that's how I have implemented it so far.

Yes, this was the idea. I'll add a mention in the next update.
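
To make the agreed behavior concrete, a small sketch using the CL_MEM_DEVICE_ADDRESS_EXT query name from the current draft and assuming a single-device context; the expectation discussed above is that the sub-buffer address equals the parent address plus the region origin:

#include <assert.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Sketch: 'parent' was created with the CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT
 * property; the origin is chosen to satisfy typical sub-buffer alignment. */
static void check_subbuffer_address(cl_mem parent)
{
    cl_int err;
    cl_buffer_region region = { .origin = 256, .size = 1024 };
    cl_mem sub = clCreateSubBuffer(parent, CL_MEM_READ_WRITE,
                                   CL_BUFFER_CREATE_TYPE_REGION, &region, &err);

    cl_mem_device_address_ext parent_addr = 0, sub_addr = 0;
    clGetMemObjectInfo(parent, CL_MEM_DEVICE_ADDRESS_EXT,
                       sizeof(parent_addr), &parent_addr, NULL);
    clGetMemObjectInfo(sub, CL_MEM_DEVICE_ADDRESS_EXT,
                       sizeof(sub_addr), &sub_addr, NULL);

    /* The behavior described above: sub-buffer address = parent + origin. */
    assert(sub_addr == parent_addr + region.origin);

    clReleaseMemObject(sub);
}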

@pjaaskel
Contributor Author

pjaaskel commented Sep 9, 2024

Updated according to @karolherbst's comments.

@karolherbst karolherbst left a comment

I've already implemented the extension fully in rusticl/mesa (including sharing the same address across devices) and I think it's fine. However, I'd still urge addressing the concerns I have about layered implementations building this on top of Vulkan. I considered those constraints when implementing it, but I think it's better to give clients a way to query whether address sharing across multiple devices is supported.

@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from b8df46b to 1931416 on September 24, 2024
@pjaaskel
Contributor Author

@SunSerega thanks!

@pjaaskel
Contributor Author

@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this might be good to go. Do you have good tests on the rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can be used as a starting point.

@karolherbst
Contributor

@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this might be good to go. Do you have good tests on the rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can be used as a starting point.

I haven't written any more tests and only used the one in PoCL. But I can probably help out with writing tests here.

@kpet kpet left a comment

Thanks for this proposal!

@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from 9ad04fb to edcff73 on December 6, 2024
@pjaaskel
Contributor Author

pjaaskel commented Dec 6, 2024

Thanks @kpet for the feedback. Implemented most of it.

@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from edcff73 to c07ff8f on December 6, 2024
@pjaaskel pjaaskel requested review from karolherbst and kpet December 6, 2024 16:06
@pjaaskel
Contributor Author

pjaaskel commented Dec 6, 2024

After @karolherbst and @kpet are happy with this, we'll implement it in PoCL and @franz will add a CTS pull request. Then we can mark this 1.0.0 and merge, I think.

@pjaaskel
Contributor Author

pjaaskel commented Jan 2, 2025

This extension could also just disallow this behavior if it sounds like too much of an edge case that nobody is going to care about.

Just wanted to point out that this is a real possibility, which implementations supporting both might have to face (as I am now).

This is a good point. I'd just disallow it: if (CG) SVM is supported by the implementation, this extension is rather pointless, since it is meant to be a simplification of CG SVM.

@pjaaskel
Contributor Author

pjaaskel commented Jan 2, 2025

...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

@karolherbst
Contributor

...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

yeah.. I mean, it shouldn't be hard for the impl to simply return the SVM pointer for those BDA allocations, because the cl_mem object wrapping an SVM allocation is probably 100x the amount of work compared to handling this edge case.

The normal host_ptr path can have a different address on the GPU side (e.g. if the host memory couldn't be mapped into the GPU's VM), which I think this extension will also have to clarify, but this guarantee also doesn't exist in the core spec (unless it's an SVM allocation).

@pjaaskel
Contributor Author

pjaaskel commented Jan 3, 2025

Right. I'll add the other option, returning the SVM pointer in this case, in the next specification revision.
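
For reference, a sketch of the case being settled here, assuming a device that supports both coarse-grained SVM and this extension; under the sentence added in the next revision, the queried device address must equal the wrapped SVM host pointer:

#include <assert.h>
#include <stdint.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

static void svm_bda_roundtrip(cl_context ctx, size_t size)
{
    cl_int err;
    void *svm_ptr = clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

    cl_mem_properties props[] = {CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT, CL_TRUE, 0};
    cl_mem buf = clCreateBufferWithProperties(
        ctx, props, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, svm_ptr, &err);

    cl_mem_device_address_ext dev_addr = 0;
    clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT,
                       sizeof(dev_addr), &dev_addr, NULL);

    /* Guarantee discussed here: the device-side address matches host_ptr. */
    assert(dev_addr == (cl_mem_device_address_ext)(uintptr_t)svm_ptr);

    clReleaseMemObject(buf);
    clSVMFree(ctx, svm_ptr);
}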

@pjaaskel
Contributor Author

When adding the sentence about SVM, is this also ready to be marked 1.0.0 and merged (I'm wondering whether I should do it in the same commit)?

@karolherbst
Contributor

No concerns in that regard from my side. I think from a technical perspective it's in good shape to land, though I don't want to rule out that more clarifications might be needed once others implement it as well.

@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from f9d2828 to 6e8cbe4 on January 15, 2025
@pjaaskel
Contributor Author

I cleaned up the commit history and the history description in the docs and bumped the version to 1.0.0. The generated headers look OK. IMHO we could merge this in and update the CTS next.

@karolherbst karolherbst left a comment

Read through it again as I'm updating my implementation. I left a nit and a comment, but the latter shouldn't block this.

_kernel_ ({clEnqueueNDRangeKernel} and {clEnqueueTask}) until the argument
value is changed by a call to {clSetKernelArgDevicePointerEXT} for _kernel_.
The device pointer can only be used for arguments that are declared to be a
pointer to `global` memory allocated with {clCreateBufferWithProperties} with
Contributor

Should it say global and constant, or is this all restricted to global memory? I'm fine either way, though it might be good to point it out if it's not supported for constant memory.

For some hardware/implementations it's more or less the same, so it might be better to be explicit about it before one implementation supports it for constant and another doesn't.

Contributor Author

Hmm. Is it even possible to allocate buffers from the constant space and assign them as arguments?

Contributor

It's a bit complicated. Nvidia GPUs have a hardware path for constant memory which doesn't really allow global addresses; however, in command submission you'd still just upload from a global address, so it might be better not to allow it.

However, modern Nvidia GPUs can do the same with bindless constant buffers where the global address can be used (though the size of the entire access would be needed, but the runtime could handle that internally).

I don't think other hardware has a similar restriction, as it's often simply a global load instruction with a special caching mode.

Though I think I'm leaning towards not allowing it for now, because it might make it a performance trade-off for some implementations.

@pjaaskel pjaaskel Jan 20, 2025

I meant from the OpenCL API perspective. Surely the constant AS can be mapped to whatever physical memory in HW, if wanted.

@karolherbst karolherbst Jan 20, 2025

Surely the constant AS can be mapped to whatever physical memory in HW, if wanted.

It depends. If you make use of Nvidia's const buffers, then the answer is no. Or you add a lot of compiler smartness to make it somehow happen, but that's quite a bit of work. Modern Nvidia GPUs (the last 5 years) can deal with it a bit better, but it's still a performance trade-off nonetheless. It's quite a different thing at the ISA level, and it's a huge perf gain to not use virtual addresses for constant memory at all, as the instructions pulling from constant memory behave more like pulling data from registers than from VRAM.

There are also push constants in other vendors' hardware which could be used to implement the constant AS, and similar restrictions apply to those.

From an API perspective it doesn't matter much, however, and you assign the same type of buffers to kernel arguments that are pointers to the global or constant AS.

Contributor

There is nothing special about global and constant there, as both simply use cl_mem buffers. Constants in kernels are a bit different, as you can also have constant pointers point to them, but to set the value of e.g. a __constant float *tmpF kernel argument, you simply call clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_mem); with an ordinary cl_mem created through clCreateBuffer.
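
For illustration, the core-spec pattern described above (no extension involved): a __constant pointer argument is set with an ordinary buffer and clSetKernelArg.

/* Kernel side:
 *   __kernel void scale(__constant float *tmpF, __global float *out) {
 *       out[get_global_id(0)] *= tmpF[0];
 *   }
 */
#include <CL/cl.h>

static void set_constant_arg(cl_context ctx, cl_kernel kernel, size_t size)
{
    cl_int err;
    /* An ordinary buffer; nothing constant-specific at allocation time. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);

    /* Exactly the same call as for a __global pointer argument. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
}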

Contributor Author

Oh. This is interesting and also a bit weird: doesn't it mean that the constant and global address spaces cannot be disjoint but actually map to the same address space? Otherwise, how can the device arbitrate between them? Some ISAs might even have different instructions for accessing either, and constant could actually be read-only memory in HW.

Contributor

Doesn't it mean that the constant and global address spaces cannot be disjoint but actually map to the same address space? Otherwise, how can the device arbitrate between them?

It depends on a lot of things. As explained above, there are hardware paths a compiler could use to make the constant address space faster, but at least from a GPU programming perspective those things are initialized either from constant data the host sent or from a buffer in the global address space.

But the semantics in the hardware shaders/kernels are entirely disjoint, so if the hw paths are chosen, you can't access the constant AS with global operations directly (you would need to pull the address from somewhere else).

On Nvidia it's e.g. 16 or more buffers of 64 KB each, the index is a 0-based vec2 (index + offset), and it's all bindful (meaning you program the individual slots when launching the shader/kernel, so there is no guarantee that the constant address remains stable across invocations at all).

So most of the constant weirdness is just part of the command submission when launching the kernel, and it can even happen on the GPU independently from the host; e.g. you can write to a global buffer and use it as a hw constant buffer in the next kernel, without the host having to do anything to update the contents of the constant buffer slots, as it all happens on the GPU.

OpenCL implementations might also use those hardware buffers for in-source constants, especially if they are indirectly accessed or are huge tables.

So from an ISA perspective it can look entirely different, while from an API perspective it looks almost the same. However, runtimes could make the global backing storage of a hardware constant buffer available to the kernel, if the compiler needs to access constant data through the global AS (e.g. through an internal driver buffer mapping from constant buffer index to global AS).

There are also instructions which can be hybrid (e.g. on Nvidia there is ld.constant, or bindless constant buffers needing the global address + the size of the buffer, but that's a relatively new feature), i.e. using global addressing but making use of the constant buffer hardware and aggressively caching data.

Of course an implementation can also simply use the hw global AS for both API global and API constant if it doesn't care about the performance benefits (rusticl atm doesn't use the hw constant buffers for kernel arguments, because I haven't gotten to it yet).

Contributor Author

OK, I see now what you mean, thanks. The runtime could optimize in this case: if there is separate constant memory and the kernel arg qualifier is constant, it could move/allocate the buffer into that constant memory, and the ISA in the kernel would then always access the constant space when the HW has disjoint memory for constants.

Contributor

Yeah.

With this extension it's easy to use a BDA-enabled cl_mem object for a constant AS kernel argument, as it would simply use whatever constant address the runtime comes up with.

It would be problematic if this constant address needed to be the same as the global one, or needed to be stable across invocations, because that might be impossible to guarantee when using the more specialized hardware paths.

So with the current wording I don't think there is any issue; it might just be better to explicitly state the promises this extension gives here, and maybe even make sure in the CTS that implementations don't allow more than this extension adds.

Comment on lines 695 to 697
* {CL_INVALID_DEVICE}
** If _properties_ includes {CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT} and there
is at least one device in the context that doesn't support such allocation.
Contributor

What are the conditions when a device "doesn't support such (an) allocation"? Is the only relevant condition whether the device supports cl_ext_buffer_device_address, or is there something else?

If this is the only condition then I think it would be clearer to enumerate it explicitly.

I think this also means that it is impossible to use the buffer device address feature in a multi-device context if any of the devices do not support it. Is that correct, and is this the desired behavior?

Note that for other OpenCL APIs it is usually only an error if none of the devices in the context support the feature, and the error code in this case is usually CL_INVALID_OPERATION rather than CL_INVALID_DEVICE: for example, creating images, creating programs from IL, etc.

Comment on lines 6418 to 6419
returns a list of device addresses for the buffer, one for each
device in the context.
Contributor

Do we need to say anything about the order in which the device addresses are returned? For example, are the device addresses returned in the same order as the list of devices passed to clCreateContext and/or returned by CL_CONTEXT_DEVICES?
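
For illustration, a sketch of such a multi-device query, assuming the ordering is tied to the device list passed to clCreateContext (as the later revision states) and recovered here via CL_CONTEXT_DEVICES:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Sketch: 'buf' was created with CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT in 'ctx'. */
static void print_per_device_addresses(cl_context ctx, cl_mem buf)
{
    cl_uint n = 0;
    clGetContextInfo(ctx, CL_CONTEXT_NUM_DEVICES, sizeof(n), &n, NULL);

    cl_device_id *devs = malloc(n * sizeof(*devs));
    clGetContextInfo(ctx, CL_CONTEXT_DEVICES, n * sizeof(*devs), devs, NULL);

    cl_mem_device_address_ext *addrs = malloc(n * sizeof(*addrs));
    clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT,
                       n * sizeof(*addrs), addrs, NULL);

    /* addrs[i] would be the address of 'buf' as seen by devs[i]. */
    for (cl_uint i = 0; i < n; ++i)
        printf("device %u: 0x%llx\n", i, (unsigned long long)addrs[i]);

    free(devs);
    free(addrs);
}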

Comment on lines 6439 to 6441
** Returned for the {CL_MEM_DEVICE_ADDRESS_EXT} query if
the {cl_ext_buffer_device_address_EXT} is not supported or if the
buffer was not allocated with {CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT}.
Contributor

I'd consider returning CL_INVALID_OPERATION instead of CL_INVALID_VALUE in the case where the buffer was not allocated with CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT. Rationale is: in this scenario the value is valid, it is just invalid in combination with this specific buffer.

* {CL_INVALID_OPERATION} if no devices in the context associated with _kernel_ support
the device pointer.
* {CL_INVALID_ARG_INDEX} if _arg_index_ is not a valid argument index.
* {CL_INVALID_ARG_VALUE} if _arg_value_ specified is not a valid value.
Contributor

It might be good to clarify what is considered "not a valid value", especially given the CL_INVALID_OPERATION error condition above.

Is zero considered a valid value, to set a device pointer to NULL?

This discussion for cl_khr_unified_svm could also be relevant: #1282 (comment)

@pjaaskel
Contributor Author

Good points @bashbaug, I changed these and made it rev 1.0.1. Pinging @karolherbst @franz

the {CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT} property set to {CL_TRUE},
returns a list of device addresses for the buffer, one for each
device in the context in the same order as the list of devices
passed to {clCreateContext}.
Contributor

Now that CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT can be passed if some devices don't support the extension, will this query return NULL or an undefined value for devices not supporting it? Or would one have to filter those out leaving no gaps?

@pjaaskel pjaaskel Jan 29, 2025

I think it has to be "undefined", because 0 might be a valid value on the device, unless we define it not to be.

Contributor

NULL doesn't imply it being 0, just that it's a NULL pointer, so runtimes could return whatever value represents NULL for that device.

Though I don't mind it being undefined either.

Contributor Author

Yes, but then we'd need an API for figuring out what "NULL" is for each device. I think it might be simplest and clearest to leave it undefined (likely 0).

Contributor

Yeah, undefined is fine. One could argue that the NULL issue exists with clSetKernelArgDevicePointerEXT, but one can use clSetKernelArg to set a NULL pointer, so yeah, probably better not to get into the NULL topic if we can avoid it.

Contributor Author

Yes, and I even expanded the semantics to allow invalid pointers of any kind (like C/C++ calls do), as long as they are not dereferenced.

xml/cl.xml
<proto><type>cl_int</type> <name>clSetKernelArgDevicePointerEXT</name></proto>
<param><type>cl_kernel</type> <name>kernel</name></param>
<param><type>cl_uint</type> <name>arg_index</name></param>
<param>const <type>cl_mem_device_address_ext</type>* <name>arg_value</name></param>
Contributor

I don't think this is correct, and this doesn't match the function signature in POCL. I think it should be:

Suggested change
<param>const <type>cl_mem_device_address_ext</type>* <name>arg_value</name></param>
<param><type>cl_mem_device_address_ext</type> <name>arg_value</name></param>

It'd probably be a good idea to generate the headers from this XML file to make sure everything looks as expected.

Contributor

Hmm, the signature in the XML file does seem to match the function signature in Mesa though.

@karolherbst and @pjaaskel, which one is right?

Contributor Author

I believe what you propose, @bashbaug, is right. I don't see the point of passing a pointer to the value instead of passing it by value. @karolherbst, do you agree?

Contributor

I think we had a discussion on this and potentially considered passing a list of addresses, one per device. I assumed this change had been made so that it would be easier to switch to that model if we decide to in a future version of this extension.

Though I don't mind either way.

Contributor Author

Ah, right. But since we don't have the list functionality and instead resorted to calling it for each kernel launch when targeting different devices, maybe it makes sense to revert to passing by value?
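
For clarity, the by-value prototype this would amount to (a sketch only; the authoritative header is generated from xml/cl.xml, and the cl_mem_device_address_ext typedef shown is an assumption):

#include <CL/cl.h>

/* Assumed address type; the real typedef comes from the generated headers. */
typedef cl_ulong cl_mem_device_address_ext;

/* By-value form, matching the PoCL implementation and the suggested XML
 * change; the alternative discussed above would take a
 * 'const cl_mem_device_address_ext *' to allow per-device address lists. */
extern CL_API_ENTRY cl_int CL_API_CALL
clSetKernelArgDevicePointerEXT(cl_kernel kernel,
                               cl_uint arg_index,
                               cl_mem_device_address_ext arg_value);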

The basic cl_mem buffer API doesn't enable access to the underlying raw
pointers in the device memory, preventing its use in host side
data structures that need pointer references to objects. This API
adds a minimal increment on top of cl_mem that provides such
capabilities.
The only content addition since the previous version is

"If the device supports SVM and {clCreateBufferWithProperties} is
 called with a pointer returned by {clSVMAlloc} as its _host_ptr_
 argument, and {CL_MEM_USE_HOST_PTR} is set in its _flags_ argument,
 the device-side address is guaranteed to match the _host_ptr."
* Made it explicit that passing illegal pointers is legal as long
as they are not dereferenced.
* Removed CL_INVALID_ARG_VALUE as a possible error in
clSetKernelArgDevicePointerEXT() as there are no illegal pointer
cases when calling this function. Return CL_INVALID_OPERATION for
clGetMemObjectInfo() if the pointer is not a buffer device
pointer.
* clSetKernelExecInfo() and clSetKernelArgDevicePointerEXT() now
only error out if no devices in the context associated with the kernel
support device pointers.
Converted the clSetKernelArgDevicePointerEXT() address parameter to
a value instead of a pointer to the value.
@pjaaskel pjaaskel force-pushed the cl_ext_buffer_device_address branch from 2ef0476 to 8d13bfa on February 4, 2025
@pjaaskel
Contributor Author

pjaaskel commented Feb 4, 2025

I believe I've now addressed the feedback. @franz please check that this 1.0.2 matches the PoCL implementation and the CTS.

@bashbaug
Contributor

bashbaug commented Feb 4, 2025

Would it be possible to create a draft PR with headers for this extension, generated from the XML file in this PR? Sometimes I've found that that can be a good way to identify XML issues, and we'll need to do this anyhow before merging the CTS changes.

@pjaaskel
Contributor Author

pjaaskel commented Feb 4, 2025

Ah, I was planning to simply attach the generated cl_ext.h but forgot. Would this work:
cl_ext.h.gz

@karolherbst
Contributor

Would it be possible to create a draft PR with headers for this extension, generated from the XML file in this PR? Sometimes I've found that that can be a good way to identify XML issues, and we'll need to do this anyhow before merging the CTS changes.

I can attach a diff, because I'll need one for mesa anyway.

@bashbaug
Contributor

bashbaug commented Feb 6, 2025

Here's a draft PR with header changes generated from the XML file in this PR: KhronosGroup/OpenCL-Headers#273
