Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Potentially unsafe releasing of the GIL in DeviceBuffer #944

Closed
shwina opened this issue Jan 18, 2022 · 12 comments
Closed

[BUG] Potentially unsafe releasing of the GIL in DeviceBuffer #944

shwina opened this issue Jan 18, 2022 · 12 comments
Assignees
Labels
bug Something isn't working inactive-30d inactive-90d Python Related to RMM Python API

Comments

@shwina
Copy link
Contributor

shwina commented Jan 18, 2022

The problem

When allocating memory in the constructor of DeviceBuffer, we release the GIL. See here.

After the allocation, we store a reference to the current memory resource associated with the allocation, so that the memory resource does not get deleted before the DeviceBuffer.

However, as @viclafargue pointed out in an offline discussion, when the GIL is released, another Python thread could change the current memory resource, via a call to set_current_device_resource for example. We can thus end up with a situation where the DeviceBuffer holds a reference to a different memory resource than was used for the allocation. The memory resource that was actually used for the allocation could then be deleted before the DeviceBuffer gets destructed, leading to a segfault during destruction.

Solutions

There are two potential solutions I can think of:

  1. We don't release the GIL in the constructor of DeviceBuffer.

  2. We still release the GIL, but we store a reference to the current memory resource first. Then, we explicitly pass that memory resource to the constructor of the underlying rmm::device_buffer. Something along these lines:

       # Save a reference to the MR and stream used for allocation
       self.mr = get_current_device_resource()
       self.stream = stream
    
       with nogil:
           c_ptr = <const void*>ptr
    
           if size == 0:
               self.c_obj.reset(new device_buffer(self.mr.c_obj.get()))
           elif c_ptr == NULL:
               self.c_obj.reset(new device_buffer(size, stream.view(), self.mr.c_obj.get()))
           else:
               self.c_obj.reset(new device_buffer(c_ptr, size, stream.view(), self.mr.c_obj.get()))
    
               if stream.c_is_default():
                   stream.c_synchronize()

(2) seems preferred, unless I'm overlooking something subtle. Will open a PR to address.

@shwina shwina added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 18, 2022
@shwina shwina added Python Related to RMM Python API and removed ? - Needs Triage Need team to review and classify labels Jan 18, 2022
@shwina shwina self-assigned this Jan 18, 2022
@jakirkham
Copy link
Member

+1 to option 2

@harrism
Copy link
Member

harrism commented Jan 18, 2022

On the C++ side, a device_buffer takes an explicit memory resource in any of its methods that allocates memory (e.g. the constructor), and it stores it so that it uses the same MR for deallocation. In C++ we don't call get_current_device_resource in device_buffer, for exactly this reason.

Is there a way you could make it explicit in Python as well, rather than using get_current_device_resource? If the user wants to use get_current_device_resource, they could call it themselves and pass the returned resource to the device buffer.

@jakirkham
Copy link
Member

I think it makes sense to have the option of specifying a resource to use when allocating a DeviceBuffer (we don't have that currently). Possibly worth doing at the same time.

That said, there is still value in having a default case that falls back to get_current_device_resource. Given DeviceBuffer is used all over the place, changing how it gets called would be too disruptive.

@vyasr
Copy link
Contributor

vyasr commented Jan 18, 2022

I agree, making the memory resource an optional constructor parameter would be useful. I don't think we should get rid of the default behavior though, at least not in the short term.

Option 2 also seems good to me.

@harrism
Copy link
Member

harrism commented Jan 19, 2022

Can we make it look like the C++ API (the default value of the optional parameter is a call to get_current_device_resource)? That way it's very clear from the interface what the default behavior is.

device_buffer(void const* source_data,
std::size_t size,
cuda_stream_view stream,
mr::device_memory_resource* mr = mr::get_current_device_resource())

@jakirkham
Copy link
Member

jakirkham commented Jan 19, 2022

We should be able to put it in the same position as we have tried to follow that signature thus far.

While I understand the suggestion to include get_current_device_resource in the function signature and agree it makes sense conceptually, in Python unfortunately this can result in bad behavior as it is only evaluated once on import. So the typical way to handle this in Python would be to assign None to this argument, then check if it is None and replace that inside the function.

That said, we can clarify this behavior further in the docstring.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@harrism
Copy link
Member

harrism commented Mar 21, 2022

@shwina should we bump to next release?

@shwina
Copy link
Contributor Author

shwina commented Mar 21, 2022

Just did so - thanks!

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@harrism harrism moved this to In Progress in RMM Project Board Jul 20, 2023
@wence-
Copy link
Contributor

wence- commented Apr 4, 2024

This was closed by #1514

@wence- wence- closed this as completed Apr 4, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in RMM Project Board Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working inactive-30d inactive-90d Python Related to RMM Python API
Projects
Status: Done
5 participants