-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Issue]: any dynamic library linking ROCR cannot be dlopen'ed multiple times due to RuntimeCleanup
.
#275
Comments
Hi @benvanik. Internal ticket has been created to investigate your issue. Thanks! |
if it's helpful, here's a small reproducer: #include <dlfcn.h>
#include <stdio.h>
typedef unsigned hsa_status_t;
typedef hsa_status_t (*hsa_init_pfn_t)(void);
typedef hsa_status_t (*hsa_shut_down_pfn_t)(void);
static hsa_status_t hsa_lifecycle(void) {
void* library = dlopen("libhsa-runtime64.so", RTLD_LAZY | RTLD_LOCAL);
hsa_init_pfn_t hsa_init_ptr = (hsa_init_pfn_t)dlsym(library, "hsa_init");
hsa_shut_down_pfn_t hsa_shut_down_ptr =
(hsa_shut_down_pfn_t)dlsym(library, "hsa_shut_down");
hsa_status_t err = 0;
err = hsa_init_ptr();
if (err) return err;
err = hsa_shut_down_ptr();
if (err) return err;
dlclose(library);
return 0;
}
int main(int argc, char** argv) {
for (int i = 0; i < 100; ++i) {
fprintf(stderr, "trying initialization %d...\n", i);
hsa_status_t err = hsa_lifecycle();
if (err) {
fprintf(stderr, "failed to initialize: %u\n", err);
return err;
}
}
return 0;
} $ cc -ldl hsa_reinit.c -o hsa_reinit
$ ./hsa_reinit
trying initialization 0...
trying initialization 1...
failed to initialize: 4104 (with 4104 being the |
Hi @benvanik, The problem in those other cases would occur because when the main thread ends some globals would be destructed even though there are still threads that were referencing those objects. The original assumption was that the objects would last for the duration of the program. The fix is to use dynamic memory allocation to make those objects continue to live after the main thread ends. It could be the case here that line in runtime.cpp
Anyway, you could try doing the "new" trick or deleting the instantiation altogether to see if that fixes the issue. If it does, we can decide which of the 2 is the best course of action to address the problem. |
If we made it a pointer like that then it'll never be called (nothing will ever delete it), so it'd be best to delete the whole thing. Otherwise we introduce a new leak (and there are already quite a few leaks - ASAN is very unhappy :). I think the issue is thinking that there's a "main thread" - as a library, HSA doesn't get to decide what that thread is or its lifetime, and anything it tries to do presuming one exists or that anything HSA related is tied to its lifetime is going to cause problems. I don't see the purpose of RuntimeCleanup given that Runtime::Acquire and Runtime::Release already clean up - it seems like it's just there to hand-hold users who don't properly call hsa_shut_down. |
If it becomes a pointer, the nifty counter idiom could help in cleanup. |
I meant "main" from the POV of ROCr (the thread from which ROCr is initialized). |
Any live outstanding allocations after the last hsa_shut_down that drops its ref count to 0 are leaks - it's live memory that is still in the process. You can silence them but it's not a great practice as it hides other issues that can arise (like, say, with reinitialization :). I'm still not clear what RuntimeCleanup is for; Acquire does if (runtime_singleton_ == NULL) {
memset(log_flags, 0, sizeof(log_flags));
runtime_singleton_ = new Runtime();
} and Release does if (runtime_singleton_->ref_count_ == 0) {
delete runtime_singleton_;
runtime_singleton_ = nullptr;
} There shouldn't be a need to have RuntimeCleanup also do |
I need to be able to dlopen, use (hsa_init, ...., hsa_shut_down), dlclose, and dlopen the ROCR runtime or a library that statically links in the ROCR runtime. This regressed in 9b13bcd and is now broken: only one initialization of ROCR is allowed per process launch. Previously this worked.
The changes introduced in
Runtime::Acquire
check that HSA has been cleaned up via a dlclose (global destructors run) and fails withOUT_OF_RESOURCES
if it ever has been. This prevents the runtime from ever being dlopen'ed (directly or indirectly when embeddeed) more than once in a process as it's relying on a global initializer to set a non-zero static value and that initializer may never be run. Since HSA hashsa_init
andhsa_shut_down
for explicit management the additional single-load logic inRuntimeCleanup
makeshsa_init
/hsa_shut_down
not behave as documented unless HSA is statically linked into a top-level executable or only ever dynamically loaded once. This is unfortunate as HSA behavior now differs whether statically linked or dynamically linked and such reinitialization hostile behavior prevents any user of ROCR from ever being reinitialized. Note that this extends beyond just using ROCR as a dynamic library: any other dynamic library that links ROCR in statically now also cannot be reinitialized. Python modules and other systems that use plugins require reinitialization.The cause is that upon first dlopen (or process init) the global
loaded_
flag is set to true as part of a global initializer:ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp
Line 114 in 2cc279d
If a process then requests an unload of the ROCR library (or any library with ROCR statically linked into it) with dlclose the global deinitializer will be called and the
RuntimeCleanup cleanup_at_load_
destructor will run and reset the loaded flag to false:ROCR-Runtime/runtime/hsa-runtime/core/runtime/runtime.cpp
Line 125 in 2cc279d
Unfortunately the next dlopen (of either ROCR or any library containing it) may only maybe run the initializers again. In most modern cases the dlopen does not run the initializers again. The
loaded_
flag remains false since thestatic bool loaded_ = true;
is not re-run and thus the next call tohsa_init
fails withOUT_OF_RESOURCES
because it is false.hsa_init
andhsa_shut_down
are already ref counted - as are libraries managed with dlopen and dlclose - and the current implementation is causing those to conflict. The canonical approach to solving this is to rely only on zero initialization: initializers should always be zero, initialization routines should set their managed values to non-zero values upon explicit use (hsa_init from ref count 0), and they should set their values back to zero upon explicit deinitialization (hsa_shut_down to ref count 0). This allows things like the global ref count to start at 0, transition 0->1 on the first call to hsa_init, transition 1->0 on the last call to hsa_shut_down, and then regardless of whether dlclose->dlopen runs the initializers again still see a ref count of 0.A dlopen, hsa_init, hsa_shut_down, and dlclose should leave HSA in a state where it can be reinitialized. Code that cannot be reinitialized infects any software that may be loading HSA as they themselves cannot be reinitialized and there's unfortunately no way for any software above HSA to fix this issue.
I suspect the
RuntimeCleanup
was added for a reason but instead of relying on a non-zero globally initialized value it'd be good to fix any of the globals that were previously assuming non-zero initialization. The library should not be relying on one-shot global initializers and instead handle initialization and cleanup explicitly with the existinghsa_init
andhsa_shut_down
ref counting behavior. Prior to this change things seemed to work just fine and as expected (ROCR or a library linking it could be reinitialized) - I don't doubt there may have been bugs (C++ static initialization is... not fun :) but those bugs should be fixed instead of preventing reinitialization./cc @cfreeamd
The text was updated successfully, but these errors were encountered: