Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue]: any dynamic library linking ROCR cannot be dlopen'ed multiple times due to RuntimeCleanup. #275

Open
benvanik opened this issue Jan 8, 2025 · 7 comments

Comments

@benvanik
Copy link
Contributor

benvanik commented Jan 8, 2025

I need to be able to dlopen, use (hsa_init, ...., hsa_shut_down), dlclose, and dlopen the ROCR runtime or a library that statically links in the ROCR runtime. This regressed in 9b13bcd and is now broken: only one initialization of ROCR is allowed per process launch. Previously this worked.

The changes introduced in Runtime::Acquire check that HSA has been cleaned up via a dlclose (global destructors run) and fails with OUT_OF_RESOURCES if it ever has been. This prevents the runtime from ever being dlopen'ed (directly or indirectly when embeddeed) more than once in a process as it's relying on a global initializer to set a non-zero static value and that initializer may never be run. Since HSA has hsa_init and hsa_shut_down for explicit management the additional single-load logic in RuntimeCleanup makes hsa_init/hsa_shut_down not behave as documented unless HSA is statically linked into a top-level executable or only ever dynamically loaded once. This is unfortunate as HSA behavior now differs whether statically linked or dynamically linked and such reinitialization hostile behavior prevents any user of ROCR from ever being reinitialized. Note that this extends beyond just using ROCR as a dynamic library: any other dynamic library that links ROCR in statically now also cannot be reinitialized. Python modules and other systems that use plugins require reinitialization.

The cause is that upon first dlopen (or process init) the global loaded_ flag is set to true as part of a global initializer:

static bool loaded_ = true;

If a process then requests an unload of the ROCR library (or any library with ROCR statically linked into it) with dlclose the global deinitializer will be called and the RuntimeCleanup cleanup_at_load_ destructor will run and reset the loaded flag to false:

Unfortunately the next dlopen (of either ROCR or any library containing it) may only maybe run the initializers again. In most modern cases the dlopen does not run the initializers again. The loaded_ flag remains false since the static bool loaded_ = true; is not re-run and thus the next call to hsa_init fails with OUT_OF_RESOURCES because it is false.

hsa_init and hsa_shut_down are already ref counted - as are libraries managed with dlopen and dlclose - and the current implementation is causing those to conflict. The canonical approach to solving this is to rely only on zero initialization: initializers should always be zero, initialization routines should set their managed values to non-zero values upon explicit use (hsa_init from ref count 0), and they should set their values back to zero upon explicit deinitialization (hsa_shut_down to ref count 0). This allows things like the global ref count to start at 0, transition 0->1 on the first call to hsa_init, transition 1->0 on the last call to hsa_shut_down, and then regardless of whether dlclose->dlopen runs the initializers again still see a ref count of 0.

A dlopen, hsa_init, hsa_shut_down, and dlclose should leave HSA in a state where it can be reinitialized. Code that cannot be reinitialized infects any software that may be loading HSA as they themselves cannot be reinitialized and there's unfortunately no way for any software above HSA to fix this issue.

I suspect the RuntimeCleanup was added for a reason but instead of relying on a non-zero globally initialized value it'd be good to fix any of the globals that were previously assuming non-zero initialization. The library should not be relying on one-shot global initializers and instead handle initialization and cleanup explicitly with the existing hsa_init and hsa_shut_down ref counting behavior. Prior to this change things seemed to work just fine and as expected (ROCR or a library linking it could be reinitialized) - I don't doubt there may have been bugs (C++ static initialization is... not fun :) but those bugs should be fixed instead of preventing reinitialization.

/cc @cfreeamd

@ppanchad-amd
Copy link

Hi @benvanik. Internal ticket has been created to investigate your issue. Thanks!

@benvanik
Copy link
Contributor Author

benvanik commented Jan 8, 2025

if it's helpful, here's a small reproducer:

#include <dlfcn.h>
#include <stdio.h>

typedef unsigned hsa_status_t;
typedef hsa_status_t (*hsa_init_pfn_t)(void);
typedef hsa_status_t (*hsa_shut_down_pfn_t)(void);

static hsa_status_t hsa_lifecycle(void) {
  void* library = dlopen("libhsa-runtime64.so", RTLD_LAZY | RTLD_LOCAL);
  hsa_init_pfn_t hsa_init_ptr = (hsa_init_pfn_t)dlsym(library, "hsa_init");
  hsa_shut_down_pfn_t hsa_shut_down_ptr =
      (hsa_shut_down_pfn_t)dlsym(library, "hsa_shut_down");

  hsa_status_t err = 0;

  err = hsa_init_ptr();
  if (err) return err;

  err = hsa_shut_down_ptr();
  if (err) return err;

  dlclose(library);

  return 0;
}

int main(int argc, char** argv) {
  for (int i = 0; i < 100; ++i) {
    fprintf(stderr, "trying initialization %d...\n", i);
    hsa_status_t err = hsa_lifecycle();
    if (err) {
      fprintf(stderr, "failed to initialize: %u\n", err);
      return err;
    }
  }
  return 0;
}
$ cc -ldl hsa_reinit.c -o hsa_reinit
$ ./hsa_reinit
trying initialization 0...
trying initialization 1...
failed to initialize: 4104

(with 4104 being the HSA_STATUS_ERROR_OUT_OF_RESOURCES as returned by the Runtime::Acquire check)

@cfreeamd
Copy link
Contributor

cfreeamd commented Jan 8, 2025

Hi @benvanik,
The change 9b13bcd fixed several static init issues, but it also introduced a few deinitialization issues. There have been a few other commits to address those.
I think the issue you are seeing may be one that was missed.

The problem in those other cases would occur because when the main thread ends some globals would be destructed even though there are still threads that were referencing those objects. The original assumption was that the objects would last for the duration of the program. The fix is to use dynamic memory allocation to make those objects continue to live after the main thread ends.

It could be the case here that line in runtime.cpp
static RuntimeCleanup cleanup_at_unload_;
should be changed to
static RuntimeCleanup *cleanup_at_unload_ = new RuntimeCleanup;
...or maybe RuntimeCleanup and that line should be deleted altogether as the change would make it pointless. I'm not sure what the point of that class/instantiation is regardless.

loaded was static even before the change that introduced the issue. I'm guessing the change somehow made RuntimeCleanup get called when it never was previously.

Anyway, you could try doing the "new" trick or deleting the instantiation altogether to see if that fixes the issue. If it does, we can decide which of the 2 is the best course of action to address the problem.

@benvanik
Copy link
Contributor Author

benvanik commented Jan 8, 2025

If we made it a pointer like that then it'll never be called (nothing will ever delete it), so it'd be best to delete the whole thing. Otherwise we introduce a new leak (and there are already quite a few leaks - ASAN is very unhappy :).

I think the issue is thinking that there's a "main thread" - as a library, HSA doesn't get to decide what that thread is or its lifetime, and anything it tries to do presuming one exists or that anything HSA related is tied to its lifetime is going to cause problems. I don't see the purpose of RuntimeCleanup given that Runtime::Acquire and Runtime::Release already clean up - it seems like it's just there to hand-hold users who don't properly call hsa_shut_down.

@ypapadop-amd
Copy link
Contributor

If it becomes a pointer, the nifty counter idiom could help in cleanup.

@cfreeamd
Copy link
Contributor

cfreeamd commented Jan 8, 2025

I meant "main" from the POV of ROCr (the thread from which ROCr is initialized).
Some of the leaks you may see from ASAN may not really be leaks--they are meant to fix this problem. That is, they are not supposed to be deleted. Though I'm sure there are legit leaks and I know people have been working on that.

@benvanik
Copy link
Contributor Author

benvanik commented Jan 8, 2025

Any live outstanding allocations after the last hsa_shut_down that drops its ref count to 0 are leaks - it's live memory that is still in the process. You can silence them but it's not a great practice as it hides other issues that can arise (like, say, with reinitialization :).

I'm still not clear what RuntimeCleanup is for; Acquire does

  if (runtime_singleton_ == NULL) {
    memset(log_flags, 0, sizeof(log_flags));
    runtime_singleton_ = new Runtime();
  }

and Release does

  if (runtime_singleton_->ref_count_ == 0) {
    delete runtime_singleton_;
    runtime_singleton_ = nullptr;
  }

There shouldn't be a need to have RuntimeCleanup also do delete Runtime::runtime_singleton_;. If a user properly calls hsa_shut_down runtime_singleton_ will be nullptr (verified) - so it's not doing anything. Having looked more at it the whole thing should be removed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants