hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out #33
Comments
The problem occurs when the application code invokes psm2_init() while there are no hfi1 devices present on the system. The call chain eventually invokes hfi_wait_for_device() with a timeout of 0. That is interpreted as 15000ms.
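To illustrate the behavior described in that comment, here is a minimal sketch of how a timeout of 0 can end up meaning "use the default". This is not the libpsm2 source: the helper name `example_effective_timeout_ms` and the macro `EXAMPLE_DEFAULT_WAIT_MS` are hypothetical, and the 15000 value is taken only from the 15.0 seconds in the error message.

```c
#include <stdint.h>

/* Assumed default, taken from the 15.0 s reported in the error message. */
#define EXAMPLE_DEFAULT_WAIT_MS 15000u

/* Hypothetical helper (not a libpsm2 function): maps the timeout a caller
 * asked for to the timeout that is actually used. A request of 0 does not
 * mean "do not wait"; it falls back to the default. */
uint32_t example_effective_timeout_ms(uint32_t requested_ms)
{
    return requested_ms == 0 ? EXAMPLE_DEFAULT_WAIT_MS : requested_ms;
}
```

Under this convention a caller has no way to say "do not wait at all" by passing 0, which would explain why a machine without hfi1 devices stalls for the full default.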
We have two concerns. In theory there was no change between the older and newer code in psm2_init(), but we need to track this and ensure it didn't cause any new issues. PSM2 has always waited for /dev/hfi1* devices to appear if they were not present (as did the Truescale PSM from which it was forked). So the first step is to see whether this is some version-detection issue between libfabric and the newer libpsm2, or whether libpsm2 is doing something different that we are not expecting.
The change is in psm2_hal.c. It is a brand new file. Reference the /* Optimization note: It seems like this is addressing an edge case of handling dynamic psm2_ep_open() takes a timeout parameter via psm2_ep_open_opts. If
An approach of least impact might be to honor an environment variable that activates the waiting in psm2_hal.c:246. The default behavior should be to not wait during psm2_init(), which preserves the behavior prior to 11.2.23. Something like this:
I've tested this in my environment with the hfi1 either absent or present, and fi_info behaves as expected. When the device is not present, fi_info executes in 0.1s and does not report any psm2 providers. When the device is not present and the environment variable is set, fi_info waits 15s. If I bring up hfi1 while fi_info is waiting, fi_info completes execution and reports the psm2 provider.
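The opt-in idea in this comment could be sketched roughly as follows. This is not the patch itself: the variable name `EXAMPLE_PSM2_WAIT_FOR_DEVICE` and the function `example_should_wait_for_device` are hypothetical, chosen only to show the shape of an environment-variable gate that defaults to not waiting.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable name, for illustration only. */
#define EXAMPLE_WAIT_ENV "EXAMPLE_PSM2_WAIT_FOR_DEVICE"

/* Hypothetical helper (not a libpsm2 function).
 * Returns 1 if the user opted in to waiting for the device, 0 otherwise.
 * Unset, empty, or "0" all mean "do not wait", so the default is the
 * fast path on machines without the hardware. */
int example_should_wait_for_device(void)
{
    const char *value = getenv(EXAMPLE_WAIT_ENV);

    if (value == NULL || value[0] == '\0')
        return 0;
    return strcmp(value, "0") != 0;
}
```

An initialization path would consult such a check once and only enter the device wait when it returns 1; otherwise it would report that no device is present and return immediately.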
Here is the patch that I used for libpsm2 in debian/sid:
Should this bug have been closed with 4c06f90?
I tested this patch on libpsm2 11.2.68 and it corrects the problem.
Using the latest released version, any program linked against libpsm2 takes an additional 15s before actually running. This happens on machines without hfi1 devices. The simplest way to reproduce the issue is to test a simple program that uses libpsm2 directly, for example the one given in #17, or simply to run fi_info. Testing both with the older version doesn't exhibit this behavior.
This bug has been filed in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910485