
hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out #33

Open
mehdid opened this issue Oct 18, 2018 · 7 comments


@mehdid

mehdid commented Oct 18, 2018

With the latest released version, any program linked against libpsm2 takes an additional 15 s before it actually runs. This happens on machines without hfi1 devices. The simplest way to reproduce the issue is to run a small program that uses libpsm2 directly, for example the one given in #17, or simply to run fi_info. Neither shows this behavior with the older version.

This bug has been filed in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910485

@bsmith94
Contributor

The problem occurs when the application code invokes psm2_init() while there are no hfi1 devices present on the system. The call chain eventually invokes hfi_wait_for_device() with a timeout of 0. That is interpreted as 15000ms.
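To make the timeout handling concrete, here is a minimal sketch of the "0 means default" convention described above. The helper and the constant name are hypothetical and are not taken from the libpsm2 source; only the 15000 ms figure comes from this thread.

```c
#include <assert.h>

/* Hypothetical constant: the 15 s default reported in this thread. */
#define DEFAULT_DEVICE_WAIT_MS 15000L

/* Hypothetical helper, not a libpsm2 function: maps the timeout the
 * caller asked for to the time that would actually be waited. A value
 * of 0 is read as "use the default", not as "do not wait", which is
 * why a caller passing 0 ends up blocked for 15 s. */
long effective_wait_ms(long requested_ms)
{
    assert(requested_ms >= 0);
    return requested_ms == 0 ? DEFAULT_DEVICE_WAIT_MS : requested_ms;
}
```

Under this convention there is no value that means "check once and return", so a fail-fast path needs either a separate flag or a different sentinel.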

@rwmcguir
Contributor

We have two concerns. In theory there was no change between the older and newer code in psm2_init(), but we need to track this and ensure it didn't cause any new issues. PSM2 has always waited for /dev/hfi1* devices to appear if they were not present (as did the Truescale PSM from which it was forked). So the first step is to determine whether this is a version detection issue between libfabric and the newer libpsm2, or whether libpsm2 is doing something different that we are not expecting.

@bsmith94
Contributor

The change is in psm2_hal.c. It is a brand new file. Reference the
initialization loop at line 246.

/* Optimization note:
The following code attempts to initialize two different times:
First time assumes that the driver is already up, and so it attempts to
initialize with the loop control variable: wait, set to 0.
The second time, when wait is set to 1, waits for the driver to come up.
(When the parameter to: hfp_get_num_units() call below is 0,
hfp_get_num_units() does not wait for the driver to come up.
When the parameter is non-zero, the hfp_get_num_units() call below,
will wait for the driver to come up.) */

It seems like this is addressing an edge case (dynamic device
creation, or a psm2 process that starts early) at the expense of the
more common case, where the device is created long before the psm2
application executes and psm2 should fail fast if the device isn't
present.

psm2_ep_open() takes a timeout parameter via psm2_ep_open_opts. If
psm2_init() needs to wait for devices, then it seems like it should
also take a timeout parameter.
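
As an illustration of what a caller-supplied timeout could look like, here is a rough sketch of a bounded wait for a device node. wait_for_path() is a hypothetical helper, not a libpsm2 function, and polling with access() is an assumption made for the example; it is not how the library detects the driver. A timeout of 0 means "check once and fail fast".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for `path` to exist.
 * timeout_ms == 0 checks exactly once and returns immediately.
 * Returns 0 if the path exists, -1 if it did not appear in time. */
int wait_for_path(const char *path, long timeout_ms)
{
    assert(path != NULL && timeout_ms >= 0);

    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms between polls */
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        if (access(path, F_OK) == 0)
            return 0;

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed_ms = (long)(now.tv_sec - start.tv_sec) * 1000L
                        + (now.tv_nsec - start.tv_nsec) / 1000000L;
        if (elapsed_ms >= timeout_ms)
            return -1;

        nanosleep(&step, NULL);
    }
}
```

With this shape, a caller that knows the device should already exist would pass 0, for example wait_for_path("/dev/hfi1_0", 0), and only a caller that expects late device creation would pass a non-zero budget.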

@bsmith94
Contributor

bsmith94 commented Oct 20, 2018

An approach of least impact might be to honor an environment variable that activates the waiting in psm2_hal.c:246. The default behavior should be to not wait during psm2_init(), which preserves the behavior prior to 11.2.23.

Something like this:

int wait = 0;
int retries = 0;
/* getenv() is used here for simplicity; real code should use the psm2 environment call */
if (getenv("PSM2_INIT_WAIT_FOR_DEVICES") != NULL) {
  retries = 1;
}
for (wait = 0; wait <= retries; wait++) {

I've tested this in my environment with hfi1 both absent and present, and fi_info behaves as expected. When the device is not present, fi_info executes in 0.1 s and does not report any psm2 providers. When the device is not present and the environment variable is set, fi_info waits 15 s. If I bring up hfi1 while fi_info is waiting, fi_info completes execution and reports the psm2 provider.
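
For anyone who wants to try the control flow without an hfi1 driver, here is a self-contained sketch of the same opt-in loop. The probe_fn type, init_units() and the two stub probes are hypothetical stand-ins for the HAL initialization calls; only the variable name PSM2_INIT_WAIT_FOR_DEVICES comes from the snippet above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical probe: returns the number of units found. A non-zero
 * `wait` stands for "block until the driver comes up or times out". */
typedef int (*probe_fn)(int wait);

/* Reads the opt-in variable named in the snippet above. */
int retries_from_env(void)
{
    return getenv("PSM2_INIT_WAIT_FOR_DEVICES") != NULL ? 1 : 0;
}

/* The first pass (wait == 0) never blocks. The second, waiting pass
 * only runs when retries == 1, so the default path fails fast. */
int init_units(probe_fn probe, int retries)
{
    assert(probe != NULL);
    for (int wait = 0; wait <= retries; wait++) {
        int units = probe(wait);
        if (units > 0)
            return units;
    }
    return 0;
}

/* Stub probes for experimenting: a machine with no device at all, and
 * a device that only shows up during the waiting pass. */
int probe_absent(int wait) { (void)wait; return 0; }
int probe_late(int wait)   { return wait ? 1 : 0; }
```

Calling init_units(probe_late, retries_from_env()) finds no units when the variable is unset and one unit when it is set, which mirrors the fi_info results described above.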

@bsmith94
Contributor

bsmith94 commented Nov 7, 2018

Here is the patch that I used for libpsm2 in debian/sid:

--- a/psm2_hal.c
+++ b/psm2_hal.c
@@ -242,6 +242,13 @@
 		    " instance to successfully initialize))",
 		    PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_INT,
 		    (union psmi_envvar_val)PSM_HAL_INSTANCE_ANY_GEN, &env_hi_pref);
+	union psmi_envvar_val env_hal_init_retry; /* Retry HAL init on failure */
+	psmi_getenv("PSM2_HAL_INIT_RETRY",
+		    "If yes, retry HAL initialization upon failure. A retry"
+			" may take up to 15s to timeout. May not be supported in"
+			" future releases.",
+		    PSMI_ENVVAR_LEVEL_USER, PSMI_ENVVAR_TYPE_YESNO,
+		    PSMI_ENVVAR_VAL_NO, &env_hal_init_retry);
 
 	int wait; /* loop control variable */
 	/* Optimization note:
@@ -253,7 +260,8 @@
 	   hfp_get_num_units() does not wait for the driver to come up.
 	   When the parameter is non-zero, the hfp_get_num_units() call below,
 	   will wait for the driver to come up.) */
-	for (wait=0;wait <= 1;wait++)
+	int retries = env_hal_init_retry.e_int == PSMI_ENVVAR_VAL_YES.e_int? 1 : 0;
+	for (wait=0;wait <= retries;wait++)
 	{
 		struct _psmi_hal_instance *p;
 		SLIST_FOREACH(p, &head_hi, next_hi)

@mehdid
Author

mehdid commented Nov 29, 2018

Should this bug have been closed with 4c06f90?

@bsmith94
Contributor

I tested this patch on libpsm2 11.2.68 and it corrects the problem.


3 participants