ofi+psm2 client reconnection issues #270
Comments
Hmm, thanks for reporting that. I think @j-xiong, who maintains the PSM2 provider, would be able to advise.
It's a little bit weird to see a failure in the shm path while the client and the server are on different nodes. Is a full stack trace available at the point of the assertion failure?
This is not weird at all... The psm2 library has its own implementation of shared memory that is enabled by default. As I already said, the assertion error is in that library. I've also opened an issue on their official repository: cornelisnetworks/opa-psm2#34. In any case, I've discovered that we have the same issue even if the server and the client are on the very same node.
I'll try to provide one, but in any case it is very easy to reproduce. You just need to issue two consecutive RPCs to the same psm2 server from two different clients on the same node.
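For anyone who wants to script the reproduction described above, here is a minimal sketch of a driver that launches the same client command several times in a row, each run as its own process. The client executable and its arguments are hypothetical placeholders (they are not part of this report); the sketch only assumes that the client connects to the server, issues one RPC, and exits.

```python
import subprocess


def run_clients_sequentially(client_cmd, count=2):
    """Run `client_cmd` `count` times, one after another.

    Each run is a separate process, so each one acts as a distinct client
    that connects, issues its RPC, and disconnects before the next starts.
    Returns the list of exit codes, in launch order.
    """
    exit_codes = []
    for _ in range(count):
        # subprocess.run blocks until the client process has exited.
        result = subprocess.run(client_cmd)
        exit_codes.append(result.returncode)
    return exit_codes


# Hypothetical usage (client binary name and argument are assumptions):
#   run_clients_sequentially(["./my_rpc_client", "<server-address>"])
```

With the behaviour reported here, the first client would be expected to complete normally, and the server-side assertion would show up once the second client connects.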
Yes, I understand the assertion is inside the psm2 library. Normally the shared memory path is not supposed to be reached if connections only happen between different nodes. Did the server happen to also talk to other clients during the test?
That is useful information.
I don't have a ready-to-use setup for mercury, nor have I used one before. So it might be simpler if the trace is available.
Yes, we discovered that the error is triggered when a second client try [...]
We managed to avoid the error by not using psm2 for local communication.
So if I understood correctly, this is probably an error in the psm2 [...]
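The workaround mentioned above (not using psm2 for local communication) could be sketched roughly as follows. This is only an illustration: the thread does not say which transport was used for local traffic, so `na+sm` (Mercury's shared-memory plugin) is an assumption, and the helper name `choose_protocol` is hypothetical.

```python
def choose_protocol(client_host, server_host,
                    local_protocol="na+sm", remote_protocol="ofi+psm2"):
    """Pick a protocol string for a client/server pair.

    When both endpoints are on the same node, return `local_protocol`
    so that psm2 is not used for local communication; otherwise return
    `remote_protocol`. The default for `local_protocol` is an assumption.
    """
    # Compare hostnames case-insensitively, ignoring surrounding whitespace.
    same_node = client_host.strip().lower() == server_host.strip().lower()
    return local_protocol if same_node else remote_protocol
```

For example, `choose_protocol("node1", "node1")` selects the local transport, while `choose_protocol("node1", "node2")` keeps `ofi+psm2`. A plain hostname comparison is the simplest possible check and may not be sufficient on systems where a node has several names.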
Hi @ael-code, we ran into the same problem recently, but in our case the client and the server were running on different nodes.
Describe the bug
After the first client disconnects from the server, subsequent clients that try to connect to the same server trigger an error on the server. Specifically, this is an assertion error in the psm2 library:
To Reproduce
Platform (please complete the following information):
master branch 6d9bfa0
gcc 8.1.0
11.2.68
1.7.0 and 1.6.2 (both versions lead to the same error)