[libfabric-users] Handling disconnects in verbs; ofi_rxm with pending fi_recv

Carsten Patzke carsten.patzke at desy.de
Tue Mar 24 15:45:09 PDT 2020


Hello list,

I am currently using verbs;ofi_rxm to build something that behaves like a REST API.
The client adds a server to its address vector and initially sends a handshake
request to the server, so that the server can store that host in its own AV (because FI_SOURCE_ERR is not supported).

The problem arises when the client sends a request to the server,
but the server crashes and never sends a response back to the client.

The client still has a pending fi_recv posted and never gets a completion for it.

When I enable debug output on the client, I can see that the underlying provider is aware of the disconnect.

Is there a way to detect that the server has disconnected, or a way to time out the pending receive?
Otherwise I would have to put considerable effort into writing a scheduler that eventually calls fi_cancel once a certain time has passed.
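
To make the fallback concrete: such a scheduler could be fairly small. The following is only a minimal sketch in C, under several assumptions: every name in it (recv_tracker, tracker_add, tracker_complete, tracker_expire, cancel_fn, count_cancel) is hypothetical and not part of libfabric, timestamps are supplied by the caller in milliseconds, and the cancel callback stands in for a real call such as fi_cancel(&ep->fid, context).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64  /* arbitrary limit chosen for this sketch */

/* One outstanding receive: the context pointer that was handed to fi_recv
 * and the absolute time (ms) after which the client gives up on it. */
struct pending_recv {
    void    *context;
    uint64_t deadline_ms;
    int      in_use;
};

struct recv_tracker {
    struct pending_recv slots[MAX_PENDING];
};

/* Hypothetical callback; a real client would wrap fi_cancel() here. */
typedef void (*cancel_fn)(void *context, void *user);

/* Record a newly posted receive. Returns 0, or -1 if the table is full. */
static int tracker_add(struct recv_tracker *t, void *context,
                       uint64_t now_ms, uint64_t timeout_ms)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].context = context;
            t->slots[i].deadline_ms = now_ms + timeout_ms;
            t->slots[i].in_use = 1;
            return 0;
        }
    }
    return -1;
}

/* Forget a receive whose completion arrived. Returns -1 if not tracked. */
static int tracker_complete(struct recv_tracker *t, void *context)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slots[i].in_use && t->slots[i].context == context) {
            t->slots[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Invoke cancel() for every receive whose deadline has passed and stop
 * tracking it. Returns the number of receives that were expired. */
static size_t tracker_expire(struct recv_tracker *t, uint64_t now_ms,
                             cancel_fn cancel, void *user)
{
    size_t expired = 0;
    assert(t != NULL && cancel != NULL);
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (t->slots[i].in_use && now_ms >= t->slots[i].deadline_ms) {
            cancel(t->slots[i].context, user);
            t->slots[i].in_use = 0;
            expired++;
        }
    }
    return expired;
}

/* Stand-in for fi_cancel(): only counts how often it was called. */
static void count_cancel(void *context, void *user)
{
    (void)context;
    (*(int *)user)++;
}
```

In this sketch, tracker_expire would be called from the same loop that already polls the completion queue (for example with fi_cq_read), so no extra thread is needed. As far as I understand, fi_cancel is asynchronous and a canceled receive is still reported through the CQ as an error (FI_ECANCELED), so the receive buffer should not be reused until that entry has been read.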

Stay healthy && best regards,
Carsten Patzke

