[libfabric-users] subtle difference between sockets and tcp; ofi_rxm

Biddiscombe, John A. john.biddiscombe at cscs.ch
Fri Apr 30 00:27:18 PDT 2021


Dear list


When I use the sockets provider (deprecated, I know) and start a client node before the server/master node, a message sent from the client to the master fails in fi_send with ret == -FI_ENOENT. I retry until the master has started, at which point the message completes as expected and everything is fine.


When I use tcp;ofi_rxm and start the client node first, the send to the master fails with ret == -FI_EAGAIN. Unfortunately, if I keep retrying while the master node starts up, the message never completes.
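
For reference, the retry pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual application code. The names send_with_retry, send_fn, progress_fn and the fake_peer mock are hypothetical, and the plain errno values EAGAIN and ENOENT stand in for FI_EAGAIN and FI_ENOENT. The optional progress callback reflects an assumption: that a provider returning -FI_EAGAIN may need the application to drive progress (for example by reading the completion queue) between attempts. Whether that resolves the hang described here is not confirmed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback types. In a real program try_send would wrap
 * fi_send() and progress would wrap a completion-queue read. */
typedef ssize_t (*send_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Retry a send while it reports a "peer not ready yet" style error.
 * EAGAIN and ENOENT are stand-ins for FI_EAGAIN and FI_ENOENT.
 * Returns 0 on success, or the last error once max_attempts is reached. */
ssize_t send_with_retry(send_fn try_send, progress_fn progress,
                        void *ctx, unsigned max_attempts)
{
    ssize_t ret = -EAGAIN;

    assert(try_send != NULL);
    for (unsigned i = 0; i < max_attempts; ++i) {
        ret = try_send(ctx);
        if (ret != -EAGAIN && ret != -ENOENT)
            return ret;            /* success or a non-retriable error */
        if (progress != NULL)
            progress(ctx);         /* assumption: give the provider a chance to progress */
    }
    return ret;
}

/* Mock peer for illustration only: sends fail until the master is "up". */
struct fake_peer {
    int ready_after;   /* number of attempts that fail before the peer is up */
    int attempts;      /* send attempts made so far */
    int progressed;    /* number of progress calls made so far */
};

ssize_t fake_send(void *ctx)
{
    struct fake_peer *p = ctx;
    p->attempts++;
    return (p->attempts <= p->ready_after) ? -EAGAIN : 0;
}

void fake_progress(void *ctx)
{
    struct fake_peer *p = ctx;
    p->progressed++;
}
```

In a real client, try_send would call fi_send on the endpoint and progress would poll the completion queue bound to it; those wrappers are omitted here because they depend on the application's endpoint setup. Bounding the number of attempts also turns a silent hang into a reportable error.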


This means that when I start N nodes from a script and the master node takes longer to come up than one or more of the clients, the job hangs. I had assumed this was not a problem, since it worked fine with sockets.


Is there anything I can do to make the tcp version behave the same way as the sockets one?


Thanks


JB

