<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>
</head>
<body dir="ltr">
<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">
<p>Dear list</p>
<p><br>
</p>
<p>When I use the sockets provider (deprecated, I know) and start a client node before the server/master node, a message sent from the client to the master fails in fi_send with (ret == -FI_ENOENT). I retry until the master has started, at which point the message completes as expected and everything is fine.</p>
<p><br>
</p>
<p>When I use tcp;ofi_rxm and start the client node first, a message sent to the master fails with
<span>(ret == -FI_EAGAIN)</span>. Unfortunately, if I keep retrying while the master node starts up, the message never completes.</p>
<p><br>
</p>
<p>This means that when I start N nodes from a script and the master node takes longer to come up than one or more of the clients, the job hangs. I had not expected this to be a problem, since it worked fine with sockets.</p>
<p><br>
</p>
<p>Is there anything I can do to make the tcp version behave the same way as the sockets one?</p>
<p><br>
</p>
<p>Thanks</p>
<p><br>
</p>
<p>JB<br>
</p>
</div>
</body>
</html>