<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.2722" name=GENERATOR></HEAD>
<BODY>
<BLOCKQUOTE
style="PADDING-LEFT: 5px; MARGIN-LEFT: 5px; BORDER-LEFT: #0000ff 2px solid; MARGIN-RIGHT: 0px">
<DIV class=OutlookMessageHeader lang=en-us dir=ltr align=left>
<HR tabIndex=-1>
<FONT face=Tahoma size=2><B>From:</B> openib-general-bounces@openib.org
[mailto:openib-general-bounces@openib.org] <B>On Behalf Of </B>Michael
Krause<BR><B>Sent:</B> Tuesday, November 08, 2005 1:08 PM<BR><B>To:</B> Ranjit
Pandit<BR><B>Cc:</B> openib-general@openib.org<BR><B>Subject:</B> Re:
[openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to
OpenIB<BR></FONT><BR></DIV>
<DIV></DIV><FONT size=3>At 12:33 PM 11/8/2005, Ranjit Pandit wrote:<BR>
<BLOCKQUOTE class=cite cite="" type="cite">> Mike wrote:<BR>> -
RDS does not solve a set of failure models. For example, if a RNIC /
HCA<BR>> were to fail, then one cannot simply replay the operations on
another RNIC /<BR>> HCA without extracting state, etc. and providing some
end-to-end sync of<BR>> what was really sent / received by the
application. Yes, one can recover<BR>> from cable or switch port
failure by using APM style recovery but that is<BR>> only one class of
faults. The harder faults either result in the end node<BR>> being
cast out of the cluster or see silent data corruption unless<BR>>
additional steps are taken to transparently recover - again app
writers<BR>> don't want to solve the hard problems; they want that done
for them.<BR><BR>The current reference implementation of RDS solves the HCA
failure case as well.<BR>Since applications don't need to keep connection
states, it's easier<BR>to handle cases like HCA and intermediate path
failures.<BR>As far as the application is concerned, every sendmsg 'could'
result in a<BR>new connection setup in the driver.<BR>If the current path
fails, RDS reestablishes a connection, if<BR>available, on a different port
or a different HCA, and replays the<BR>failed messages.<BR>Using APM is not
useful because it doesn't provide failover across HCAs.</BLOCKQUOTE>
<DIV><BR>I think others may disagree about whether RDS solves the
problem. You have no way of knowing whether something was received or
not into the other node's coherency domain without some intermediary or
application's involvement to see the data arrived. As such, you might
see many hardware level acks occur and not know there is a real failure.
If an application takes any action assuming that send complete means it is
delivered, then it is subject to silent data corruption. Hence, RDS can
replay to its heart's content, but until there is an application- or middleware-level
acknowledgement, you have not solved the fault domain issues.
Some may be happy with this as they just cast out the endnode from the cluster
/ database but others see the loss of a server as a big deal so may not be
happy to see this occur. It really comes down to whether you believe
losing a server is worthwhile just for a local failure event which is not
fatal to the rest of the server.<BR><BR><SPAN class=945263916-09112005><FONT
face=Arial color=#0000ff size=2>[cait] </FONT></SPAN></DIV></BLOCKQUOTE>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>Applications should not infer anything from send
completion other than that their source</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>buffer is no longer required for the transmit to
complete.</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005></SPAN></FONT></FONT></FONT> </DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>That is the only assumption that can be supported in a
transport neutral way.</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005></SPAN></FONT></FONT></FONT> </DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>I'll also point out that even under InfiniBand the fact
that a send or write has</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>completed does NOT guarantee that the remote peer has
*noticed* the data.</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>The remote peer could fail *after* the data has been
delivered to it and before</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>it has had a chance to act upon it. A well-designed
robust application should</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>never rely on anything other than a peer ack to
indicate that the peer has truly</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>taken ownership of transmitted
information.</SPAN></FONT></FONT></FONT></DIV>
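<DIV><FONT face=Arial color=#0000ff size=2>To make the distinction concrete, here is a minimal sketch in Python. It is an illustration only, not RDS or verbs code: the class and method names (AckTrackingSender, on_send_complete, on_peer_ack, to_replay) are hypothetical. Send completion releases only the source buffer; a message stays in doubt until the peer application acks it.</FONT></DIV>
<PRE>
```python
class AckTrackingSender:
    """Hypothetical sender that separates send completion from peer ack."""

    def __init__(self):
        self.next_seq = 0
        self.buffers_in_use = set()  # source buffers the transport still needs
        self.awaiting_ack = {}       # seq -> payload kept until the peer acks

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.buffers_in_use.add(seq)
        self.awaiting_ack[seq] = payload
        return seq

    def on_send_complete(self, seq):
        # Transport-level completion: the source buffer may be reused.
        # Nothing is implied about what the peer application has seen.
        self.buffers_in_use.discard(seq)

    def on_peer_ack(self, seq):
        # Application-level ack: the peer has taken ownership of the data.
        self.awaiting_ack.pop(seq, None)

    def to_replay(self):
        # After a failure, every message the peer has not acked is in doubt.
        return sorted(self.awaiting_ack.items())


sender = AckTrackingSender()
first = sender.send(b"update-1")
second = sender.send(b"update-2")
sender.on_send_complete(first)
sender.on_send_complete(second)
sender.on_peer_ack(first)
print(sender.to_replay())  # prints [(1, b'update-2')]
```
</PRE>
<DIV><FONT face=Arial color=#0000ff size=2>Both sends have completed here, yet the second message is still a replay candidate because no peer ack has arrived. A receiver would also need to discard duplicates, for example by sequence number, since a replayed message may already have been delivered.</FONT></DIV>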
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005></SPAN></FONT></FONT></FONT> </DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>The essence of RDS, or any similar solution, is the
delivery of messages with</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>datagram semantics reliably over point-to-point
reliable connections. So whatever</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>reliability and fault-tolerance benefits the reliable
connections provide are inherited by</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>the RDS layer. After that it is mostly a matter of how
you avoid head-of-line</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>blocking problems when there is no receive buffer. You
don't want to send</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>an RNR (or drop the DDP Segment under iWARP) because
*one* endpoint</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>does not have available buffers. Other than that any
reliable datagram service</SPAN></FONT></FONT></FONT></DIV>
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005>should be just as reliable as the underlying RC
service.</SPAN></FONT></FONT></FONT></DIV>
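<DIV><FONT face=Arial color=#0000ff size=2>The head-of-line blocking concern can be shown with a toy model in Python. This is a sketch under stated assumptions, not a description of how RDS handles it: the function names and the idea of parking messages for a starved endpoint are hypothetical. Several endpoints (identified here by port number) share one connection, and one of them has posted no receive buffers.</FONT></DIV>
<PRE>
```python
def deliver_fifo(arrivals, free_buffers):
    """Strict in-order delivery: stall at the first endpoint with no buffer."""
    free = dict(free_buffers)
    delivered = []
    for port, msg in arrivals:
        if free.get(port, 0) == 0:
            break  # everything queued behind this message is blocked
        free[port] -= 1
        delivered.append((port, msg))
    return delivered


def deliver_per_endpoint(arrivals, free_buffers):
    """Park messages for a starved endpoint and keep serving the others."""
    free = dict(free_buffers)
    delivered, parked = [], []
    for port, msg in arrivals:
        if free.get(port, 0) == 0:
            parked.append((port, msg))  # assumption: held until a buffer is posted
            continue
        free[port] -= 1
        delivered.append((port, msg))
    return delivered, parked


arrivals = [(7, "a"), (9, "b"), (7, "c"), (9, "d")]
free_buffers = {7: 2, 9: 0}  # endpoint 9 has no receive buffers available

print(deliver_fifo(arrivals, free_buffers))
# prints [(7, 'a')]
print(deliver_per_endpoint(arrivals, free_buffers))
# prints ([(7, 'a'), (7, 'c')], [(9, 'b'), (9, 'd')])
```
</PRE>
<DIV><FONT face=Arial color=#0000ff size=2>In the strict FIFO case, endpoint 7 is starved of message "c" only because endpoint 9 has no buffer, which is the effect an RNR on the shared connection would have.</FONT></DIV>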
<DIV><FONT face=Arial><FONT size=2><FONT color=#0000ff><SPAN
class=945263916-09112005></SPAN></FONT></FONT></FONT> </DIV></FONT></BODY></HTML>