[openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

Michael Krause krause at cup.hp.com
Wed Nov 9 12:18:28 PST 2005


At 11:42 AM 11/9/2005, Greg Lindahl wrote:
>On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote:
>
> > If an application takes any action assuming that send complete means
> > it is delivered, then it is subject to silent data corruption.
>
>Right. That's the same as pretty much all other *transport* layers. I
>don't think anyone's asserting RDS is any different: you can't assume
>the other side's application received and acted on your message until
>the other side's application tells you that it did.

So failures such as an HCA failure are not transparent, and one cannot 
simply replay the operations: the sender does not know what the other side 
actually saw unless the application performs the resync itself.  Hence, 
while RDS can attempt to retransmit, the application must either deal with 
duplicates, etc., or note the error, resync, and then retransmit to avoid 
duplicates.
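
To make the "deal with duplicates" option concrete, here is a minimal 
sketch of receiver-side duplicate suppression in Python.  It is not part of 
RDS or any OpenIB API; the class name, the per-sender sequence numbers 
starting at 1, and the assumption of in-order delivery from each sender are 
all hypothetical choices made for illustration.

```python
class DuplicateFilter:
    """Application-level duplicate suppression (hypothetical sketch).

    Assumes each sender numbers its messages 1, 2, 3, ... and that
    messages from one sender arrive in order.
    """

    def __init__(self):
        # Highest sequence number already processed, per sender.
        self._last_seen = {}

    def accept(self, sender, seq):
        """Return True if the message is new and should be processed."""
        last = self._last_seen.get(sender, 0)
        if seq <= last:
            # Already processed: this is a replay after a retransmit.
            return False
        self._last_seen[sender] = seq
        return True
```

A receiver would call accept() before acting on each message and silently 
drop anything for which it returns False, so a blind retransmit by the 
sender after a failure does not cause the same operation to be applied twice.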

BTW, host-based transport implementations can transparently recover from 
device failure on behalf of applications since their state is in the host 
and not in the failed device - this is true for networking, storage, 
etc.  HCA / RNIC / TOE / FC / etc. all lose state or cannot be trusted, 
and thus must rely upon upper-level software to perform the recovery, 
resync, retransmission, etc.  Unless RDS has implemented its own state 
checkpoint between endnodes, this class of failures must be solved by the 
application since it cannot be solved in the hardware.  Hence, RDS may push 
some of its reliability requirements to the interconnect, but that does not 
eliminate all reliability requirements from the application or RDS itself.
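
As a rough illustration of the "note the error, resync, and retransmit" 
path, the Python sketch below keeps the sender's state in host memory until 
the peer application acknowledges it, then after a failure asks the peer 
how far it got and resends only the remainder.  Everything here is an 
assumption for illustration: the class, the cumulative application-level 
ack, and the transmit callback are hypothetical and are not RDS interfaces.

```python
from collections import OrderedDict


class ResyncingSender:
    """Host-side send state with application-level resync (hypothetical)."""

    def __init__(self):
        self._next_seq = 1
        # seq -> payload, kept in host memory until the peer
        # application confirms it has processed the message.
        self._unacked = OrderedDict()

    def send(self, payload, transmit):
        """Assign a sequence number, remember the payload, and send it."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked[seq] = payload
        transmit(seq, payload)
        return seq

    def on_app_ack(self, seq):
        """Peer application processed everything up to and including seq."""
        for s in [s for s in self._unacked if s <= seq]:
            del self._unacked[s]

    def resync(self, peer_last_processed, transmit):
        """After a device failure, resend only what the peer never processed.

        peer_last_processed is the sequence number the peer application
        reports during the resync exchange.
        """
        self.on_app_ack(peer_last_processed)
        for seq, payload in self._unacked.items():
            transmit(seq, payload)
        return list(self._unacked)
```

Note that a local send completion never removes an entry here; only the 
peer application's acknowledgement does, which is the distinction drawn 
earlier in this thread.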

Mike 