[openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB
krause at cup.hp.com
Wed Nov 9 12:18:28 PST 2005
At 11:42 AM 11/9/2005, Greg Lindahl wrote:
>On Tue, Nov 08, 2005 at 01:08:13PM -0800, Michael Krause wrote:
> > If an application takes any action assuming that send complete means
> > it is delivered, then it is subject to silent data corruption.
>Right. That's the same as pretty much all other *transport* layers. I
>don't think anyone's asserting RDS is any different: you can't assume
>the other side's application received and acted on your message until
>the other side's application tells you that it did.
So failures such as an HCA failure are not transparent, and one cannot simply
replay the operations, since the sender does not know what the other side
actually saw unless the application performs the resync itself. Hence, while
RDS can attempt to retransmit, the application must either cope with
duplicates or note the error, resync, and then retransmit to avoid them.
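To make the duplicate problem concrete, here is a minimal sketch of what
the receiving application might do, assuming it tags every message with a
per-sender sequence number. The class and method names (DedupReceiver,
on_message, resync_point) are hypothetical and are not part of RDS or any
OpenIB API.

```python
class DedupReceiver:
    """Delivers each sender's messages at most once, in sequence order."""

    def __init__(self):
        self.next_expected = {}  # sender id -> next sequence number to deliver
        self.delivered = []      # payloads handed to the application, in order

    def on_message(self, sender, seq, payload):
        expected = self.next_expected.get(sender, 0)
        if seq < expected:
            # Already delivered: this is a replay after a failure, so drop it.
            return "duplicate"
        if seq > expected:
            # Something was lost; the application has to resync with the peer.
            return "gap"
        self.delivered.append(payload)
        self.next_expected[sender] = expected + 1
        return "delivered"

    def resync_point(self, sender):
        # The sequence number the peer should resume from after a failure.
        return self.next_expected.get(sender, 0)


rx = DedupReceiver()
rx.on_message("node-a", 0, "m0")
rx.on_message("node-a", 1, "m1")
# The sender's HCA fails; not knowing what arrived, it replays from 0.
replayed = [rx.on_message("node-a", s, p) for s, p in [(0, "m0"), (1, "m1"), (2, "m2")]]
```

The replayed copies of m0 and m1 are dropped and only m2 is delivered. The
alternative is for the sender to ask for resync_point() first and
retransmit only from there, which avoids generating duplicates at all.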
BTW, host-based transport implementations can transparently recover from
device failure on behalf of applications since their state is in the host
and not in the failed device - this is true for networking, storage,
etc. HCA / RNIC / TOE / FC / etc. all lose state or cannot be trusted,
and thus must rely upon upper level software to perform the recovery, resync,
retransmission, etc. Unless RDS has implemented its own state checkpoint
between endnodes, this class of failures must be solved by the application
since it cannot be solved in the hardware. Hence, RDS may push some of its
reliability requirements to the interconnect but it does not eliminate all
reliability requirements from the application or RDS itself.
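As an illustration of keeping the recovery state in the host, the sketch
below has the sender hold every message until the peer application
acknowledges it, so that the unacknowledged tail can be replayed over a
different device. All names here (ReplayingSender, the transmit callable,
on_app_ack, on_device_failure) are assumptions made for the example, not
RDS interfaces.

```python
class ReplayingSender:
    """Keeps unacknowledged messages in host memory for replay after failure."""

    def __init__(self, transmit):
        self.transmit = transmit  # hypothetical callable(seq, payload) for the current device
        self.next_seq = 0
        self.unacked = {}         # seq -> payload, kept in send order

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload  # retained until the peer application acks
        self.transmit(seq, payload)
        return seq

    def on_send_complete(self, seq):
        # A local send completion says nothing about what the peer
        # application saw, so the message is deliberately NOT released here.
        pass

    def on_app_ack(self, acked_through):
        # Only an application-level ack releases the retained messages.
        for seq in [s for s in self.unacked if s <= acked_through]:
            del self.unacked[seq]

    def on_device_failure(self, new_transmit, peer_next_expected):
        # Resync: the peer reports the next sequence number it expects,
        # then everything from that point on is replayed on the new device.
        self.transmit = new_transmit
        self.on_app_ack(peer_next_expected - 1)
        for seq, payload in self.unacked.items():
            self.transmit(seq, payload)


first_path, second_path = [], []
tx = ReplayingSender(lambda seq, p: first_path.append((seq, p)))
for msg in ("a", "b", "c"):
    tx.send(msg)
tx.on_send_complete(2)  # the send completed locally, but nothing is released
tx.on_app_ack(0)        # the peer application confirmed message 0 only
tx.on_device_failure(lambda seq, p: second_path.append((seq, p)),
                     peer_next_expected=2)
```

Here the peer had actually received messages 0 and 1 before the failure,
so only message 2 is retransmitted on the second path. Whether this logic
lives in the application or in RDS itself, it has to live in the host.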