[openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB
Michael Krause
krause at cup.hp.com
Tue Nov 15 06:46:36 PST 2005
At 12:49 PM 11/14/2005, Nitin Hande wrote:
>Michael Krause wrote:
>>At 01:01 PM 11/11/2005, Nitin Hande wrote:
>>
>>>Michael Krause wrote:
>>>
>>>>At 10:28 AM 11/9/2005, Rick Frank wrote:
>>>>
>>>>>Yes, the application is responsible for detecting lost msgs at the
>>>>>application level - the transport cannot do this.
>>>>>
>>>>>RDS does not guarantee that a message has been delivered to the
>>>>>application - just that once the transport has accepted a msg it will
>>>>>deliver the msg to the remote node in order without duplication -
>>>>>dealing with retransmissions, etc. due to sporadic / intermittent msg
>>>>>loss over the interconnect. If, after accepting the send, the current
>>>>>path fails, then RDS will transparently fail over to another path -
>>>>>and, if required, will resend / send any already queued msgs to the
>>>>>remote node - again ensuring that no msg is duplicated and they are in
>>>>>order. This is no different than APM - with the exception that RDS
>>>>>can do this across HCAs.
>>>>>
>>>>>The application - Oracle in this case - will deal with detecting a
>>>>>catastrophic path failure, either due to a send that does not arrive,
>>>>>a timed-out response, or a send failure returned from the
>>>>>transport. If there is no network path to a remote node, it is
>>>>>required that we remove the remote node from the operating cluster to
>>>>>avoid what is commonly termed a "split brain" condition - otherwise
>>>>>known as a "partition in time".
>>>>>
>>>>>BTW - in our case - the application failure-domain logic is the same
>>>>>whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc.
>>>>>Basically, if we cannot talk to a remote node after some defined
>>>>>period of time, we will remove the remote node from the cluster. In
>>>>>this case the database will recover all the interesting state that may
>>>>>have been maintained on the removed node, allowing the remaining
>>>>>nodes to continue. If, later on, communication to the remote node is
>>>>>restored, it will be allowed to rejoin the cluster and take on
>>>>>application load.
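
[Editorial sketch] To make the failure-domain logic above concrete, here is a
minimal sketch of the timeout-based eviction Rick describes. It assumes a
connected datagram socket (UDP, since the logic is transport-independent) and
hypothetical send_heartbeat() / evict_node() helpers and a 30-second limit;
this is illustration only, not Oracle's actual code.

/*
 * Minimal sketch of the application-level failure-domain logic described
 * above: if a peer has not been heard from within a defined period, remove
 * it from the cluster.  The UDP socket, the 30-second limit, and the
 * send_heartbeat()/evict_node() helpers are assumptions for illustration.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <time.h>

#define EVICT_TIMEOUT_SEC 30          /* the "defined period of time" (assumed) */

struct peer {
    struct sockaddr_in addr;          /* remote node address         */
    time_t             last_heard;    /* last message seen from peer */
    int                in_cluster;    /* 1 while peer is a member    */
};

static void send_heartbeat(int sock)
{
    static const char ping[] = "ping";
    /* Connected datagram socket assumed, so send() addresses the peer. */
    (void)send(sock, ping, sizeof(ping), 0);
}

static void evict_node(struct peer *p)
{
    /* Placeholder: real code would recover the state held on the node. */
    p->in_cluster = 0;
    fprintf(stderr, "evicting peer: no traffic for %d s\n", EVICT_TIMEOUT_SEC);
}

static void poll_peer(int sock, struct peer *p)
{
    char buf[256];

    send_heartbeat(sock);

    /* Non-blocking check for any traffic from the peer. */
    if (recv(sock, buf, sizeof(buf), MSG_DONTWAIT) > 0)
        p->last_heard = time(NULL);

    /* No network path to the remote node for too long: remove it from the
     * operating cluster to avoid a "split brain" condition. */
    if (p->in_cluster && time(NULL) - p->last_heard > EVICT_TIMEOUT_SEC)
        evict_node(p);
}

Whether the messages travel over UDP, uDAPL, or RDS, only the send/recv calls
change; the eviction decision stays at the application level, as Rick notes.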
>>>>
>>>>
>>>>Please clarify the following, which was in the document provided by Oracle.
>>>>On page 3 of the RDS document, under the section "RDP Interface", the
>>>>2nd and 3rd paragraphs state:
>>>> * RDP does not guarantee that a datagram is delivered to the remote
>>>> application.
>>>> * It is up to the RDP client to deal with datagrams lost due to
>>>> transport failure or remote application failure.
>>>>The HCA is still a fault domain with RDS - it does not address flushing
>>>>data out of the HCA fault domain, nor does it sound like it ensures
>>>>that CQE loss is recoverable.
>>>>I do believe RDS will replay all of the sendmsgs that it believes are
>>>>pending, but it has no way to determine whether already-sent sendmsgs
>>>>were actually delivered to the remote application unless it provides
>>>>some level of resync of the outstanding sends not completed from an
>>>>application's perspective, as well as of any state updated via RDMA
>>>>operations, which may occur without an explicit send operation to flush
>>>>to a known state.
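
[Editorial sketch] The kind of resync alluded to above could be as simple as
application-level sequence numbers, sketched below. The header layout and the
accept_msg() / first_seq_to_resend() names are assumptions, purely for
illustration, not anything defined by RDS today.

/*
 * Sketch of an application-level resync: every message carries an app
 * sequence number, so after an RDS replay the receiver can discard
 * duplicates and the sender can learn, from the peer's last-consumed
 * number, which sends actually reached the remote application.
 */
#include <stdint.h>

struct app_msg_hdr {
    uint64_t seq;        /* sender-assigned, monotonically increasing */
    uint64_t ack_seq;    /* highest seq this side has consumed        */
};

/* Receiver: returns 1 if the message is new, 0 if it is a duplicate
 * re-delivered after a path failover / replay. */
static int accept_msg(uint64_t *last_seen, const struct app_msg_hdr *h)
{
    if (h->seq <= *last_seen)
        return 0;                 /* already delivered before the failure */
    *last_seen = h->seq;
    return 1;
}

/* Sender: after reconnecting, resend everything newer than the peer's
 * advertised ack_seq; anything at or below it is known to be delivered. */
static uint64_t first_seq_to_resend(uint64_t peer_ack_seq)
{
    return peer_ack_seq + 1;
}

State updated purely via RDMA writes would still need an explicit send (or an
equivalent flush) to bring it under the same sequence space, which is the gap
noted above.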
>>>
>>>If RDS could define a mechanism that the application could use to inform
>>>the sender to resync and replay on a catastrophic failure, would that be a
>>>correct understanding of your suggestion?
>>
>>I'm not suggesting anything at this point. I'm trying to reconcile the
>>documentation with the e-mail statements made by its proponents.
>>
>>>
>>>>I'm still trying to ascertain whether RDS completely
>>>>recovers from HCA failure (assuming there is another HCA / path
>>>>available) between the two endnodes.
>>>
>>>Reading the doc and the thread, it looks like we need src/dst ports
>>>for multiplexing connections, we need seq/ack #s for resyncing, and we need
>>>some kind of window availability for flow control. Aren't we very close
>>>to a TCP header?
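
[Editorial sketch] Laid out side by side, the fields listed above do look very
much like a TCP header. The struct below is a purely hypothetical layout, not
a real RDS wire format, shown only to make the overlap visible.

/*
 * Purely hypothetical layout of the fields listed above - not a real RDS
 * wire format - shown only to illustrate the overlap with TCP's header.
 */
#include <stdint.h>

struct rds_like_hdr {             /* TCP header counterpart              */
    uint16_t src_port;            /* source port                         */
    uint16_t dst_port;            /* destination port                    */
    uint32_t seq;                 /* sequence number (resync)            */
    uint32_t ack;                 /* acknowledgement number (resync)     */
    uint16_t window;              /* advertised window (flow control)    */
};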
>>
>>TCP, as implemented by most OSes, does not provide an end-to-end
>>guarantee to the application. Unless one ties the TCP ACK to the
>>application's consumption of the receive data, there is no method to
>>ascertain that the application really received the data. The application
>>would be required to send its own application-level acknowledgement. I
>>believe the intent is for applications to remain responsible for the
>>end-to-end receipt of data, and for RDS and the interconnect to be simply
>>responsible for the exchange at the lower levels.
>Yes, a TCP ACK only implies that the remote stack has received the data; it
>means nothing to the application. It is the application which has to send an
>application-level ack to its peer.
TCP ACK was intended to be an end-to-end ACK, but implementations reduced it
to a lower-level ACK only. A TCP stack linked into an application, as
demonstrated by multiple IHVs and research efforts, does provide an end-to-end
ACK and considerable performance improvements over the traditional network
stack implementations. Some claim that is more than good enough to eliminate
the need for protocol off-load / RDMA, which is true for many applications
(certainly for most Sockets-based ones) but not true when one takes advantage
of the RDMA comms paradigm, which has benefit for a number of applications.
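
[Editorial sketch] As a concrete illustration of the point above, here is a
minimal receiver over an ordinary TCP connection: the kernel's TCP ACK has
already gone out by the time read() returns, so the application sends its own
acknowledgement only after it has actually consumed the data. The
process_request() stub and the 0x06 ack byte are assumptions for illustration.

/*
 * Minimal sketch of an application-level acknowledgement over TCP.  The
 * kernel has already ACKed the segments before read() returns, so the
 * receiver emits its own ack byte only after the data has been processed.
 */
#include <unistd.h>

static void process_request(const char *buf, ssize_t len)
{
    (void)buf; (void)len;         /* stand-in for the application's real work */
}

static void serve_one(int conn_fd)
{
    char buf[4096];
    ssize_t n = read(conn_fd, buf, sizeof(buf));    /* TCP ACK already sent   */
    if (n <= 0)
        return;

    process_request(buf, n);                        /* data truly consumed    */

    const char app_ack = 0x06;                      /* explicit app-level ack */
    (void)write(conn_fd, &app_ack, 1);
}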
Mike