[openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB

Nitin Hande Nitin.Hande at Sun.COM
Mon Nov 14 12:49:44 PST 2005


Michael Krause wrote:
> At 01:01 PM 11/11/2005, Nitin Hande wrote:
> 
>> Michael Krause wrote:
>>
>>> At 10:28 AM 11/9/2005, Rick Frank wrote:
>>>
>>>> Yes, the application is responsible for detecting lost msgs at the 
>>>> application level - the transport can not do this.
>>>>  
>>>> RDS does not guarantee that a message has been delivered to the 
>>>> application - just that once the transport has accepted a msg it 
>>>> will deliver the msg to the remote node in order without duplication 
>>>> - dealing with retransmissions, etc due to sporadic / intermittent 
>>>> msg loss over the interconnect. If after accepting the send - the 
>>>> current path fails - then RDS will transparently fail over to 
>>>> another path - and if required will resend / send any already queued 
>>>> msgs to the remote node - again ensuring that no msg is duplicated 
>>>> and they are in order.  This is no different than APM - with the 
>>>> exception that RDS can do this across HCAs.
>>>>  
>>>> The application - Oracle in this case - will deal with detecting a 
>>>> catastrophic path failure - either due to a send that does not 
>>>> arrive and/or a timed-out response or send failure returned from the 
>>>> transport. If there is no network path to a remote node - it is 
>>>> required that we remove the remote node from the operating cluster 
>>>> to avoid what is commonly termed as a "split brain" condition - 
>>>> otherwise known as a "partition in time".
>>>>  
>>>> BTW - in our case - the application failure domain logic is the same 
>>>> whether we are using UDP /  uDAPL / iTAPI / TCP / SCTP / etc. 
>>>> Basically, if we can not talk to a remote node - after some defined 
>>>> period of time - we will remove the remote node from the cluster. In 
>>>> this case the database will recover all the interesting state that 
>>>> may have been maintained on the removed node - allowing the 
>>>> remaining nodes to continue. If later on, communication to the 
>>>> remote node is restored - it will be allowed to rejoin the cluster 
>>>> and take on application load. 
>>>
>>>
>>> Please clarify the following which was in the document provided by 
>>> Oracle.
>>> On page 3 of the RDS document, under the section "RDP Interface", the 
>>> 2nd and 3rd paragraphs state:
>>>    * RDP does not guarantee that a datagram is delivered to the 
>>>      remote application.
>>>    * It is up to the RDP client to deal with datagrams lost due to 
>>>      transport failure or remote application failure.
>>> The HCA is still a fault domain with RDS - it does not address 
>>> flushing data out of the HCA fault domain, nor does it sound like it 
>>> ensures that CQE loss is recoverable.
>>> I do believe RDS will replay all of the sendmsg's that it believes 
>>> are pending, but it has no way to determine if already sent sendmsgs 
>>> were actually successfully delivered to the remote application unless 
>>> it provides some level of resync of the outstanding sends not 
>>> completed from an application's perspective as well as any state 
>>> updated via RDMA operations which may occur without an explicit send 
>>> operation to flush to a known state.  
>>
>> Is your suggestion that RDS should define a mechanism the application 
>> could use to tell the sender to resync and replay after a 
>> catastrophic failure?
> 
> 
> I'm not suggesting anything at this point. I'm trying to reconcile the 
> documentation with the e-mail statements made by its proponents.
> 
>>> I'm still trying to ascertain whether RDS completely recovers from 
>>> HCA failure (assuming there is another HCA / path available) between 
>>> the two endnodes
>>
>> Reading the doc and the thread, it looks like we need src/dst ports 
>> for multiplexing connections, seq/ack numbers for resyncing, and 
>> some kind of window advertisement for flow control. Aren't we very 
>> close to a TCP header?
> 
> 
> TCP, as implemented by most OSes, does not provide end-to-end delivery 
> confirmation to the application. Unless one ties the TCP ACK to the 
> application's consumption of the received data, there is no method to 
> ascertain that the application really received the data. The 
> application would be required to send its own application-level 
> acknowledgement. I believe the intent is for applications to remain 
> responsible for the end-to-end receipt of data and that RDS and the 
> interconnect are simply responsible for the exchange at the lower 
> levels.
Yes, a TCP ACK only implies that the receiving TCP stack has the data; 
it says nothing about whether the application has consumed it. It is 
the application that has to send an application-level ACK to its peer.
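
To make the point concrete, here is a minimal sketch of an application-level ACK over a datagram socket. It is not RDS code and not taken from the RDS document: the 4-byte sequence-number header, the function names, the timeout and retry values, and the use of plain UDP on loopback are all assumptions made for illustration. The receiver acknowledges a message only after the application has consumed it, and the sender treats a message as delivered only when that ACK comes back.

```python
import socket
import struct
import threading

# Hypothetical wire format: a 4-byte big-endian sequence number, then the payload.
HEADER = struct.Struct("!I")


def receiver(sock, consumed, expected):
    """Consume each message, then acknowledge it at the application level."""
    while len(consumed) < expected:
        data, peer = sock.recvfrom(2048)
        (seq,) = HEADER.unpack_from(data)
        # "Consumption" here is just storing the payload; keying by sequence
        # number means a retransmitted message is not applied twice.
        consumed[seq] = data[HEADER.size:]
        # The ACK is sent only after the application has the data.
        sock.sendto(HEADER.pack(seq), peer)


def send_with_app_ack(sock, dest, seq, payload, timeout=0.5, retries=5):
    """Send one message; return True only if the peer application ACKed it."""
    sock.settimeout(timeout)
    packet = HEADER.pack(seq) + payload
    for _ in range(retries):
        sock.sendto(packet, dest)
        try:
            data, _ = sock.recvfrom(2048)
        except socket.timeout:
            continue  # no ACK in time: retransmit
        if len(data) >= HEADER.size and HEADER.unpack_from(data)[0] == seq:
            return True
    return False  # caller decides what to do, e.g. declare the peer unreachable


def demo():
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    rx.bind(("127.0.0.1", 0))  # let the OS pick a free port
    rx.settimeout(5.0)         # do not block forever if something goes wrong
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    messages = [b"begin", b"update", b"commit"]
    consumed = {}
    worker = threading.Thread(target=receiver, args=(rx, consumed, len(messages)))
    worker.start()

    results = [
        send_with_app_ack(tx, rx.getsockname(), seq, msg)
        for seq, msg in enumerate(messages)
    ]

    worker.join()
    rx.close()
    tx.close()
    return results, consumed


if __name__ == "__main__":
    print(demo())
```

A False return from send_with_app_ack() is the point at which the application-level failure handling described earlier in the thread would take over; nothing the transport does underneath changes that.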

Nitin

> 
> Mike
> 
> 