[openib-general] [ANNOUNCE] Contribute RDS(ReliableDatagramSockets) to OpenIB
Nitin Hande
Nitin.Hande at Sun.COM
Mon Nov 14 12:49:44 PST 2005
Michael Krause wrote:
> At 01:01 PM 11/11/2005, Nitin Hande wrote:
>
>> Michael Krause wrote:
>>
>>> At 10:28 AM 11/9/2005, Rick Frank wrote:
>>>
>>>> Yes, the application is responsible for detecting lost msgs at the
>>>> application level - the transport can not do this.
>>>>
>>>> RDS does not guarantee that a message has been delivered to the
>>>> application - just that once the transport has accepted a msg it
>>>> will deliver the msg to the remote node in order without duplication
>>>> - dealing with retransmissions, etc due to sporadic / intermittent
>>>> msg loss over the interconnect. If after accepting the send - the
>>>> current path fails - then RDS will transparently fail over to
>>>> another path - and if required will resend / send any already queued
>>>> msgs to the remote node - again ensuring that no msg is duplicated
>>>> and they are in order. This is no different than APM - with the
>>>> exception that RDS can do this across HCAs.
>>>>
>>>> The application - Oracle in this case - will deal with detecting a
>>>> catastrophic path failure - either due to a send that does not
>>>> arrive and/or a timed-out response or send failure returned from the
>>>> transport. If there is no network path to a remote node - it is
>>>> required that we remove the remote node from the operating cluster
>>>> to avoid what is commonly termed as a "split brain" condition -
>>>> otherwise known as a "partition in time".
>>>>
>>>> BTW - in our case - the application failure domain logic is the same
>>>> whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc.
>>>> Basically, if we can not talk to a remote node - after some defined
>>>> period of time - we will remove the remote node from the cluster. In
>>>> this case the database will recover all the interesting state that
>>>> may have been maintained on the removed node - allowing the
>>>> remaining nodes to continue. If later on, communication to the
>>>> remote node is restored - it will be allowed to rejoin the cluster
>>>> and take on application load.
>>>
>>>
>>> Please clarify the following which was in the document provided by
>>> Oracle.
>>> On page 3 of the RDS document, under the section "RDP Interface", the
>>> 2nd and 3rd paragraphs state:
>>> * RDP does not guarantee that a datagram is delivered to the
>>> remote application.
>>> * It is up to the RDP client to deal with datagrams lost due to
>>> transport failure or remote application failure.
>>> The HCA is still a fault domain with RDS - it does not address
>>> flushing data out of the HCA fault domain, nor does it sound like it
>>> ensures that CQE loss is recoverable.
>>> I do believe RDS will replay all of the sendmsg's that it believes
>>> are pending, but it has no way to determine if already sent sendmsgs
>>> were actually successfully delivered to the remote application unless
>>> it provides some level of resync of the outstanding sends not
>>> completed from an application's perspective as well as any state
>>> updated via RDMA operations which may occur without an explicit send
>>> operation to flush to a known state.
>>
>> If RDS could define a mechanism that the application could use to
>> inform the sender to resync and replay on catastrophic failure, is
>> that a correct understanding of your suggestion ?
>
>
> I'm not suggesting anything at this point. I'm trying to reconcile the
> documentation with the e-mail statements made by its proponents.
>
>>> I'm still trying to ascertain whether RDS completely
>>> recovers from HCA failure (assuming there is another HCA / path
>>> available) between the two endnodes
>>
>> Reading the doc and the thread, it looks like we need src/dst ports
>> for multiplexing connections, seq/ack numbers for resyncing, and
>> some kind of window availability for flow control. Aren't we very
>> close to a TCP header?
>
>
> TCP, as implemented by most OSes, does not provide an end-to-end
> guarantee to the application. Unless one ties the TCP ACK to the
> application's consumption of the
> receive data, there is no method to ascertain that the application
> really received the data. The application would be required to send
> its own application-level acknowledgement. I believe the intent is for
> applications to remain responsible for the end-to-end receipt of data
> and that RDS and the interconnect are simply responsible for the
> exchange at the lower levels.
Yes, a TCP ack only implies that the remote TCP stack has received the
data; it means nothing to the application. It is the application which
has to send an application-level ack to its peer.
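
To make the distinction concrete, here is a minimal in-memory sketch of a
transport-level ack versus an application-level ack. It is only an
illustration under my own assumptions: the class and method names (Sender,
Receiver, transport_deliver, consume_one, on_app_ack) are hypothetical and
are not taken from RDS, TCP, or any real API. The point it tries to show is
that the sender keeps a message on its "unacked" list until the remote
application, not the remote transport, confirms it.

```python
# Sketch only: all names here (Sender, Receiver, consume_one, ...) are
# hypothetical and are not part of RDS, TCP, or any real API.
from collections import deque


class Receiver:
    """Remote node: a transport receive buffer plus the application on top."""

    def __init__(self):
        self.rx_buffer = deque()   # data the transport has accepted
        self.consumed = []         # data the application has really processed

    def transport_deliver(self, seq, payload):
        # The transport-level ack is generated here, when the data lands in
        # the receive buffer. The application has not looked at it yet.
        self.rx_buffer.append((seq, payload))
        return "transport-ack"

    def consume_one(self):
        # The application consumes one message and returns its own
        # application-level ack carrying the sequence number.
        seq, payload = self.rx_buffer.popleft()
        self.consumed.append(payload)
        return seq


class Sender:
    def __init__(self, receiver):
        self.receiver = receiver
        self.next_seq = 0
        self.unacked = {}          # seq -> payload, awaiting an application ack

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        # The transport ack returned here is deliberately ignored.
        self.receiver.transport_deliver(seq, payload)
        return seq

    def on_app_ack(self, seq):
        # Only now may the sender treat the message as received end to end.
        self.unacked.pop(seq, None)


rx = Receiver()
tx = Sender(rx)
tx.send(b"block 1")
tx.send(b"block 2")
pending_before = sorted(tx.unacked)   # both transport-acked, neither consumed
tx.on_app_ack(rx.consume_one())       # application processes "block 1"
pending_after = sorted(tx.unacked)    # "block 2" is still unconfirmed
```

If the receiving node died after the second transport_deliver() but before
the application consumed it, the sender in this sketch would still hold
"block 2" as unacked and could replay it or report the loss, which is the
responsibility the thread assigns to the application.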
Nitin
>
> Mike