[openib-general] [ANNOUNCE] Contribute RDS (Reliable Datagram Sockets) to OpenIB
Michael Krause
krause at cup.hp.com
Tue Nov 15 06:46:36 PST 2005
At 12:49 PM 11/14/2005, Nitin Hande wrote:
>Michael Krause wrote:
>>At 01:01 PM 11/11/2005, Nitin Hande wrote:
>>
>>>Michael Krause wrote:
>>>
>>>>At 10:28 AM 11/9/2005, Rick Frank wrote:
>>>>
>>>>>Yes, the application is responsible for detecting lost msgs at the
>>>>>application level - the transport cannot do this.
>>>>>
>>>>>RDS does not guarantee that a message has been delivered to the
>>>>>application - just that once the transport has accepted a msg it will
>>>>>deliver the msg to the remote node in order without duplication -
>>>>>dealing with retransmissions, etc. due to sporadic / intermittent msg
>>>>>loss over the interconnect. If, after accepting the send, the current
>>>>>path fails, then RDS will transparently fail over to another path -
>>>>>and, if required, will resend / send any already queued msgs to the
>>>>>remote node - again ensuring that no msg is duplicated and they are in
>>>>>order. This is no different than APM - with the exception that RDS
>>>>>can do this across HCAs.
>>>>>
>>>>>The application - Oracle in this case - will deal with detecting a
>>>>>catastrophic path failure, either due to a send that does not arrive,
>>>>>a timed-out response, or a send failure returned from the
>>>>>transport. If there is no network path to a remote node, it is
>>>>>required that we remove the remote node from the operating cluster to
>>>>>avoid what is commonly termed a "split brain" condition - otherwise
>>>>>known as a "partition in time".
>>>>>
>>>>>BTW - in our case - the application failure-domain logic is the same
>>>>>whether we are using UDP / uDAPL / iTAPI / TCP / SCTP / etc.
>>>>>Basically, if we cannot talk to a remote node after some defined
>>>>>period of time, we will remove the remote node from the cluster. In
>>>>>this case the database will recover all the interesting state that may
>>>>>have been maintained on the removed node, allowing the remaining
>>>>>nodes to continue. If, later on, communication to the remote node is
>>>>>restored, it will be allowed to rejoin the cluster and take on
>>>>>application load.
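
[Editorial sketch] To make the failure-domain logic above concrete, here is a
minimal sketch of the timeout-based eviction Rick describes. It assumes a
connected datagram socket (UDP, since the logic is transport-independent) and
hypothetical send_heartbeat() / evict_node() helpers and a 30-second limit;
this is illustration only, not Oracle's actual code.

/*
 * Minimal sketch of the application-level failure-domain logic described
 * above: if a peer has not been heard from within a defined period, remove
 * it from the cluster.  The UDP socket, the 30-second limit, and the
 * send_heartbeat()/evict_node() helpers are assumptions for illustration.
 */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <time.h>

#define EVICT_TIMEOUT_SEC 30          /* the "defined period of time" (assumed) */

struct peer {
    struct sockaddr_in addr;          /* remote node address         */
    time_t             last_heard;    /* last message seen from peer */
    int                in_cluster;    /* 1 while peer is a member    */
};

static void send_heartbeat(int sock)
{
    static const char ping[] = "ping";
    /* Connected datagram socket assumed, so send() addresses the peer. */
    (void)send(sock, ping, sizeof(ping), 0);
}

static void evict_node(struct peer *p)
{
    /* Placeholder: real code would recover the state held on the node. */
    p->in_cluster = 0;
    fprintf(stderr, "evicting peer: no traffic for %d s\n", EVICT_TIMEOUT_SEC);
}

static void poll_peer(int sock, struct peer *p)
{
    char buf[256];

    send_heartbeat(sock);

    /* Non-blocking check for any traffic from the peer. */
    if (recv(sock, buf, sizeof(buf), MSG_DONTWAIT) > 0)
        p->last_heard = time(NULL);

    /* No network path to the remote node for too long: remove it from the
     * operating cluster to avoid a "split brain" condition. */
    if (p->in_cluster && time(NULL) - p->last_heard > EVICT_TIMEOUT_SEC)
        evict_node(p);
}

Whether the messages travel over UDP, uDAPL, or RDS, only the send/recv calls
change; the eviction decision stays at the application level, as Rick notes.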
>>>>
>>>>
>>>>Please clarify the following, which was in the document provided by Oracle.
>>>>On page 3 of the RDS document, under the section "RDP Interface", the
>>>>2nd and 3rd paragraphs state:
>>>> * RDP does not guarantee that a datagram is delivered to the remote
>>>> application.
>>>> * It is up to the RDP client to deal with datagrams lost due to
>>>> transport failure or remote application failure.
>>>>The HCA is still a fault domain with RDS - it does not address flushing
>>>>data out of the HCA fault domain, nor does it sound like it ensures
>>>>that CQE loss is recoverable.
>>>>I do believe RDS will replay all of the sendmsgs that it believes are
>>>>pending, but it has no way to determine whether already-sent sendmsgs
>>>>were actually delivered to the remote application unless it provides
>>>>some level of resync of the outstanding sends not completed from an
>>>>application's perspective, as well as of any state updated via RDMA
>>>>operations, which may occur without an explicit send operation to flush
>>>>to a known state.
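
[Editorial sketch] The kind of resync alluded to above could be as simple as
application-level sequence numbers, sketched below. The header layout and the
accept_msg() / first_seq_to_resend() names are assumptions, purely for
illustration, not anything defined by RDS today.

/*
 * Sketch of an application-level resync: every message carries an app
 * sequence number, so after an RDS replay the receiver can discard
 * duplicates and the sender can learn, from the peer's last-consumed
 * number, which sends actually reached the remote application.
 */
#include <stdint.h>

struct app_msg_hdr {
    uint64_t seq;        /* sender-assigned, monotonically increasing */
    uint64_t ack_seq;    /* highest seq this side has consumed        */
};

/* Receiver: returns 1 if the message is new, 0 if it is a duplicate
 * re-delivered after a path failover / replay. */
static int accept_msg(uint64_t *last_seen, const struct app_msg_hdr *h)
{
    if (h->seq <= *last_seen)
        return 0;                 /* already delivered before the failure */
    *last_seen = h->seq;
    return 1;
}

/* Sender: after reconnecting, resend everything newer than the peer's
 * advertised ack_seq; anything at or below it is known to be delivered. */
static uint64_t first_seq_to_resend(uint64_t peer_ack_seq)
{
    return peer_ack_seq + 1;
}

State updated purely via RDMA writes would still need an explicit send (or an
equivalent flush) to bring it under the same sequence space, which is the gap
noted above.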
>>>
>>>If RDS could define a mechanism that the application could use to inform
>>>the sender to resync and replay on a catastrophic failure, would that be a
>>>correct understanding of your suggestion?
>>
>>I'm not suggesting anything at this point. I'm trying to reconcile the
>>documentation with the e-mail statements made by its proponents.
>>
>>>
>>>>I'm still trying to ascertain whether RDS completely
>>>>recovers from HCA failure (assuming there is another HCA / path
>>>>available) between the two endnodes.
>>>
>>>Reading the doc and the thread, it looks like we need src/dst ports
>>>for multiplexing connections, we need seq/ack #s for resyncing, and we need
>>>some kind of window availability for flow control. Aren't we very close
>>>to a TCP header?
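
[Editorial sketch] Laid out side by side, the fields listed above do look very
much like a TCP header. The struct below is a purely hypothetical layout, not
a real RDS wire format, shown only to make the overlap visible.

/*
 * Purely hypothetical layout of the fields listed above - not a real RDS
 * wire format - shown only to illustrate the overlap with TCP's header.
 */
#include <stdint.h>

struct rds_like_hdr {             /* TCP header counterpart              */
    uint16_t src_port;            /* source port                         */
    uint16_t dst_port;            /* destination port                    */
    uint32_t seq;                 /* sequence number (resync)            */
    uint32_t ack;                 /* acknowledgement number (resync)     */
    uint16_t window;              /* advertised window (flow control)    */
};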
>>
>>TCP, as implemented by most OSes, does not provide an end-to-end
>>guarantee to the application. Unless one ties the TCP ACK to the
>>application's consumption of the receive data, there is no method to
>>ascertain that the application really received the data. The application
>>would be required to send its own application-level acknowledgement. I
>>believe the intent is for applications to remain responsible for the
>>end-to-end receipt of data, and for RDS and the interconnect to be simply
>>responsible for the exchange at the lower levels.
>Yes, a TCP ACK only implies that the remote stack has received the data; it
>means nothing to the application. It is the application which has to send an
>application-level ack to its peer.
TCP ACK was intended to be an end-to-end ACK, but implementations reduced it
to a lower-level ACK only. A TCP stack linked into an application, as
demonstrated by multiple IHVs and research efforts, does provide an end-to-end
ACK and considerable performance improvements over the traditional network
stack implementations. Some claim that is more than good enough to eliminate
the need for protocol off-load / RDMA, which is true for many applications
(certainly for most Sockets-based ones) but not true when one takes advantage
of the RDMA comms paradigm, which has benefit for a number of applications.
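
[Editorial sketch] As a concrete illustration of the point above, here is a
minimal receiver over an ordinary TCP connection: the kernel's TCP ACK has
already gone out by the time read() returns, so the application sends its own
acknowledgement only after it has actually consumed the data. The
process_request() stub and the 0x06 ack byte are assumptions for illustration.

/*
 * Minimal sketch of an application-level acknowledgement over TCP.  The
 * kernel has already ACKed the segments before read() returns, so the
 * receiver emits its own ack byte only after the data has been processed.
 */
#include <unistd.h>

static void process_request(const char *buf, ssize_t len)
{
    (void)buf; (void)len;         /* stand-in for the application's real work */
}

static void serve_one(int conn_fd)
{
    char buf[4096];
    ssize_t n = read(conn_fd, buf, sizeof(buf));    /* TCP ACK already sent   */
    if (n <= 0)
        return;

    process_request(buf, n);                        /* data truly consumed    */

    const char app_ack = 0x06;                      /* explicit app-level ack */
    (void)write(conn_fd, &app_ack, 1);
}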
Mike