[openib-general] [ANNOUNCE] Contribute RDS ( ReliableDatagramSockets) to OpenIB

Wed Nov 9 08:46:37 PST 2005

________________________________

	From: openib-general-bounces at openib.org
[mailto:openib-general-bounces at openib.org] On Behalf Of Michael Krause
	Sent: Tuesday, November 08, 2005 1:08 PM
	To: Ranjit Pandit
	Cc: openib-general at openib.org
	Subject: Re: [openib-general] [ANNOUNCE] Contribute RDS (
ReliableDatagramSockets) to OpenIB

	At 12:33 PM 11/8/2005, Ranjit Pandit wrote:

		> Mike wrote:
		>  - RDS does not solve a set of failure models.  For example, if a RNIC / HCA
		> were to fail, then one cannot simply replay the operations on another RNIC /
		> HCA without extracting state, etc. and providing some end-to-end sync of
		> what was really sent / received by the application.  Yes, one can recover
		> from cable or switch port failure by using APM style recovery but that is
		> only one class of faults.  The harder faults either result in the end node
		> being cast out of the cluster or see silent data corruption unless
		> additional steps are taken to transparently recover - again app writers
		> don't want to solve the hard problems; they want that done for them.

		The current reference implementation of RDS solves the HCA failure
		case as well.
		Since applications don't need to keep connection state, it's easier
		to handle cases like HCA and intermediate path failures.
		As far as the application is concerned, every sendmsg 'could' result
		in a new connection setup in the driver.
		If the current path fails, RDS reestablishes a connection, if
		available, on a different port or a different HCA, and replays the
		failed messages.
		Using APM is not useful because it doesn't provide failover across HCAs.

	I think others may disagree about whether RDS solves the problem.  You
	have no way of knowing whether something was received into the other
	node's coherency domain without some intermediary or the application's
	involvement to see that the data arrived.  As such, you might see many
	hardware level acks occur and not know there is a real failure.  If an
	application takes any action assuming that send complete means it is
	delivered, then it is subject to silent data corruption.  Hence, RDS
	can replay to its heart's content but until there is an application or
	middleware level of acknowledgement, you have not solved the fault
	domain issues.  Some may be happy with this as they just cast out the
	endnode from the cluster / database but others see the loss of a server
	as a big deal so may not be happy to see this occur.  It really comes
	down to whether you believe losing a server is worthwhile just for a
	local failure event which is not fatal to the rest of the server.

	[cait] 

Applications should not infer anything from send completion other than
that their source buffer is no longer required for the transmit to complete.

That is the only assumption that can be supported in a transport neutral
way.

I'll also point out that even under InfiniBand the fact that a send or
write has completed does NOT guarantee that the remote peer has *noticed*
the data. The remote peer could fail *after* the data has been delivered
to it and before it has had a chance to act upon it. A well-designed
robust application should never rely on anything other than a peer ack
to indicate that the peer has truly taken ownership of transmitted
information.
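
To make the distinction concrete, here is a minimal sketch in Python of
a sender that treats send completion and peer ack differently. The class
and method names (AckTrackingSender, on_send_completion, on_peer_ack) are
hypothetical and are not part of RDS or any verbs API; the transport is
left out entirely so that only the bookkeeping is shown.

```python
from dataclasses import dataclass, field


@dataclass
class AckTrackingSender:
    """Hypothetical helper: tracks messages until the peer application acks them."""
    next_seq: int = 0
    pending: dict = field(default_factory=dict)  # seq -> payload awaiting a peer ack

    def send(self, payload: bytes) -> int:
        # Keep a private copy: after send completion the caller's source
        # buffer may be reused, yet the peer may not have acted on the data.
        seq = self.next_seq
        self.next_seq += 1
        self.pending[seq] = bytes(payload)
        return seq

    def on_send_completion(self, seq: int) -> None:
        # Deliberately does not touch `pending`: completion only means the
        # source buffer is no longer required for the transmit.
        pass

    def on_peer_ack(self, seq: int) -> None:
        # Only an application-level ack shows the peer took ownership.
        self.pending.pop(seq, None)

    def unacknowledged(self) -> list:
        # Messages still in doubt after a failure, in send order.
        return sorted(self.pending.items())
```

After a failure, whatever unacknowledged() returns is what the
application (or middleware) would have to replay or reconcile with the
peer; hardware-level completions alone never shrink that list.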

The essence of RDS, or any similar solution, is the delivery of messages
with datagram semantics reliably over point-to-point reliable connections.
So whatever reliability and fault-tolerance benefits the reliable
connections provide are inherited by the RDS layer. After that it is mostly
a matter of how you avoid head-of-line blocking problems when there is no
receive buffer. You don't want to send an RNR (or drop the DDP Segment
under iWARP) because *one* endpoint does not have available buffers. Other
than that, any reliable datagram service should be just as reliable as the
underlying RC service.
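
One possible way to picture the head-of-line blocking concern is the
Python sketch below. It is an illustration under my own assumptions, not
a description of how the RDS reference implementation behaves: the
ReceiveDemux class and its per-endpoint "parked" queue are hypothetical.
Messages arriving on a shared connection are sorted per endpoint, so an
endpoint with no posted buffers holds up only its own traffic.

```python
from collections import defaultdict, deque


class ReceiveDemux:
    """Hypothetical receive-side demultiplexer for one shared connection."""

    def __init__(self):
        self.buffers = defaultdict(int)     # endpoint -> posted receive buffers
        self.delivered = defaultdict(list)  # endpoint -> messages handed to the app
        self.parked = defaultdict(deque)    # endpoint -> messages waiting for a buffer

    def post_buffers(self, endpoint, count):
        # The application posts more receive buffers for one endpoint.
        self.buffers[endpoint] += count
        self._drain(endpoint)

    def on_message(self, endpoint, payload):
        # Queue per endpoint first so ordering within an endpoint is kept.
        self.parked[endpoint].append(payload)
        self._drain(endpoint)

    def _drain(self, endpoint):
        # Deliver as many parked messages as this endpoint has buffers for.
        queue = self.parked[endpoint]
        while queue and self.buffers[endpoint] > 0:
            self.buffers[endpoint] -= 1
            self.delivered[endpoint].append(queue.popleft())
```

In this sketch an endpoint without buffers never stalls the others; a
real implementation would also need to bound the parked queues and push
back on the sender for that endpoint only.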
