[openib-general] [ANNOUNCE] Contribute RDS (ReliableDatagramSockets) to OpenIB

Ranjit Pandit rpandit at silverstorm.com
Tue Nov 8 12:33:35 PST 2005


> Mike wrote:
>  - RDS does not solve a set of failure models.  For example, if a RNIC / HCA
> were to fail, then one cannot simply replay the operations on another RNIC /
> HCA without extracting state, etc. and providing some end-to-end sync of
> what was really sent / received by the application.  Yes, one can recover
> from cable or switch port failure by using APM style recovery but that is
> only one class of faults.  The harder faults either result in the end node
> being cast out of the cluster or see silent data corruption unless
> additional steps are taken to transparently recover - again app writers
> don't want to solve the hard problems; they want that done for them.

The current reference implementation of RDS handles the HCA failure case as well.
Since applications don't need to keep connection state, it's easier
to handle cases like HCA and intermediate path failures.
As far as the application is concerned, every sendmsg 'could' result in a
new connection setup in the driver.
If the current path fails, RDS reestablishes the connection on a
different port or a different HCA, if one is available, and replays the
failed messages.
APM is not useful for this because it doesn't provide failover across HCAs.
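
To make the failover-and-replay idea concrete, here is a rough sketch in Python. It is only a toy model of the behaviour described above, not the RDS driver code: the names Path, PathDown, FailoverSender and ack() are made up for illustration, and a "path" simply stands for one (HCA, port) pair.

```python
from collections import deque


class PathDown(Exception):
    """Raised by a path when its HCA or port has failed (hypothetical)."""


class Path:
    """Hypothetical stand-in for one (HCA, port) pair."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.delivered = []  # messages pushed out over this path

    def send(self, msg):
        if not self.up:
            raise PathDown(self.name)
        self.delivered.append(msg)


class FailoverSender:
    """Keeps unacknowledged messages so they can be replayed on a new path."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.current = None
        self.unacked = deque()

    def _connect(self):
        # Pick the first path that is still up; the application never sees this.
        for path in self.paths:
            if path.up:
                self.current = path
                return
        raise ConnectionError("no usable path left")

    def sendmsg(self, msg):
        # Every send may trigger a connection setup behind the scenes.
        self.unacked.append(msg)
        if self.current is None:
            self._connect()
        try:
            self.current.send(msg)
        except PathDown:
            # Reconnect on another port/HCA and replay everything not yet acked.
            self._connect()
            for pending in self.unacked:
                self.current.send(pending)

    def ack(self, count=1):
        # Assumed hook: the peer confirmed the oldest `count` messages.
        for _ in range(count):
            self.unacked.popleft()
```

In this model a message that was sent but not yet acknowledged when the path died is sent again on the new path, so the receiving side would have to discard duplicates (for example with sequence numbers). That detail is my assumption for the sketch and is not spelled out above.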



