[openib-general] Re: RFD: uverbs and hotplug

Michael Krause krause at cup.hp.com
Wed May 4 08:46:07 PDT 2005


At 04:44 AM 5/4/2005, Michael S. Tsirkin wrote:


>The problem is not that hard. The driver must provide sufficient
>information to discover that the device went down and which requests
>were serviced before the device went down.
>
>Libraries such as MPI will be able to use that to resend these requests on
>a different HCA or after an HCA is plugged back in.
>
>Applications will simply use this library and be happy.

The operating assumption here is that applications or middleware can simply 
restart.  That is a nice theory but not always practical, especially when 
considering the service usage model.  Some applications, on seeing a failure, 
may deem that a cluster partitioning event has occurred and initiate 
membership quorum algorithms, etc.  Others might invoke DR (disaster 
recovery) algorithms and fail over to alternate systems or sites.  Various 
vendors have worked on this problem for many years and it still isn't fully 
solved.  Do not think that simple resource recovery and restart is an 
acceptable answer to customers.  It is the minimum any subsystem or device 
should support, but that does not translate automatically into a viable 
solution.
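
To make the distinction concrete, here is a minimal sketch of the two views: 
the library-level "resend on another HCA" approach from the quoted text, and 
an application that instead wants to choose its own reaction to the failure.  
Everything in it is hypothetical and for illustration only: FakeHCA, Policy 
and send_all are invented names, not part of uverbs or any real verbs 
interface, and a real driver would have to report which requests completed 
rather than raise a Python exception.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class Policy(Enum):
    RESEND = auto()    # replay unserviced requests on another HCA
    REQUORUM = auto()  # treat the failure as a possible cluster partition
    FAILOVER = auto()  # hand off to an alternate system or site


@dataclass
class FakeHCA:
    """Hypothetical stand-in for an HCA; not a real verbs interface."""
    name: str
    fail_after: Optional[int] = None  # go down after this many requests
    serviced: list = field(default_factory=list)

    def post(self, request):
        # Simulate the device going down once it has serviced fail_after requests.
        if self.fail_after is not None and len(self.serviced) >= self.fail_after:
            raise ConnectionError(f"{self.name} went down")
        self.serviced.append(request)


def send_all(requests, primary, backup, policy):
    """Post requests on primary; if it fails, apply the caller's policy.

    Returns (policy_applied, requests_still_pending).
    """
    for i, req in enumerate(requests):
        try:
            primary.post(req)
        except ConnectionError:
            # Everything from index i onward was not serviced before the failure.
            pending = requests[i:]
            if policy is Policy.RESEND:
                for r in pending:
                    backup.post(r)
                return policy, []
            # Quorum and failover decisions belong to the application,
            # so only report what was left undone.
            return policy, pending
    return None, []
```

With Policy.RESEND the pending requests are simply replayed on the backup 
device.  With the other policies the sketch deliberately does nothing more 
than hand the unserviced requests back, since what happens next (membership 
quorum, site failover) is the application's decision and not something the 
transport layer can settle for it.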

Mike  