<html>

<body>

<font size=3>At 04:44 AM 5/4/2005, Michael S. Tsirkin wrote:<br><br>

<br>

<blockquote type=cite class=cite cite="">The problem is not that hard.

The driver must provide sufficient<br>

information to discover that the device went down and which requests<br>

were serviced before the device went down.<br><br>

Libraries such as MPI will be able to use that to resend these requests

on<br>

a different HCA or after an HCA is plugged back in.<br><br>

Applications will simply use this library and be happy.</blockquote><br>

The operating assumption here is that applications or middleware can

simply restart.  That is a nice theory, but it is not always practical,

especially when considering the service usage model.  Some applications

that see a failure may conclude that a cluster partitioning event has occurred and

initiate membership quorum and similar algorithms.  Others might invoke disaster recovery (DR)

algorithms and fail over to alternate systems or sites.  This

problem has been worked on for many years by various vendors, and it still

isn't fully solved.  Do not assume that simple resource recovery and

restart is an acceptable answer for customers.  It is the minimum any

subsystem or device should support, but that does not translate

automatically into a viable solution.<br><br>

Mike </font></body>

</html>