[ofa-general] What is the size of async event queue ?
Or Gerlitz
ogerlitz at voltaire.com
Tue Mar 6 00:00:29 PST 2007
Tang, Changqing wrote:
>> I want to understand what is the exact feature you need.
> I want our MPI code to survive connection loss or a peer
> process/machine crash. The process should detect any IB error, then
> clean up that connection, use only healthy connections, and possibly make
> new connections.
Again, note that any attempt to use a "non healthy" connection would
end up with a notification of the problem (completion with error etc).
OK. There are quite a few cases here... thinking out loud, if you want to go
the simplest way, a keep-alive protocol based on zero-length RDMA writes
seems to catch them all.
If you want to avoid the traffic overhead incurred by such a protocol,
and you are willing to take a less simple approach, I suggest defining
exactly which cases you want to handle and what the expected
action is after the local process realizes the conn is lost, e.g.
case                    expected action
remote process crash
remote process hang
remote machine crash
remote machine hang
etc etc etc
and then see what approach can work.
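
One way to keep that case/action table honest is to write it down as data. The sketch below is plain C with hypothetical names throughout (failure_case, recovery_action, policy_set and so on). The actions are deliberately left undecided, since the table above does not fill them in; the two example actions in the enum are placeholders only.

```c
#include <assert.h>
#include <stddef.h>

/* The cases listed in the table above. */
enum failure_case {
    REMOTE_PROCESS_CRASH,
    REMOTE_PROCESS_HANG,
    REMOTE_MACHINE_CRASH,
    REMOTE_MACHINE_HANG,
    FAILURE_CASE_COUNT
};

/* Placeholder actions; the real set is for the application to define. */
enum recovery_action {
    ACTION_UNDECIDED,
    ACTION_DROP_CONNECTION,
    ACTION_RECONNECT
};

struct policy_row {
    const char *name;
    enum recovery_action action;
};

/* Every case starts undecided on purpose. */
static struct policy_row policy[FAILURE_CASE_COUNT] = {
    [REMOTE_PROCESS_CRASH] = { "remote process crash", ACTION_UNDECIDED },
    [REMOTE_PROCESS_HANG]  = { "remote process hang",  ACTION_UNDECIDED },
    [REMOTE_MACHINE_CRASH] = { "remote machine crash", ACTION_UNDECIDED },
    [REMOTE_MACHINE_HANG]  = { "remote machine hang",  ACTION_UNDECIDED },
};

void policy_set(enum failure_case fc, enum recovery_action action)
{
    assert((unsigned)fc < FAILURE_CASE_COUNT);
    policy[fc].action = action;
}

enum recovery_action policy_get(enum failure_case fc)
{
    assert((unsigned)fc < FAILURE_CASE_COUNT);
    return policy[fc].action;
}

const char *policy_name(enum failure_case fc)
{
    assert((unsigned)fc < FAILURE_CASE_COUNT);
    return policy[fc].name;
}

/* Returns 1 once every listed case has a decided action, 0 otherwise. */
int policy_is_complete(void)
{
    for (size_t i = 0; i < FAILURE_CASE_COUNT; i++) {
        if (policy[i].action == ACTION_UNDECIDED)
            return 0;
    }
    return 1;
}
```

Checking policy_is_complete() at startup would flag any case that still has no agreed action before a detection approach is chosen.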
Or.