[ofa-general] What is the size of async event queue ?

Or Gerlitz ogerlitz at voltaire.com
Tue Mar 6 00:00:29 PST 2007


Tang, Changqing wrote:
>> I want to understand exactly what feature you need.

> I want our MPI code to survive connection loss or a peer
> process/machine crash. The process should detect any IB error, then
> clean up that connection, use only healthy connections, and possibly
> make new connections.

Again, note that any attempt to use an unhealthy connection would
end with a notification of the problem (a completion with error, etc.).

OK. There are quite a few cases here... Thinking out loud: if you want to
go the simplest way, a zero-length RDMA write keep-alive protocol seems to
catch them all.
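
To make the keep-alive idea concrete, here is a minimal sketch of the
bookkeeping only, not a definitive implementation. The function name
sweep_connections and the probe callback are hypothetical. The probe is
assumed to post one signaled zero-length RDMA write to the peer and return
True only if its completion reports success (with libibverbs that would
roughly be ibv_post_send with opcode IBV_WR_RDMA_WRITE and num_sge = 0,
then ibv_poll_cq and a check for IBV_WC_SUCCESS). Python is used here just
to keep the sketch short.

```python
from typing import Callable, Dict, Iterable, List


def sweep_connections(
    peers: Iterable[str],
    probe: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Probe every peer once and split them into healthy and lost.

    `probe` is a hypothetical stand-in for "post a zero-length RDMA write
    and wait for its completion"; it returns True on a successful
    completion and False on a completion with error.
    """
    result: Dict[str, List[str]] = {"healthy": [], "lost": []}
    for peer in peers:
        try:
            ok = probe(peer)
        except OSError:
            # Assumption: failing to post the probe at all also counts
            # as a lost connection.
            ok = False
        result["healthy" if ok else "lost"].append(peer)
    return result


if __name__ == "__main__":
    # Fake probe for illustration: pretend the connection to "rank2" broke.
    print(sweep_connections(["rank1", "rank2", "rank3"], lambda p: p != "rank2"))
```

The caller would run such a sweep periodically, clean up the connections
reported as lost, and keep using the healthy ones. How often to sweep is
the traffic-overhead tradeoff mentioned below.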

If you want to avoid the traffic overhead incurred by such a protocol,
and you are willing to take a less simple approach, I suggest defining
exactly which cases you want to handle and what the expected action is
after the local process realizes the connection is lost, e.g.


	case			expected action
remote process crash
remote process hang
remote machine crash
remote machine hang
etc etc etc

and then see what approach can work.

Or.






