[ofa-general] What is the size of async event queue ?

Mon Mar 5 08:28:31 PST 2007

> 
> CQ,
> 
> I want to understand exactly what feature you need.

I want our MPI code to survive connection loss or a peer
process/machine crash. The process should detect any IB error, clean
up the affected connection, keep using only the healthy connections,
and possibly make new connections.

If the error is global to this process, not just to a single connection,
then we just abort this process.
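
The policy above (drop a single broken connection, abort on a
process-wide error) can be sketched in C roughly as follows. This is
only a minimal illustration: the enums, struct and function names are
hypothetical and not part of any IB library. In a real libibverbs
program the error scope would presumably be derived from
work-completion statuses such as IBV_WC_RETRY_EXC_ERR and from async
events such as IBV_EVENT_DEVICE_FATAL.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PEERS 8

/* Hypothetical classification of an error: none, limited to one
 * connection, or affecting the whole process. */
enum err_scope { ERR_NONE, ERR_CONNECTION, ERR_GLOBAL };

enum conn_state { CONN_UNUSED, CONN_HEALTHY, CONN_DEAD };

/* What the caller should do after an error has been handled. */
enum action { ACT_CONTINUE, ACT_DROP_PEER, ACT_ABORT };

struct conn_table {
    enum conn_state state[MAX_PEERS];
};

/* Mark the first npeers slots healthy and the rest unused. */
void conn_table_init(struct conn_table *t, size_t npeers)
{
    assert(npeers <= MAX_PEERS);
    for (size_t i = 0; i < MAX_PEERS; i++)
        t->state[i] = (i < npeers) ? CONN_HEALTHY : CONN_UNUSED;
}

/* Apply the policy: a per-connection error retires only that peer,
 * a global error tells the caller to abort the process. */
enum action handle_error(struct conn_table *t, size_t peer,
                         enum err_scope scope)
{
    switch (scope) {
    case ERR_NONE:
        return ACT_CONTINUE;
    case ERR_CONNECTION:
        assert(peer < MAX_PEERS);
        t->state[peer] = CONN_DEAD;
        return ACT_DROP_PEER;
    case ERR_GLOBAL:
    default:
        return ACT_ABORT;
    }
}

/* Number of connections that can still be used for traffic. */
size_t healthy_count(const struct conn_table *t)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_PEERS; i++)
        if (t->state[i] == CONN_HEALTHY)
            n++;
    return n;
}
```

On ACT_DROP_PEER the caller would clean up that peer's resources and
could try to reconnect; on ACT_ABORT it would simply exit.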

--CQ

> 
> For example, with TCP the equivalent would be that, 
> following a remote process crash, the remote node's TCP stack 
> closes the TCP connections, and whenever the local process 
> attempts to use the socket it gets an errno indicating that 
> the connection was closed?
> 
> Since you use RC QPs, --if-- you attempt a post_send (or 
> RDMA) on a QP whose connected peer QP is not responding, you 
> will get a CQ completion with a "retry exceeded" error.
> 
> If the above case (notification following post send) is not 
> enough, the IB CM, which you can use through libibcm or 
> librdmacm, provides the same functionality (it sends a DREQ if 
> the process crashes), with the distinction that over TCP the 
> same primitive (the socket) is used for both connection 
> management and data transfer, whereas over IB the QP is used 
> for data and the IB CM Id (or the RDMA CM Id) is used for 
> connection management.
> 
> Combining possibilities: if you want to get a notification on 
> every peer process crash, you would need to either 
> poll/select the libibcm/librdmacm event queue once in a while 
> or implement a keep-alive protocol of your own. For 
> instance, I think the IB spec mentions doing a zero-length RDMA 
> write once in a while as a means of implementing such a protocol.
> 
> Or.
>
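
A keep-alive of the kind suggested above could be driven by a small
per-peer timer. The sketch below is only an illustration of the
bookkeeping: all names are hypothetical, time is passed in by the
caller, and the probe itself is left out. With libibverbs the probe
would presumably be a zero-length RDMA write posted with
ibv_post_send(), and its success or failure would be read from the
work completion returned by ibv_poll_cq().

```c
#include <stdbool.h>
#include <stdint.h>

enum peer_status { PEER_ALIVE, PEER_DEAD };

/* Hypothetical per-peer keep-alive state. */
struct keepalive {
    uint64_t interval_ms;      /* idle time after which a probe is sent */
    uint64_t last_traffic_ms;  /* time of the last successful completion */
    bool probe_in_flight;      /* a probe was posted and has not completed */
    enum peer_status status;
};

void keepalive_init(struct keepalive *k, uint64_t interval_ms,
                    uint64_t now_ms)
{
    k->interval_ms = interval_ms;
    k->last_traffic_ms = now_ms;
    k->probe_in_flight = false;
    k->status = PEER_ALIVE;
}

/* Call on any successful completion for this peer: normal traffic
 * already proves the peer is alive, so the idle timer restarts. */
void keepalive_note_traffic(struct keepalive *k, uint64_t now_ms)
{
    k->last_traffic_ms = now_ms;
}

/* Returns true when the caller should post a probe now. At most one
 * probe is outstanding, and a dead peer is never probed. */
bool keepalive_should_probe(struct keepalive *k, uint64_t now_ms)
{
    if (k->status == PEER_DEAD || k->probe_in_flight)
        return false;
    if (now_ms - k->last_traffic_ms < k->interval_ms)
        return false;
    k->probe_in_flight = true;
    return true;
}

/* Call when the probe's completion arrives; ok is false for an error
 * completion such as "retry exceeded". */
void keepalive_probe_done(struct keepalive *k, bool ok, uint64_t now_ms)
{
    k->probe_in_flight = false;
    if (ok)
        k->last_traffic_ms = now_ms;
    else
        k->status = PEER_DEAD;
}
```

The process would call keepalive_should_probe() for each peer from its
progress loop and treat PEER_DEAD like any other per-connection error.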