[ofa-general] IBV_WC_RETRY_EXC_ERR causes

Krishnamoorthy, Sriram sriram at pnl.gov
Fri Jun 20 15:29:06 PDT 2008


>IBV_WC_RETRY_EXC_ERR means that there wasn't any ack from the receiver
>after 4.096 * (2^18) * 7 usec.
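
For reference, the arithmetic in the quoted formula can be checked with a few lines of Python. This is only a sketch of the formula as quoted (4.096 usec * 2^timeout * retry count); the function names are made up for illustration and are not part of the verbs API.

```python
# Sketch of the retry budget from the quoted formula.
# ack_timeout_usec and retry_budget_sec are hypothetical helper names.

BASE_USEC = 4.096  # base unit of the ACK timeout, in microseconds


def ack_timeout_usec(timeout_exp):
    """Per-attempt ACK timeout: 4.096 usec * 2**timeout_exp."""
    return BASE_USEC * (2 ** timeout_exp)


def retry_budget_sec(timeout_exp, retry_cnt):
    """Total wait per the quoted formula, converted to seconds."""
    return ack_timeout_usec(timeout_exp) * retry_cnt / 1e6


if __name__ == "__main__":
    # timeout = 18 and retry count = 7, as in the quoted message
    print(f"{retry_budget_sec(18, 7):.2f} s")  # prints "7.52 s"
```

So with timeout = 18 and a retry count of 7, the sender waits roughly 7.5 seconds in total before the error is reported.
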
Does an ack from the receiver require the process/thread to be awake? I
have been trying to build a small test case, and sleeping without posting
enough recvs seems to occasionally result in IBV_WC_RETRY_EXC_ERR
(instead of IBV_WC_RNR_RETRY_EXC_ERR, which occurs a lot more often). The
test case of course uses much smaller timeout, retry_count, and
rnr_retry_count values.

>It can happen for several reasons:
>1) bad QP attributes
>2) the remote side doesn't exist or is in a bad state
>3) rare, but congestion in the network can cause this too

>7 means infinite retry only for the RNR flow; for the retry flow, 7 is the
>number of retransmissions.
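
The distinction in the quoted text can be written down as a small Python sketch. The function name effective_retries is hypothetical; the only facts it encodes are the ones stated above (7 is a literal count for the retry flow and means "retry forever" for the RNR flow), plus the assumption that both values are 3-bit fields in the range 0..7.

```python
import math


def effective_retries(retry_cnt, rnr_retry):
    """Return (transport retries, RNR retries) implied by the two settings.

    Assumes both values are 3-bit fields (0..7). For rnr_retry, 7 is
    treated as "infinite"; for retry_cnt, 7 is just seven retransmissions.
    """
    for name, value in (("retry_cnt", retry_cnt), ("rnr_retry", rnr_retry)):
        if not 0 <= value <= 7:
            raise ValueError(f"{name} must be in 0..7, got {value}")
    rnr = math.inf if rnr_retry == 7 else rnr_retry
    return retry_cnt, rnr


if __name__ == "__main__":
    print(effective_retries(7, 7))  # prints "(7, inf)"
```
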

>How do you connect the two sides?
>Maybe the sender sends messages to a QP that wasn't transitioned to (at
>least) the RTR state?

All queue pairs are transitioned into the RTS state before any
communication. Specifically, all queue pairs are first transitioned to
the RTR state, then there is an MPI barrier (which could be using its own
queue pairs or sockets), and then all queue pairs are transitioned into
the RTS state.

All error codes returned by the verbs API are checked. Is it possible for
a queue pair to transition into an error state and for that to be
reported first as an IBV_WC_RETRY_EXC_ERR rather than as a local error?

Thanks,
Sriram.K


