[ofa-general] IBV_WC_RETRY_EXC_ERR causes

Dotan Barak dotanba at gmail.com
Sat Jun 21 06:23:33 PDT 2008


Krishnamoorthy, Sriram wrote:
>> IBV_WC_RETRY_EXC_ERR means that there wasn't any ack by the receiver
>> after 4.096 * (2 power 18) * 7 usec.
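
To put a number on that timeout, here is a minimal sketch of the arithmetic
in plain C. It assumes, as in the formula quoted above, that one ACK timeout
period is 4.096 usec * 2^timeout and that the total is that period times
retry_cnt. The function names are hypothetical (they are not part of
libibverbs), and the result is only an estimate of when the HCA reports the
error, not an exact bound.

```c
#include <stdint.h>

/* Nominal ACK timeout in microseconds for a QP "timeout" exponent:
 * 4.096 usec * 2^timeout. A value of 0 has a special meaning in the
 * verbs API and is not handled by this sketch. */
static double ack_timeout_usec(uint8_t timeout)
{
    return 4.096 * (double)(1ULL << timeout);
}

/* Rough time until IBV_WC_RETRY_EXC_ERR, following the formula in this
 * thread: one timeout period per retry. */
static double retry_exc_estimate_usec(uint8_t timeout, uint8_t retry_cnt)
{
    return ack_timeout_usec(timeout) * retry_cnt;
}
```

With timeout = 18 and retry_cnt = 7, one period is about 1.07 seconds, so
the error shows up after roughly 7.5 seconds without an ack.
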
> Does an ack from the receiver require the process/thread to be awake? I
> have been trying to get a small test case, and sleeping without posting
> enough recv-s seems to occasionally result in IBV_WC_RETRY_EXC_ERR
> (instead of IBV_WC_RNR_RETRY_EXC_ERR which occurs a lot more often, and
> of course with much smaller timeout, retry_count, and rnr_retry_count). 
>   
No, the ack is handled at the HCA level (unless the IB transport is
implemented in SW...).
If putting a sleep in your code causes IBV_WC_RETRY_EXC_ERR, I would
suspect SW bugs ...

>> It can happen for several reasons:
>> 1) bad QP attributes
>> 2) the remote side doesn't exist or is in a bad state
>> 3) rarely, congestion in the network can cause this too
>>     
>
>   
>> 7 means infinite retry only for the RNR flow; for the retry flow, 7 is the
>> number of retransmissions.
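
The asymmetry between the two counters can be written down as a small sketch
in plain C. The helper names and the -1 "infinite" sentinel are hypothetical,
chosen only for illustration; they are not part of libibverbs.

```c
#include <stdint.h>

#define RETRIES_INFINITE (-1)

/* rnr_retry: 7 is special and means "retry forever" after an RNR NAK. */
static int rnr_retry_budget(uint8_t rnr_retry)
{
    return rnr_retry == 7 ? RETRIES_INFINITE : (int)rnr_retry;
}

/* retry_cnt: no special value; 7 simply means 7 retransmissions. */
static int retry_budget(uint8_t retry_cnt)
{
    return (int)retry_cnt;
}
```

So a QP configured with rnr_retry = 7 and retry_cnt = 7 never gives up on
RNR, but still fails with IBV_WC_RETRY_EXC_ERR after 7 unanswered
retransmissions.
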
>
>   
>> How do you connect both sides?
>> Maybe the sender sends messages to a QP that wasn't transitioned to
>> (at least) the RTR state?
>
> All queue pairs are transitioned into RTS state before any
> communication. All queue pairs are transitioned to RTR state, then there
> is an MPI barrier (which could be using its own queue pairs or sockets),
> and then all queue pairs are transitioned into RTS state.
>   
Good, I was afraid of a race where one side starts to send messages
while the other side isn't yet in the RTR state.
> All error messages out of verbs API are checked. Is it possible for a
> queue pair to transition into an error state and it is identified first
> as an IBV_WC_RETRY_EXC_ERR and not as a local error?
>   

Theoretically: no.
Is this the first message being passed on those QPs?
Can you check the QP state of the remote side when you get such an error?

> Thanks,
> Sriram.K
>   
You are welcome
Dotan


