[openib-general] CQ error handling in IPoIB

Moni Shoua monis at voltaire.com
Wed Jan 3 07:42:10 PST 2007


Todd Rimmer wrote:
>>From: Moni Shoua
>>Sent: Tuesday, January 02, 2007 11:31 AM
>>To: openib-general at openib.org
>>Subject: [openib-general] CQ error handling in IPoIB
>>
>>Hi,
>>I have a question regarding error handling in IPoIB.
>>
>>The spec says...
>>
>>When a CQ encounters an error, in order to be able to use the CQ again,
>>the consumer should:
>>* Destroy all the QPs that are attached to the CQ
>>* Destroy the CQ
>>* Recreate the CQ through the Create Completion Queue verb
>>
>>While (at least one part of) the code does...
>>
>>static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>>{
>>	...
>>	...
>>	...
>>        if (wc->status != IB_WC_SUCCESS &&
>>            wc->status != IB_WC_WR_FLUSH_ERR)
>>                ipoib_warn(priv, "failed send event "
>>                           "(status=%d, wrid=%d vend_err %x)\n",
>>                           wc->status, wr_id, wc->vendor_err);
>>}
>>
> 
> 
> In this context the spec is referring to CQ errors, not work request
> errors.  For example, CQ overflow is considered a CQ error and would
> require the procedure you describe above (destroy QPs, CQ, etc).
> 
> However a work request error is a WQE or QP error.  As such the CQ does
> not need to be destroyed.  Rather the recovery will be limited to QP
> level actions.  Typically the QP has moved to the error state and the QP
> must be reset and moved back to RTS to resume operation (or the QP must
> be destroyed and recreated).
> 
> If you check section 10.10.3.4 of IBTA 1.2 you will see a list of
> possible errors on a UD QP.  Notice that the errors all involve Local
> Protection or Operation errors.  Hence they cannot be caused by a remote
> node.  Rather, they are only caused by invalid local requests (by IPoIB
> in this case) or possibly by hardware or OS problems (memory stomps,
> multi-bit undetected memory or bus errors, HCA hardware problem, etc).
> 
> As you indicate, when such errors occur, the driver should recreate or
> reset the QP.
> 
> Todd Rimmer
> 

Thanks for the detailed answer.
I see my mistake in quoting that passage from the spec, but the question
was answered nonetheless.
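
For reference, the distinction Todd draws might be sketched roughly as
follows. This is only an illustration in plain C, not driver code: every
sketch_* / SKETCH_* name is a hypothetical stand-in rather than anything
from the kernel verbs API, and treating a flushed completion as needing no
new action is an assumption (it follows an earlier error that already put
the QP into the error state).

```c
#include <stdbool.h>

/* All names below are hypothetical stand-ins for illustration only;
 * they are not the kernel's ib_verbs.h definitions. */
enum sketch_wc_status {
	SKETCH_WC_SUCCESS,
	SKETCH_WC_WR_FLUSH_ERR,  /* WR flushed because the QP is already in error */
	SKETCH_WC_LOC_PROT_ERR,  /* local protection error */
	SKETCH_WC_LOC_QP_OP_ERR  /* local QP operation error */
};

enum sketch_recovery {
	SKETCH_RECOVER_NONE,
	SKETCH_RECOVER_QP,  /* reset the QP and bring it back to RTS, or recreate it */
	SKETCH_RECOVER_CQ   /* destroy attached QPs, destroy the CQ, recreate the CQ */
};

enum sketch_qp_state {
	SKETCH_QPS_RESET,
	SKETCH_QPS_INIT,
	SKETCH_QPS_RTR,
	SKETCH_QPS_RTS,
	SKETCH_QPS_ERR
};

/* A failed work completion is a WQE/QP error, so recovery stays at QP level. */
static enum sketch_recovery sketch_recovery_for_wc(enum sketch_wc_status status)
{
	switch (status) {
	case SKETCH_WC_SUCCESS:
	case SKETCH_WC_WR_FLUSH_ERR:  /* assumption: the earlier error is handled elsewhere */
		return SKETCH_RECOVER_NONE;
	default:
		return SKETCH_RECOVER_QP;
	}
}

/* An error on the CQ itself (e.g. overflow) calls for the full spec procedure. */
static enum sketch_recovery sketch_recovery_for_cq_error(bool cq_in_error)
{
	return cq_in_error ? SKETCH_RECOVER_CQ : SKETCH_RECOVER_NONE;
}

/* Next step when walking a QP from the error state back to RTS. */
static enum sketch_qp_state sketch_qp_next_state(enum sketch_qp_state state)
{
	switch (state) {
	case SKETCH_QPS_ERR:
		return SKETCH_QPS_RESET;
	case SKETCH_QPS_RESET:
		return SKETCH_QPS_INIT;
	case SKETCH_QPS_INIT:
		return SKETCH_QPS_RTR;
	default:
		return SKETCH_QPS_RTS;  /* RTR moves to RTS; RTS stays put */
	}
}
```

A completion handler would call sketch_recovery_for_wc() on each failed
completion, while the CQ case would be driven from wherever the CQ error
is reported. Repeating sketch_qp_next_state() from the error state visits
reset, INIT and RTR before reaching RTS.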




