[openib-general] CQ error handling in IPoIB
Moni Shoua
monis at voltaire.com
Wed Jan 3 07:42:10 PST 2007
Todd Rimmer wrote:
>>From: Moni Shoua
>>Sent: Tuesday, January 02, 2007 11:31 AM
>>To: openib-general at openib.org
>>Subject: [openib-general] CQ error handling in IPoIB
>>
>>Hi,
>>I have a question regarding error handling in IPoIB.
>>
>>The spec says...
>>
>>When a CQ encounters an error, in order to be able to use the CQ again,
>>the consumer should:
>>* Destroy all the QPs that are attached to the CQ
>>* Destroy the CQ
>>* Recreate the CQ through the Create Completion Queue verb
>>
>>While (at least one part of) the code does...
>>
>>static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>>{
>>    ...
>>    ...
>>    ...
>>    if (wc->status != IB_WC_SUCCESS &&
>>        wc->status != IB_WC_WR_FLUSH_ERR)
>>        ipoib_warn(priv, "failed send event "
>>                   "(status=%d, wrid=%d vend_err %x)\n",
>>                   wc->status, wr_id, wc->vendor_err);
>>}
>>
>
>
> In this context the spec is referring to CQ errors, not work request
> errors. For example, CQ overflow is considered a CQ error and would
> require the procedure you describe above (destroy QPs, CQ, etc).
>
> However a work request error is a WQE or QP error. As such the CQ does
> not need to be destroyed. Rather the recovery will be limited to QP
> level actions. Typically the QP has moved to the error state and the QP
> must be reset and moved back to RTS to resume operation (or the QP must
> be destroyed and recreated).
>
> If you check section 10.10.3.4 of IBTA 1.2 you will see a list of
> possible errors on a UD QP. Notice that the errors all involve Local
> Protection or Operation errors. Hence they cannot be caused by a remote
> node. Rather, they are only caused by invalid local requests (by IPoIB
> in this case) or possibly by hardware or OS problems (memory stomps,
> multi-bit undetected memory or bus errors, HCA hardware problem, etc).
>
> As you indicate, when such an error occurs, the driver should recreate or
> reset the QP.
>
> Todd Rimmer
>
Thanks for the detailed answer.
I see my mistake in quoting that passage from the spec, but the question
was answered nonetheless.
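
For the archive, here is how I read the distinction Todd describes, as a
minimal sketch in plain C. All names below (the sketch_* enums and
functions) are hypothetical and made up for illustration. They are not the
kernel's ib_verbs identifiers and this is not what IPoIB does today. The
sketch only encodes the decision: an error on the CQ itself calls for the
spec's full procedure, while an error status on a work completion calls
for QP-level recovery only.

```c
#include <assert.h>

/* Hypothetical names for illustration only; not the kernel's ib_verbs API. */

/* Where the error was reported. */
enum sketch_error_source {
    SKETCH_SRC_CQ,  /* error on the CQ itself, e.g. CQ overflow */
    SKETCH_SRC_WC   /* error status carried by one work completion */
};

/* A small subset of completion statuses, enough for the example. */
enum sketch_wc_status {
    SKETCH_WC_SUCCESS,
    SKETCH_WC_LOC_PROT_ERR, /* local protection error */
    SKETCH_WC_LOC_OP_ERR    /* local operation error */
};

/* What the consumer has to do to resume operation. */
enum sketch_recovery {
    SKETCH_RECOVER_NONE,
    SKETCH_RECOVER_QP,         /* reset the QP and move it back to RTS,
                                  or destroy and recreate it */
    SKETCH_RECOVER_CQ_AND_QPS  /* destroy attached QPs, destroy the CQ,
                                  recreate the CQ */
};

/* Pick the recovery scope. The status is ignored for CQ-level errors. */
enum sketch_recovery
sketch_pick_recovery(enum sketch_error_source src,
                     enum sketch_wc_status status)
{
    if (src == SKETCH_SRC_CQ)
        return SKETCH_RECOVER_CQ_AND_QPS;
    if (status == SKETCH_WC_SUCCESS)
        return SKETCH_RECOVER_NONE;
    return SKETCH_RECOVER_QP;
}

/* Usage example: the two cases discussed in this thread. */
void sketch_demo(void)
{
    /* CQ overflow: the whole CQ and its QPs must be rebuilt. */
    assert(sketch_pick_recovery(SKETCH_SRC_CQ, SKETCH_WC_SUCCESS)
           == SKETCH_RECOVER_CQ_AND_QPS);
    /* Failed send completion: the CQ stays, only the QP is recovered. */
    assert(sketch_pick_recovery(SKETCH_SRC_WC, SKETCH_WC_LOC_PROT_ERR)
           == SKETCH_RECOVER_QP);
}
```

Keeping the error source separate from the completion status mirrors the
point above: the status in a work completion says nothing about the health
of the CQ that delivered it.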