[openib-general] CQ error handling in IPoIB
Todd Rimmer
todd.rimmer at qlogic.com
Tue Jan 2 10:39:40 PST 2007
> From: Moni Shoua
> Sent: Tuesday, January 02, 2007 11:31 AM
> To: openib-general at openib.org
> Subject: [openib-general] CQ error handling in IPoIB
>
> Hi,
> I have a question regarding error handling in IPoIB.
>
> The spec says...
>
> When a CQ encounters an error, in order to be able to use the CQ again,
> the consumer should:
> * Destroy all the QPs that are attached to the CQ
> * Destroy the CQ
> * Recreate the CQ through the Create Completion Queue verb
>
> While (at least one part of) the code does...
>
> static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
> {
>     ...
>     ...
>     ...
>     if (wc->status != IB_WC_SUCCESS &&
>         wc->status != IB_WC_WR_FLUSH_ERR)
>         ipoib_warn(priv, "failed send event "
>                    "(status=%d, wrid=%d vend_err %x)\n",
>                    wc->status, wr_id, wc->vendor_err);
> }
>
In this context the spec is referring to CQ errors, not work request
errors. For example, CQ overflow is a CQ error and would require the
procedure you describe above (destroy the QPs, destroy the CQ, then
recreate the CQ). A work request error, however, is a WQE or QP error,
so the CQ does not need to be destroyed and recovery is limited to
QP-level actions. Typically the QP has moved to the Error state, and it
must either be reset and moved back to RTS to resume operation or be
destroyed and recreated.
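
To make the QP-level recovery concrete, here is a minimal sketch of the
reset-and-return-to-RTS sequence. It is a toy model, not driver code:
every name in it (model_qp, model_modify_qp, model_recover_qp, the QPS_*
values) is hypothetical, and the transition table is deliberately
simplified. In the kernel the equivalent steps would go through
ib_modify_qp() with the IB_QPS_* states and the attributes each
transition requires.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the verbs QP states (IB_QPS_* in the kernel). */
enum model_qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR };

struct model_qp {
    enum model_qp_state state;
    int transitions;              /* number of successful state changes */
};

/*
 * Simplified transition check (an assumption of this sketch): any state
 * may move to RESET or ERR; otherwise only the forward path
 * RESET -> INIT -> RTR -> RTS is accepted.
 */
bool model_transition_ok(enum model_qp_state from, enum model_qp_state to)
{
    if (to == QPS_RESET || to == QPS_ERR)
        return true;
    return (from == QPS_RESET && to == QPS_INIT) ||
           (from == QPS_INIT && to == QPS_RTR) ||
           (from == QPS_RTR && to == QPS_RTS);
}

/* Stand-in for a modify-QP call: 0 on success, -1 on an invalid transition. */
int model_modify_qp(struct model_qp *qp, enum model_qp_state to)
{
    if (!model_transition_ok(qp->state, to))
        return -1;
    qp->state = to;
    qp->transitions++;
    return 0;
}

/* Bring a QP back to RTS after a work request error; the CQ is untouched. */
int model_recover_qp(struct model_qp *qp)
{
    static const enum model_qp_state path[] = {
        QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS
    };
    size_t i;

    assert(qp != NULL);
    for (i = 0; i < sizeof(path) / sizeof(path[0]); i++) {
        if (model_modify_qp(qp, path[i]) != 0)
            return -1;
    }
    return 0;
}
```

Calling model_recover_qp() on a QP that is in QPS_ERR walks it through
the four transitions and leaves it in QPS_RTS. Note that nothing in the
sequence refers to the CQ at all.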

If you check section 10.10.3.4 of IBTA 1.2 you will see a list of
possible errors on a UD QP. Notice that the errors all involve Local
Protection or Operation errors, so they cannot be caused by a remote
node. They can only be caused by invalid local requests (from IPoIB in
this case) or possibly by hardware or OS problems (memory stomps,
undetected multi-bit memory or bus errors, HCA hardware problems, etc).
As you indicate, when such an error occurs, the driver should recreate
or reset the QP.
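
One way to picture the resulting policy is a small mapping from the kind
of error to the scope of recovery. The sketch below is only an
illustration: the enums and function names are hypothetical, the status
list is not the spec's list, and treating a flush error as "no new
action" is an assumption that mirrors the check in the code you quoted
(flushed work requests follow from a QP that is already in the Error
state).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical completion statuses, for illustration only. */
enum model_wc_status {
    WC_SUCCESS,
    WC_FLUSH_ERR,
    WC_LOC_PROT_ERR,
    WC_LOC_QP_OP_ERR
};

/* Scope of recovery, ordered from narrowest to widest. */
enum model_recovery { RECOVER_NONE, RECOVER_QP, RECOVER_CQ_AND_QPS };

/* Work request error: recovery stays at the QP level. */
enum model_recovery model_wc_recovery(enum model_wc_status status)
{
    switch (status) {
    case WC_SUCCESS:
    case WC_FLUSH_ERR:   /* assumed to need no new action in this sketch */
        return RECOVER_NONE;
    default:             /* local protection or operation error */
        return RECOVER_QP;
    }
}

/* CQ error such as overflow: destroy the QPs, destroy and recreate the CQ. */
enum model_recovery model_cq_error_recovery(void)
{
    return RECOVER_CQ_AND_QPS;
}

/* Widest recovery scope needed for a batch of polled completions. */
enum model_recovery model_batch_recovery(const enum model_wc_status *statuses,
                                         size_t n)
{
    enum model_recovery worst = RECOVER_NONE;
    size_t i;

    assert(statuses != NULL || n == 0);
    for (i = 0; i < n; i++) {
        enum model_recovery r = model_wc_recovery(statuses[i]);
        if (r > worst)
            worst = r;
    }
    return worst;
}
```

No completion status in this model ever escalates to
RECOVER_CQ_AND_QPS. That scope is reached only through the separate CQ
error path, which in the kernel would be reported as an asynchronous
event rather than through a polled work completion.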
Todd Rimmer