[ofw] crash in mlx4 driver - catastrophic error - NDI modify issue

Tue Mar 24 21:27:59 PDT 2009

>When the system crashes, is it in the process of doing the RESET or INIT
>transition?  I don't think there's a lot of code that does a RESET transition
>so it's possible that is an untested path.

The crash occurs in ipoib trying to handle an HCA reset.  The HCA reset is
caused by a catastrophic error being reported for some reason that is related to
the QP transition from RTS -> RESET -> INIT.  The catastrophic error occurs long
(relatively speaking) after the QP transition has occurred.  The QP in question
has been destroyed, the CQs destroyed, the PD deallocated, memory deregistered,
and the CA closed.

Note that I also see a CQ overrun during the QP transitions that I don't see if
the QP goes from RTS -> ERROR instead.
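
For reference, here is a minimal sketch of the QP state transitions being
compared above.  It is a simplified model, not code from the mlx4 driver: the
names (qp_state, qp_transition_allowed, check_thread_paths) are made up for
illustration, the SQD and SQE states are left out, and the rules are my reading
of the IB spec state machine (any state may go to RESET, any state except
RESET may go to ERROR).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified QP states (SQD and SQE omitted). */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR };

/* Returns true if a modify from 'cur' to 'next' is a legal transition
 * in this simplified model. */
static bool qp_transition_allowed(enum qp_state cur, enum qp_state next)
{
    if (next == QPS_RESET)
        return true;                 /* any state -> RESET */
    if (next == QPS_ERR)
        return cur != QPS_RESET;     /* any state but RESET -> ERROR */

    switch (cur) {
    case QPS_RESET: return next == QPS_INIT;
    case QPS_INIT:  return next == QPS_INIT || next == QPS_RTR;
    case QPS_RTR:   return next == QPS_RTS;
    case QPS_RTS:   return next == QPS_RTS;
    default:        return false;    /* ERROR only goes to RESET or ERROR */
    }
}

/* The two paths compared in this thread: both are legal transitions,
 * so the difference in behavior is not a state machine violation. */
static void check_thread_paths(void)
{
    assert(qp_transition_allowed(QPS_RTS, QPS_RESET));
    assert(qp_transition_allowed(QPS_RESET, QPS_INIT));
    assert(qp_transition_allowed(QPS_RTS, QPS_ERR));
}
```

Both RTS -> RESET -> INIT and RTS -> ERROR pass this check, which is why I'm
looking at what differs in the handling around the transition rather than at
the transition itself.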

>Why should valid CQEs be discarded from the CQ?  The IB spec says "Outstanding
>Work Requests are removed from the queues without notifying the Consumer".

I'm just reporting that there's code that does some cleanup that's not there for
the ndi_modify_qp call, but does exist for kernel QPs and the post_modify_qp
call.  I didn't look into the details of what that code does.  I'm hoping
someone more familiar with the code will look into the two areas I mentioned and
determine whether there's a real problem there or not.

>Discarding the CQEs would likely need to be done in user-mode.  It would be
>possible for the UVP to issue the RESET transition synchronously, perform the
>CQE cleanup, and then do a 'noop' async operation that simply returns the
>status.  The overhead of the extra IOCTL is likely minimal compared to the QP
>modify HCA command processing time.
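
To make the quoted proposal concrete, here is a rough sketch of that sequence:
a synchronous RESET, the CQE cleanup, then a 'noop' async operation that only
returns the status.  Everything in it is hypothetical: the CQ is a plain
software ring rather than the real mlx4 CQ layout, and sync_modify_to_reset,
async_noop_complete, and uvp_reset_qp are stand-ins, not existing UVP or
winverbs calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CQ_SIZE 8u  /* hypothetical ring size */

struct cqe { uint32_t qpn; uint32_t wr_id; };

struct cq {
    struct cqe ring[CQ_SIZE];
    uint32_t head;  /* consumer index: next entry to poll */
    uint32_t tail;  /* producer index: next free slot */
};

struct uvp_qp {
    uint32_t qpn;
    int in_reset;           /* 1 once the QP has been moved to RESET */
    struct cq *send_cq;
    struct cq *recv_cq;
};

/* Queue a completion; returns -1 on overrun (ring already full). */
static int cq_push(struct cq *cq, uint32_t qpn, uint32_t wr_id)
{
    if (cq->tail - cq->head == CQ_SIZE)
        return -1;
    cq->ring[cq->tail % CQ_SIZE] = (struct cqe){ qpn, wr_id };
    cq->tail++;
    return 0;
}

/* Drop every pending CQE that belongs to 'qpn', keeping the remaining
 * entries in their original order.  Returns the number discarded. */
static uint32_t cq_discard_qpn(struct cq *cq, uint32_t qpn)
{
    uint32_t src, dst, dropped = 0;

    assert(cq != NULL);
    dst = cq->head;
    for (src = cq->head; src != cq->tail; src++) {
        struct cqe e = cq->ring[src % CQ_SIZE];
        if (e.qpn == qpn) {
            dropped++;
            continue;
        }
        cq->ring[dst % CQ_SIZE] = e;
        dst++;
    }
    cq->tail = dst;
    return dropped;
}

/* Stand-in for the synchronous RESET modify IOCTL. */
static int sync_modify_to_reset(struct uvp_qp *qp)
{
    qp->in_reset = 1;
    return 0;
}

/* Stand-in for the 'noop' async operation that simply returns the status. */
static int async_noop_complete(int status)
{
    return status;
}

/* The proposed order: RESET first, then cleanup, then report status. */
static int uvp_reset_qp(struct uvp_qp *qp)
{
    int status = sync_modify_to_reset(qp);

    if (status == 0) {
        cq_discard_qpn(qp->recv_cq, qp->qpn);
        if (qp->send_cq != qp->recv_cq)
            cq_discard_qpn(qp->send_cq, qp->qpn);
    }
    return async_noop_complete(status);
}
```

The cleanup only runs after the RESET has succeeded, so no new completions for
that QP can be queued while the ring is being compacted.  Whether the real CQ
can be compacted this way from user-mode is exactly the part that would need
checking.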

Eventually, I would like the UVP to issue commands directly to the winverbs
driver and avoid the entire pre/post overhead.  That's much longer term...

- Sean