[ofw] crash in mlx4 driver - catastrophic error - NDI modify issue

Fab Tillier ftillier at windows.microsoft.com
Tue Mar 24 18:11:56 PDT 2009


> An odd way to end the test, to be sure, but it still shouldn't crash the
> system. I hacked libibverbs to skip the RESET and INIT transitions and
> go directly to ERROR.  This avoids the catastrophic error and related
> crashes.  (Going from RTS->ERROR->RESET->INIT->ERROR still hits the
> error.)

When the system crashes, is it in the process of doing the RESET or INIT transition?  I don't think there's a lot of code that does a RESET transition, so it's possible that this is an untested path.

> While tracing through the code, I noticed at least one issue with the
> ndi_modify_qp() call.  When transitioning to RESET, all completions
> should be silently discarded by the HCA.  This doesn't occur because the
> CQ cleanup gets skipped in the kernel, since it's a userspace QP, but
> there's no code in userspace to perform the cleanup, similar to what
> post_modify_qp() does.

Why should valid CQEs be discarded from the CQ?  The IB spec says "Outstanding Work Requests are removed from the queues without notifying the Consumer".

> I don't know if this leads to the catastrophic error or not.  I also
> don't know if the cleanup can be done in the kernel, or if it requires
> userspace to do it. (I'm guessing the latter, but whether CQ entries
> get generated or not seems like a pretty minor issue.  There seems
> like a race with this.)

Discarding the CQEs would likely need to be done in user-mode.  It would be possible for the UVP to issue the RESET transition synchronously, perform the CQE cleanup, and then do a 'noop' async operation that simply returns the status.  The overhead of the extra IOCTL is likely minimal compared to the QP modify HCA command processing time.
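
As a rough illustration of that sequence, here is a minimal sketch in C.  It is a simplified model, not the mlx4 or UVP code: struct cqe, struct cq, cq_clean(), reset_qp_with_cleanup() and fake_reset_ok() are all hypothetical names, and the CQ is modeled as a plain ring in user memory.  The RESET is issued synchronously through a callback, the CQEs for that QP are then compacted out of the CQ(s), and the returned status is what the follow-up 'noop' async request would hand back.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified completion entry; not the hardware CQE layout. */
struct cqe {
    uint32_t qpn;    /* QP number that generated the completion */
    uint64_t wr_id;  /* work request ID */
};

/* Hypothetical user-mode view of a CQ as a ring buffer. */
struct cq {
    struct cqe *ring;
    size_t size;   /* number of slots in the ring */
    size_t head;   /* slot index of the oldest unpolled CQE */
    size_t count;  /* number of unpolled CQEs */
};

/*
 * Remove every pending CQE that belongs to qpn, preserving the order of
 * the remaining entries.  Entries are compacted toward the head, so the
 * destination slot never runs ahead of the source slot.
 * Returns the number of CQEs removed.
 */
size_t cq_clean(struct cq *cq, uint32_t qpn)
{
    size_t kept = 0;

    for (size_t i = 0; i < cq->count; i++) {
        struct cqe *src = &cq->ring[(cq->head + i) % cq->size];

        if (src->qpn == qpn)
            continue;  /* drop completions for the QP being reset */

        cq->ring[(cq->head + kept) % cq->size] = *src;
        kept++;
    }

    size_t removed = cq->count - kept;
    cq->count = kept;
    return removed;
}

/*
 * Sketch of the proposed flow:
 *   1. issue the RESET transition synchronously (modeled by a callback),
 *   2. on success, clean the QP's CQEs out of its send and receive CQs,
 *   3. return the status, which the 'noop' async operation would report.
 */
int reset_qp_with_cleanup(uint32_t qpn, struct cq *send_cq, struct cq *recv_cq,
                          int (*modify_to_reset)(uint32_t qpn))
{
    int status = modify_to_reset(qpn);

    if (status == 0) {
        cq_clean(send_cq, qpn);
        if (recv_cq != send_cq)
            cq_clean(recv_cq, qpn);
    }
    return status;
}

/* Stand-in for the synchronous RESET request; always reports success. */
int fake_reset_ok(uint32_t qpn)
{
    (void)qpn;
    return 0;
}
```

The sketch ignores locking against a concurrent poll of the CQ, which a real implementation would need since the cleanup rewrites entries the consumer may be reading.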

-Fab


