[ofw] crash in mlx4 driver - catastrophic error - NDI modify issue
Fab Tillier
ftillier at windows.microsoft.com
Tue Mar 24 18:11:56 PDT 2009
> An odd way to end the test, to be sure, but it still shouldn't crash the
> system. I hacked libibverbs to skip the RESET and INIT transitions and
> go directly to ERROR. This avoids the catastrophic error and related
> crashes. (Going from RTS->ERROR->RESET->INIT->ERROR still hits the
> error.)
When the system crashes, is it in the middle of the RESET transition or the INIT transition? Not much code performs a RESET transition, so that could well be an untested path.
> While tracing through the code, I noticed at least one issue with the
> ndi_modify_qp() call. When transitioning to RESET, all completions
> should be silently discarded by the HCA. This doesn't occur because the
> CQ cleanup gets skipped in the kernel, since it's a userspace QP, but
> there's no code in userspace to perform the cleanup, similar to what
> post_modify_qp() does.
Why should valid CQEs be discarded from the CQ? The IB spec says "Outstanding Work Requests are removed from the queues without notifying the Consumer".
> I don't know if this leads to the catastrophic error or not. I also
> don't know if the cleanup can be done in the kernel, or if it requires
> userspace to do it. (I'm guessing the latter, but whether CQ entries
> get generated or not seems like a pretty minor issue. There seems
> like a race with this.)
Discarding the CQEs would likely need to be done in user-mode. It would be possible for the UVP to issue the RESET transition synchronously, perform the CQE cleanup, and then do a 'noop' async operation that simply returns the status. The overhead of the extra IOCTL is likely minimal compared to the QP modify HCA command processing time.
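To make that sequence concrete, here is a rough sketch in C. It is not the mlx4 UVP code: the CQ and CQE layouts are simplified stand-ins, and every name in it (cq_clean, sync_modify_qp_to_reset, async_noop_complete, uvp_reset_qp) is hypothetical. The two IOCTL helpers are stubs that only mark where the real kernel calls would go. It also ignores the race with the HCA writing new CQEs, which a real implementation would have to handle.

```c
#include <stddef.h>
#include <stdint.h>

#define CQ_DEPTH 8

/* Hypothetical, simplified CQE and CQ; NOT the real mlx4 layouts. */
struct cqe {
    uint32_t qpn;    /* QP number the completion belongs to */
    uint64_t wr_id;  /* work request ID */
};

struct cq {
    struct cqe ring[CQ_DEPTH];
    size_t head;     /* index of the next CQE the consumer would poll */
    size_t count;    /* number of valid CQEs currently in the ring */
};

/*
 * Drop every CQE that belongs to qpn and compact the survivors in place,
 * preserving their order. Returns the number of CQEs discarded.
 */
static size_t cq_clean(struct cq *cq, uint32_t qpn)
{
    size_t kept = 0;

    for (size_t i = 0; i < cq->count; i++) {
        struct cqe e = cq->ring[(cq->head + i) % CQ_DEPTH];
        if (e.qpn != qpn) {
            /* kept <= i, so this never overwrites an unread entry */
            cq->ring[(cq->head + kept) % CQ_DEPTH] = e;
            kept++;
        }
    }

    size_t dropped = cq->count - kept;
    cq->count = kept;
    return dropped;
}

/* Stub (assumption): synchronous IOCTL that moves the QP to RESET. */
static int sync_modify_qp_to_reset(uint32_t qpn)
{
    (void)qpn;
    return 0;
}

/* Stub (assumption): 'noop' async operation that just reports status. */
static int async_noop_complete(int status)
{
    return status;
}

/* The three steps: sync RESET, user-mode CQE cleanup, async noop. */
static int uvp_reset_qp(uint32_t qpn, struct cq *send_cq, struct cq *recv_cq)
{
    int status = sync_modify_qp_to_reset(qpn);

    if (status == 0) {
        cq_clean(recv_cq, qpn);
        if (send_cq != recv_cq)
            cq_clean(send_cq, qpn);
    }
    return async_noop_complete(status);
}
```

The cleanup only runs if the RESET succeeded, so a failed transition leaves the CQs untouched and the caller still gets the status back through the async path it expects.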
-Fab