[ofw] crash in mlx4 driver - catastrophic error - NDI modify issue
Sean Hefty
sean.hefty at intel.com
Tue Mar 24 16:42:58 PDT 2009
I took a detour to get dtest running between Linux and Windows. The
catastrophic error is related to the client side of dtest and occurs because of
the following QP transitions after the test completes:
QP transitions: RTS -> RESET -> INIT -> ERROR
An odd way to end the test, to be sure, but it still shouldn't crash the system.
I hacked libibverbs to skip the RESET and INIT transitions and go directly to
ERROR. This avoids the catastrophic error and related crashes. (Going from
RTS->ERROR->RESET->INIT->ERROR still hits the error.)
While tracing through the code, I noticed at least one issue with the
ndi_modify_qp() call. When transitioning to RESET, all completions should be
silently discarded by the HCA. This doesn't occur: the kernel skips the CQ
cleanup because it's a userspace QP, and there's no code in userspace to perform
the cleanup the way post_modify_qp() does.
I don't know if this leads to the catastrophic error or not. I also don't know
if the cleanup can be done in the kernel, or if it requires userspace to do it.
(I'm guessing the latter, but whether CQ entries get generated or not seems like
a pretty minor issue. There also seems to be a race here.)
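
To make the missing step concrete, here is a rough sketch of what "discard
completions on RESET" amounts to. It is a toy model, not driver code: the
toy_cq and toy_cqe types and toy_cq_clean() are made up for illustration, and a
real mlx4 CQ is a hardware ring with ownership bits and a consumer index that
this ignores entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_DEPTH 16

/* Hypothetical completion entry: just enough to tell QPs apart. */
struct toy_cqe {
    uint32_t qpn;    /* QP number this completion belongs to */
    uint32_t wr_id;  /* stand-in for the work request it completes */
};

/* Hypothetical CQ: a flat array of unpolled completions. */
struct toy_cq {
    struct toy_cqe entries[TOY_CQ_DEPTH];
    size_t count;    /* number of unpolled completions */
};

/*
 * Drop every unpolled completion that belongs to qpn and keep the rest in
 * their original order. Returns the number of entries discarded.
 */
size_t toy_cq_clean(struct toy_cq *cq, uint32_t qpn)
{
    size_t kept = 0;
    size_t dropped;

    assert(cq != NULL && cq->count <= TOY_CQ_DEPTH);

    for (size_t i = 0; i < cq->count; i++) {
        if (cq->entries[i].qpn != qpn)
            cq->entries[kept++] = cq->entries[i];
    }

    dropped = cq->count - kept;
    cq->count = kept;
    return dropped;
}
```

In this model the userspace library would call something like toy_cq_clean()
on both the send and receive CQs after a successful transition to RESET. The
real version would have to do this under the CQ lock, which is where the race
I mentioned would have to be dealt with.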
I also have a note that the kernel clears some 'ownership bits' when
transitioning from RESET to INIT (see __mlx4_ib_modify_qp). This also gets
skipped for userspace QPs. I don't know if this is needed or not.
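
For reference, the RESET to INIT step I have in mind looks roughly like the
sketch below. Again this is a toy model with made-up names (toy_sq,
toy_send_wqe, toy_sq_reset_ownership); the bit position and which value counts
as the initial ownership state are assumptions for illustration, not taken from
the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SQ_DEPTH  8
/* Assumed position of the ownership flag in the first WQE word. */
#define TOY_OWNER_BIT (UINT32_C(1) << 31)

/* Hypothetical send WQE: only the word carrying the ownership flag. */
struct toy_send_wqe {
    uint32_t owner_opcode;  /* ownership flag plus opcode bits */
};

struct toy_sq {
    struct toy_send_wqe wqe[TOY_SQ_DEPTH];
};

/*
 * Put every send WQE back into the state assumed right after QP creation,
 * so that stale entries from before the RESET are not mistaken for new work.
 */
void toy_sq_reset_ownership(struct toy_sq *sq)
{
    assert(sq != NULL);

    for (size_t i = 0; i < TOY_SQ_DEPTH; i++)
        sq->wqe[i].owner_opcode = TOY_OWNER_BIT;
}
```

If this step does turn out to be needed for userspace QPs, it would have to run
in the userspace library, since that is where the send queue buffer is managed.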
- Sean