[ofw] crash in mlx4 driver - or maybe it's an ipoib issue?

Fab Tillier ftillier at windows.microsoft.com
Mon Mar 16 11:06:29 PDT 2009


> As for the ipoib code: I haven't looked at it thoroughly, but the driver
> is not ready to handle a fatal error of the hw. For example, if sends are
> being posted on a QP, then there is no way for it to free this memory
> unless it reads the completion from the CQ. Changing this is not only
> complicated but also has a performance impact.

I would think that many ULPs have this problem.  Why can't the HCA driver keep track of outstanding work requests (it already does, to prevent QP overrun), and then generate synthetic completions to the CQ (which would be safe, since the HW stops writing to it) and report them?  This seems like minimal logic that would let the ULPs handle a fatal HW error the same way as a failed completion (which is generally already handled).
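
To make the idea concrete, here is a rough sketch in Python of a toy model, not driver code. All names (QueuePair, on_fatal_error, ulp_poll, the status strings) are made up for illustration and do not correspond to the mlx4 or ipoib sources. It assumes the driver keeps a per-QP list of outstanding work request IDs and that the ULP frees a buffer whenever it polls a completion, whatever its status.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical status codes for this sketch; real verbs APIs define their own.
WC_SUCCESS = "success"
WC_FLUSH_ERR = "flushed"


@dataclass
class Completion:
    wr_id: int
    status: str


@dataclass
class QueuePair:
    depth: int
    cq: deque  # completion queue that the ULP polls
    outstanding: deque = field(default_factory=deque)

    def post_send(self, wr_id):
        # The driver already tracks outstanding requests to prevent overrun.
        if len(self.outstanding) >= self.depth:
            raise OverflowError("QP overrun")
        self.outstanding.append(wr_id)

    def hw_complete(self):
        # Normal path: the hardware completes the oldest work request.
        self.cq.append(Completion(self.outstanding.popleft(), WC_SUCCESS))

    def on_fatal_error(self):
        # The hardware no longer writes to the CQ, so the driver can append
        # a synthetic error completion for every request still outstanding.
        while self.outstanding:
            self.cq.append(Completion(self.outstanding.popleft(), WC_FLUSH_ERR))


def ulp_poll(cq, buffers):
    # ULP side: release the buffer for each completion, failed or not.
    while cq:
        wc = cq.popleft()
        buffers.pop(wc.wr_id)
```

With this shape the ULP needs no separate fatal-error path: after on_fatal_error() runs, the usual polling loop drains the CQ and frees every send buffer, the same way it would for an ordinary failed completion.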

Further, the HCA driver could, when recovering from a fatal error, set up certain resources again without a problem.  Memory registrations could be restored, as could completion queues and UD QPs.  This same logic could be used to improve sleep/resume functionality too.
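
One way the restore step could look, again as a toy Python model under stated assumptions: the driver journals the parameters of every resource it creates and replays the journal after a reset. HcaDriver, the kind strings, and the parameter names are hypothetical. Treating any kind outside the three named above (for example "rc_qp") as not restorable is an assumption of this sketch.

```python
from dataclasses import dataclass, field

# Resource kinds assumed restorable after a reset, per the paragraph above.
RESTORABLE = {"mr", "cq", "ud_qp"}


@dataclass
class HcaDriver:
    journal: dict = field(default_factory=dict)     # handle -> (kind, params)
    hw_objects: dict = field(default_factory=dict)  # handle -> params held by "hardware"
    next_handle: int = 1

    def create(self, kind, **params):
        # Record the creation parameters so the object can be rebuilt later.
        handle = self.next_handle
        self.next_handle += 1
        self.journal[handle] = (kind, params)
        self.hw_objects[handle] = params
        return handle

    def fatal_error(self):
        # A fatal error (or a sleep transition) loses all hardware state.
        self.hw_objects.clear()

    def recover(self):
        # Replay the journal; return the handles that could not be restored
        # so the owning ULP can be notified about just those.
        lost = []
        for handle, (kind, params) in self.journal.items():
            if kind in RESTORABLE:
                self.hw_objects[handle] = params
            else:
                lost.append(handle)
        return lost
```

Because handles stay the same across the reset, a ULP holding only restorable resources would not need to notice the recovery at all, and the same replay could serve a resume from sleep.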

I'd really rather not see fatal error handling get duplicated in every ULP when it could be handled in a single location, and potentially handled more intelligently.

-Fab



