[ofa-general] Re: System crashed while booting Linux (ia64) with three Mellanox HCAs (15b3:6274)

Roland Dreier rdreier at cisco.com
Fri Mar 27 16:48:25 PDT 2009


 > I spent the last couple of days retracing my steps.  In my haste, I
 > listed the wrong HCA firmware revision.  It was firmware 1.2.940 that
 > caused the system to crash while booting to Linux.  I have the mthca
 > driver built into the kernel; it is not a loadable driver.  The system
 > boots fine with the 1.2.0 firmware.

Oh, it depends on the mthca firmware version?  That's a big clue: you're
using mem-free firmware, which means the HCA uses system memory to store
big chunks of internal state.  If something is going wrong with how the
memory is mapped to the HCA (or how the HCA writes to it) then that
could cause memory corruption -- possibly tied to posting receives to
the hardware as part of the MAD initialization.

So it could be a driver bug exposed by the new firmware, or a firmware bug.
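
Since the crash tracks the firmware revision, it may help to confirm which
firmware each of the three HCAs is actually running once the system is up
on the working image.  A minimal sketch, assuming the devices appear under
/sys/class/infiniband and each exposes a fw_ver attribute in dotted-integer
form; the helper names parse_fw_ver and read_fw_versions are made up for
illustration and are not part of any existing tool:

```python
import os

SYSFS_IB = "/sys/class/infiniband"  # assumed sysfs location of IB devices


def parse_fw_ver(text):
    """Turn a dotted version such as '1.2.940' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def read_fw_versions(root=SYSFS_IB):
    """Map each device name under root to its parsed fw_ver, if readable."""
    versions = {}
    if not os.path.isdir(root):
        return versions
    for dev in sorted(os.listdir(root)):
        path = os.path.join(root, dev, "fw_ver")
        try:
            with open(path) as f:
                versions[dev] = parse_fw_ver(f.read())
        except (OSError, ValueError):
            continue  # attribute missing or not in dotted-integer form
    return versions


if __name__ == "__main__":
    for dev, ver in read_fw_versions().items():
        print(dev, ".".join(str(n) for n in ver))
```

Parsing into integer tuples means 1.2.940 and 1.2.0 compare numerically
rather than as strings, so a mixed set of HCAs is easy to spot.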

Is Mellanox following this bug?  Maybe they have some idea of how to
figure out what the HCA is doing that could crash a system.

 - R.


