[ofa-general] Re: System crashed while booting Linux (ia64) with three Mellanox HCAs (15b3:6274)

Phillip Wilson phillipwils at gmail.com
Fri Mar 27 16:36:16 PDT 2009


I spent the last couple of days retracing my steps.  In my haste, I
listed the wrong HCA firmware revision.  It was firmware 1.2.940 that
caused the system to crash while booting Linux.  I have the mthca
driver built into the kernel; it is not a loadable module.  The system
boots fine with the 1.2.0 firmware.

You are correct; the system crash does not make sense, since the stack
was okay a few instructions earlier.  I am currently looking at the
error dump of the system log to see if I can find out more.  There does
not appear to be a timing issue in the driver function
ib_mad_post_receive_mads(): adding debug printk messages to this
function does not make the crash go away.


On Thu, Mar 26, 2009 at 8:54 AM, Roland Dreier <rdreier at cisco.com> wrote:
>  > System crashes with three Mellanox mezzanine cards (VID=15b3,
>  > DID=0x6274) installed when booting Linux (ia64).  I am using Linux
>  > 2.6.24, but this issue also occurs with Linux kernel 2.6.29-rc8.
>
> this is a pretty interesting crash.  Do you have the ib_mthca driver
> built into your kernel, or is it being loaded as a module?
>
>  > A partial listing from ib_mad_post_receive_mad.S is posted below the "C" code.
>  > The exact instruction that caused the system crash was located at
>  >
>  > ib_mad_post_*+0x0080           st4              [r2]=r3                      MII
>  >                                nop.i            0x0
>  >                                nop.i            0x0
>  >
>  > It tries to store r3=0x1600 to [r2] @ 0xE0000007E01C7CCC.
>
> Looking at the assembly, it seems the relevant parts are:
>
> ib_mad_post_*+0x0060           ld4              r3=[r11]                     MMI
>                               st8              [r2]=r8
>                               adds             r2=28,r12
> ib_mad_post_*+0x0070           st4              [r9]=r10                     MMI
>                               st8              [r45]=r0
>                               nop.i            0x0;;
> ib_mad_post_*+0x0080           st4              [r2]=r3                      MII
>
> The main points are "adds r2=28,r12" -- ie r2 now points into the
> stack -- and "st4 [r2]=r3" -- ie a store onto the stack is crashing.
>
> In the same function, we have "adds r9=56,r12" and "st4 [r9]=r10"
> slightly earlier, so the stack isn't totally messed up (apparently).
>
> Not sure how to debug this since the crash as it stands doesn't seem to
> make much sense...
>
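
The address arithmetic behind the quoted reasoning can be checked with a
small C sketch.  This is only an illustration: the helper names are
hypothetical, the page size is an assumption (ia64 kernels can be built
with several page sizes), and it assumes r12 does not change between
"adds r9=56,r12" and "adds r2=28,r12".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Recover the stack pointer (r12) from an address that was formed by
 * "adds rX=offset,r12".  Hypothetical helper, for illustration only. */
static uint64_t sp_from_slot(uint64_t slot_addr, uint64_t offset)
{
    return slot_addr - offset;
}

/* True if addr is naturally aligned for an access of `size` bytes.
 * `size` must be a power of two (4 for st4, 8 for st8). */
static bool is_naturally_aligned(uint64_t addr, uint64_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) == 0;
}

/* True if both addresses fall in the same page of `page_size` bytes.
 * `page_size` must be a power of two; the value passed in is an
 * assumption about the kernel configuration. */
static bool same_page(uint64_t a, uint64_t b, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (a & ~(page_size - 1)) == (b & ~(page_size - 1));
}
```

With the faulting address 0xE0000007E01C7CCC and the offset 28,
sp_from_slot() gives r12 = 0xE0000007E01C7CB0, so the earlier store
through r9 went to r12+56 = 0xE0000007E01C7CE8.  The faulting address is
4-byte aligned, which is what st4 needs, and with an assumed 16 KB page
both stores land in the same page.  That is consistent with the point
above that the stack itself does not look corrupted.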


