[ofa-general] Re: System crashed while booting Linux (ia64) with three Mellanox HCAs (15b3:6274)
Phillip Wilson
phillipwils at gmail.com
Fri Mar 27 16:36:16 PDT 2009
I spent the last couple of days retracing my steps. In my haste, I
listed the wrong HCA firmware revision. It was firmware 1.2.940 that
caused the system to crash while booting Linux. I have the mthca
driver built into the kernel; it is not a loadable driver. The system
boots fine with the 1.2.0 firmware.
You are correct; the crash does not make sense, since the stack was
okay a few instructions earlier. I am currently looking at the error
dump in the system log to see if I can find out more. It does not look
like a timing issue in the driver function ib_mad_post_receive_mads():
adding debug printk messages to this function does not make the crash
go away.
On Thu, Mar 26, 2009 at 8:54 AM, Roland Dreier <rdreier at cisco.com> wrote:
> > System crashes with three Mellanox mezzanine cards (VID=15b3,
> > DID=0x6274) installed when booting Linux (ia64). I am using Linux
> > 2.6.24, but this issue also occurs with Linux kernel 2.6.29-rc8.
>
> this is a pretty interesting crash. Do you have the ib_mthca driver
> built into your kernel, or is it being loaded as a module?
>
> > A partial listing from ib_mad_post_receive_mad.S is posted below the "C" code.
> > The exact instruction that caused the system crash was located at
> >
> > ib_mad_post_*+0x0080 st4 [r2]=r3 MII
> > nop.i 0x0
> > nop.i 0x0
> >
> > It tries to store r3=0x1600 to [r2] @ 0xE0000007E01C7CCC.
>
> Looking at the assembly, it seems the relevant parts are:
>
> ib_mad_post_*+0x0060 ld4 r3=[r11] MMI
> st8 [r2]=r8
> adds r2=28,r12
> ib_mad_post_*+0x0070 st4 [r9]=r10 MMI
> st8 [r45]=r0
> nop.i 0x0;;
> ib_mad_post_*+0x0080 st4 [r2]=r3 MII
>
> The main points are "adds r2=28,r12" -- ie r2 now points into the
> stack -- and "st4 [r2]=r3" -- ie a store onto the stack is crashing.
>
> In the same function, we have "adds r9=56,r12" and "st4 [r9]=r10"
> slightly earlier, so the stack isn't totally messed up (apparently).
>
> Not sure how to debug this since the crash as it stands doesn't seem to
> make much sense...
>
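For reference, the address arithmetic behind the analysis quoted above can
be checked with a short script. This is only a sketch: it assumes r12 (the
ia64 stack pointer) does not change between "adds r2=28,r12" and the
faulting "st4 [r2]=r3", and the 4 KiB page size is a conservative guess,
since the page size of this kernel was not reported. The names in the
script are made up for illustration.

```python
# Values taken from the crash report quoted in this thread.
FAULT_ADDR = 0xE0000007E01C7CCC  # target of the faulting "st4 [r2]=r3"
R2_OFFSET = 28                   # from "adds r2=28,r12"
R9_OFFSET = 56                   # from "adds r9=56,r12"

# Assumption: smallest page size an ia64 Linux kernel can be built with.
# The real page size of the crashing kernel is not known from the report.
ASSUMED_PAGE_SIZE = 4096


def page_base(addr, page_size=ASSUMED_PAGE_SIZE):
    """Return the base address of the page containing addr."""
    return addr & ~(page_size - 1)


# If r12 is unchanged, the stack pointer implied by the faulting address:
sp = FAULT_ADDR - R2_OFFSET

# Address of the earlier store "st4 [r9]=r10", which did not fault:
r9_addr = sp + R9_OFFSET

print("implied r12 (sp):     ", hex(sp))
print("earlier store address:", hex(r9_addr))
print("sp 16-byte aligned:   ", sp % 16 == 0)
print("st4 target aligned:   ", FAULT_ADDR % 4 == 0)
print("same assumed page:    ", page_base(FAULT_ADDR) == page_base(r9_addr))
```

Under those assumptions the faulting store lands 28 bytes below the earlier
store that succeeded, it is naturally aligned for a 4-byte store, and both
addresses fall in the same page. That fits the observation that the stack
itself looks sane at the point of the crash.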