[openfabrics-ewg] Link Initialization problem and hangs in MTHCA on OFED-1.0

Fri Jun 23 14:36:13 PDT 2006

I was corresponding with Hal Rosenstock about this problem, but he 
suggested that I resubmit it to a wider audience. The previous messages are 
under the subject 'How do I use "madeye" to diagnose a problem?'. I 
was trying to use "madeye" to find out whether any MAD packets were being 
received by a node on which the link fails to initialize.

I have a small two-node testbed system which consists of two EM64T 
machines ("koa" and "jatoba") cabled back-to-back with two Mellanox 
MT25204 (4x DDR) HCAs.   This configuration worked with a backported 
2.6.11-34 kernel and revision 6500 from the OpenIB svn trunk.   I was able 
to run basic tests and several sets of MPI benchmarks.

Since moving to a "2.6.16" kernel and the OFED-1.0 release, we cannot get 
the link on the "jatoba" machine to come up. The "madeye" module seems 
to show that no MAD packets are being received on "jatoba" when the Subnet 
Manager is run on the other machine. When I try to run the SM on "jatoba", 
or any other program that uses MADs, the process hangs. Here is a 
portion of the stack trace for one of the hung processes, obtained by 
doing "echo t > /proc/sysrq-trigger" and looking at the dmesg output.

ibis          D 0000000000000003     0  5489   5097  5522 (NOTLB)
ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640
       ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418
       ffff8100788c6000 ffff8100788c7cb8
Call Trace: <ffffffff803c1b65>{_spin_lock_irqsave+14}
       <ffffffff801350ce>{lock_timer_base+27}
       <ffffffff880c4a0d>{:ib_mthca:mthca_table_put+65}
       <ffffffff803c1c20>{_spin_unlock_irq+9}
       <ffffffff803bfd5f>{wait_for_completion+179}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
       <ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
       <ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564}
       <ffffffff80140025>{autoremove_wake_function+0}
       <ffffffff80180d4d>{do_ioctl+45}
       <ffffffff80181034>{vfs_ioctl+658}
       <ffffffff8018948e>{mntput_no_expire+28}
       <ffffffff80181083>{sys_ioctl+60}
       <ffffffff8010aa52>{system_call+126}

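The process is in state "D" and the trace passes through 
wait_for_completion under ib_unregister_mad_agent, so it seems to be a 
lock or mutex problem, but I don't know how to proceed from here.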
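
For reading a trace like the one above, a small script can help separate the module frames from the core kernel frames. The following is only a sketch: it assumes the "<address>{symbol+offset}" frame format shown in this dmesg output, and the names parse_frames and module_frames are made up for illustration. Frames in such a trace can include stale stack entries, so the output is a starting point, not a verdict.

```python
import re

# One frame in the format seen in the dmesg output: <address>{symbol+offset}.
# A frame from a loadable module carries a ":module:" prefix before the symbol.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(trace_text):
    """Return (module, symbol, offset) tuples in the order they appear."""
    frames = []
    for _addr, module, symbol, offset in FRAME_RE.findall(trace_text):
        # Frames without a module prefix belong to the core kernel image.
        frames.append((module or "kernel", symbol, int(offset)))
    return frames


def module_frames(frames):
    """Keep only the frames that come from loadable modules."""
    return [frame for frame in frames if frame[0] != "kernel"]


# Three frames copied from the trace in this message.
sample = """
<ffffffff803bfd5f>{wait_for_completion+179}
<ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
<ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564}
"""

for module, symbol, offset in parse_frames(sample):
    print(f"{module:10s} {symbol}+{offset}")
```

Feeding the whole saved dmesg text to parse_frames gives one tuple per frame, which makes it easier to see which ib_* functions sit around the wait_for_completion call.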

Some things I have tried are:

Connecting the two machines to a switch instead of back-to-back,  to use 
the SM in the switch.  The link to "koa" comes up, but the link to 
"jatoba" does not.

Physically swapping the two HCAs between the two machines:   the problem 
stays on the "jatoba" side.

Turning on "debug_level" traces with "modprobe ib_mthca debug_level=1" on 
both machines.   The traces seem to be identical on both, except for the 
actual PCI bus location and the memory addresses being mapped.  No 
additional traces are generated when the hangs occur.
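
The comparison of the two debug logs can be made less error-prone by masking the values that are expected to differ (PCI bus location and mapped addresses) before diffing. This is a rough sketch under assumptions: the regular expressions are guesses at how those values appear in a log, the function names are hypothetical, and the sample lines are invented placeholders, not real ib_mthca output.

```python
import difflib
import re

# PCI addresses such as "06:00.0" or "0000:06:00.0".
PCI_RE = re.compile(
    r"\b(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b", re.IGNORECASE
)
# Runs of 8 to 16 hex digits, optionally prefixed with 0x.
HEX_RE = re.compile(r"\b(?:0x)?[0-9a-f]{8,16}\b", re.IGNORECASE)


def normalize(line):
    """Mask machine-specific values so only structural differences remain."""
    line = PCI_RE.sub("<pci>", line.rstrip("\n"))
    return HEX_RE.sub("<addr>", line)


def differing_lines(a_lines, b_lines):
    """Return (lines only in a, lines only in b) pairs after normalization."""
    a = [normalize(line) for line in a_lines]
    b = [normalize(line) for line in b_lines]
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Invented sample lines, for illustration only.
koa_log = ["probe 0000:06:00.0", "region at 0xfe700000"]
jatoba_log = ["probe 0000:03:00.0", "region at 0xdd800000", "extra step"]

for only_a, only_b in differing_lines(koa_log, jatoba_log):
    print("only on koa:   ", only_a)
    print("only on jatoba:", only_b)
```

An empty result means the two logs differ only in the masked values, which matches what I saw by eye.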

The machines are both EM64T but are not identical.  The "koa" side has the 
HCA on PCI "06:00.0",  and the "jatoba" side has the HCA on "03:00.0". The 
two machines are:

   koa (the working one) is an Intel SE7520BD2 motherboard (7520 chip set).
   jatoba (the bad one) is an Intel SE7525GP2 motherboard (7525 chip set).

Can anyone suggest what to try or look at next?

        -Don Albert-