[openfabrics-ewg] Link Initialization problem and hangs in MTHCA on OFED-1.0
Don.Albert at Bull.com
Fri Jun 23 14:36:13 PDT 2006
I was corresponding with Hal Rosenstock about this problem, but he
suggested that I resubmit it to a wider audience. The previous messages are
under the subject 'How do I use "madeye" to diagnose a problem?'. I
was trying to use "madeye" to find out whether any MAD packets were being
received by a node on which the link fails to initialize.
I have a small two-node testbed system which consists of two EM64T
machines ("koa" and "jatoba") cabled back-to-back with two Mellanox
MT25204 (4x DDR) HCAs. This configuration worked with a backported
2.6.11-34 kernel and revision 6500 from the OpenIB svn trunk. I was able
to run basic tests and several sets of MPI benchmarks.
Since moving to a 2.6.16 kernel and the OFED-1.0 release, we cannot get
the link on the "jatoba" machine to come up. The "madeye" module seems
to show that no MAD packets are being received when the Subnet Manager is
run on the other machine. When I try to run the SM on "jatoba", or any
other program that uses MADs, the process hangs. Here is a portion of the
stack trace for one of the hung processes, obtained by running
"echo t > /proc/sysrq-trigger" and looking at the dmesg output:
ibis D 0000000000000003 0 5489 5097 5522 (NOTLB)
ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640
ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418
ffff8100788c6000 ffff8100788c7cb8
Call Trace: <ffffffff803c1b65>{_spin_lock_irqsave+14}
<ffffffff801350ce>{lock_timer_base+27}
<ffffffff880c4a0d>{:ib_mthca:mthca_table_put+65}
<ffffffff803c1c20>{_spin_unlock_irq+9}
<ffffffff803bfd5f>{wait_for_completion+179}
<ffffffff80127468>{default_wake_function+0}
<ffffffff80127468>{default_wake_function+0}
<ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
<ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
<ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564}
<ffffffff80140025>{autoremove_wake_function+0}
<ffffffff80180d4d>{do_ioctl+45} <ffffffff80181034>{vfs_ioctl+658}
<ffffffff8018948e>{mntput_no_expire+28}
<ffffffff80181083>{sys_ioctl+60}
<ffffffff8010aa52>{system_call+126}
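The process appears to be blocked in wait_for_completion() while
unregistering a MAD agent (ib_unregister_mad_agent calling
ib_cancel_rmpp_recvs), so it seems to be a lock or mutex problem, but I
don't know how to proceed from here.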
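To make it easier to see which loadable modules appear in a trace like the one above, the frames can be filtered mechanically. The following is only a minimal sketch, assuming the "<address>{:module:function+offset}" layout shown in the dmesg output above; the helper names parse_frames and module_frames are made up for illustration. Note that on kernels of this vintage some entries in a call trace may be stale values left on the stack, so the result is a hint rather than an exact call chain.

```python
import re

# Matches frames such as "<ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}"
# and "<ffffffff803bfd5f>{wait_for_completion+179}". Frames without a
# ":module:" prefix are treated as belonging to the core kernel.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?([\w.]+)\+(\d+)\}")


def parse_frames(trace_text):
    """Return (module, function, offset) tuples in the order they appear."""
    frames = []
    for _addr, module, func, offset in FRAME_RE.findall(trace_text):
        frames.append((module or "kernel", func, int(offset)))
    return frames


def module_frames(trace_text):
    """Return only the frames that belong to loadable modules."""
    return [frame for frame in parse_frames(trace_text) if frame[0] != "kernel"]
```

Feeding the dmesg text for one hung process to module_frames() leaves just the ib_mthca, ib_mad and ib_umad entries, which is usually the part worth quoting when asking the driver maintainers for help.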
Some things I have tried are:
- Connecting the two machines to a switch instead of back-to-back, to use
  the SM in the switch. The link to "koa" comes up, but the link to
  "jatoba" does not.
- Physically swapping the two HCAs between the two machines: the problem
  stays on the "jatoba" side.
- Turning on "debug_level" traces with "modprobe ib_mthca debug_level=1" on
  both machines. The traces seem to be identical on both, except for the
  actual PCI bus location and the memory addresses being mapped. No
  additional traces are generated when the hangs occur.
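Comparing the debug traces from the two machines by eye is error-prone, so one option is to mask out the parts that are expected to differ and diff the rest. This is a rough sketch under assumptions: it treats anything shaped like a PCI address (for example "06:00.0") or a hex number of 8 to 16 digits as machine-specific, and the names normalize and trace_diff are hypothetical helpers, not part of any OFED tool.

```python
import difflib
import re

# PCI addresses such as "06:00.0" or "0000:06:00.0".
PCI_ADDR_RE = re.compile(r"\b(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b", re.I)
# Hex values of 8 to 16 digits, with or without a "0x" prefix.
HEX_RE = re.compile(r"\b(?:0x)?[0-9a-f]{8,16}\b", re.I)


def normalize(line):
    """Replace PCI addresses and long hex values with fixed placeholders."""
    line = PCI_ADDR_RE.sub("<pci>", line)
    return HEX_RE.sub("<addr>", line)


def trace_diff(lines_a, lines_b):
    """Return only the lines that differ after normalization.

    Lines present only in lines_a are prefixed with "- ", lines present
    only in lines_b with "+ ".
    """
    a = [normalize(line) for line in lines_a]
    b = [normalize(line) for line in lines_b]
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]
```

With the ib_mthca lines from dmesg on each machine saved to two files, trace_diff(open("koa.log").read().splitlines(), open("jatoba.log").read().splitlines()) returning an empty list would confirm that the traces differ only in bus location and mapped addresses. The file names here are placeholders.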
The machines are both EM64T but are not identical. The "koa" side has the
HCA at PCI address "06:00.0", and the "jatoba" side has the HCA at
"03:00.0". The two machines are:
- koa (the working one) is an Intel SE7520BD2 motherboard (7520 chipset).
- jatoba (the bad one) is an Intel SE7525GP2 motherboard (7525 chipset).
Can anyone suggest what to try or look at next?
-Don Albert-