[openfabrics-ewg] How do I use "madeye" to diagnose a problem?
Don.Albert at Bull.com
Thu Jun 22 13:48:41 PDT 2006
Hal,
> Don,
>
> On Wed, 2006-06-21 at 15:46, Don.Albert at Bull.com wrote:
> >
> > I installed the madeye module on both of the EM64T systems and checked
> > for packets with dmesg.
> >
> > On the "good" system ("koa"), it shows lots of packets where the SM
> > is trying to survey the network. On the "bad" system ("jatoba"), no
> > packets at all are captured. It seems that nothing is being received
> > on the jatoba side.
> >
> > > We may have gone through this (I don't remember) but can you try:
> > > 1. Is the firmware version on the node which is not working, the
> > > same as the one which does ?
> >
> > The firmware version on both MT25204 cards is reported as 1.0.800 by
> > the ibstat command.
> >
> > > 2. Are you sure the cable is plugged in properly ? Do you have
> > > another cable to try ?
> >
> > We have tried switching cables, cabling the HCAs to a switch instead
> > of back-to-back, and this morning we tried swapping the HCA cards
> > between the two machines. The problem stays on the "jatoba"
> > machine. There is a difference in the hardware location on the two
> > machines, in that the PCI bus configuration is different.
>
> Other than that are the 2 machines identical ? Does that include the
> BIOS version ?
I don't know the BIOS versions, but the machines are not identical:
koa (the working one) has an Intel SE7520BD2 motherboard (7520 chipset).
jatoba (the bad one) has an Intel SE7525GP2 motherboard (7525 chipset).
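(I can pull the BIOS strings with dmidecode the next time I am on the
machines; assuming the installed dmidecode is new enough to take -s, it
would be something like:

    # BIOS vendor/version/date, for comparing koa and jatoba
    dmidecode -s bios-vendor
    dmidecode -s bios-version
    dmidecode -s bios-release-date
)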
>
> > The "koa" side has the HCA on PCI "06:00.0", and the "jatoba" side
> > has the HCA on "03:00.0".
>
> The problem is down at the driver level or lower. Can you enable mthca
> debug_level module parameter (the driver would need to have been built
> with CONFIG_INFINIBAND_MTHCA_DEBUG enabled) and see if anything
> interesting shows up ?
>
I finally got a version of mthca built that allows me to turn on the
debug_level parameter. I loaded it on both machines and looked at the
dmesg output. The traces are virtually identical between the two, except
for the PCI address and the memory addresses of the regions being mapped.
When I run a test that hangs, there are no additional traces from
mthca.
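(For anyone following along: with CONFIG_INFINIBAND_MTHCA_DEBUG compiled
in, the tracing is controlled by the ib_mthca "debug_level" module
parameter, roughly:

    # load the driver with tracing enabled
    modprobe ib_mthca debug_level=1
    # or, if the module is already loaded and exposes the parameter as writable
    echo 1 > /sys/module/ib_mthca/parameters/debug_level
    # the traces land in the kernel log
    dmesg | grep mthca
)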
>
> > When I try to stop SM on jatoba with the "/etc/init.d/opensmd stop"
> > script, the script hangs (keeps printing out dots and never
> > terminates) until I break out of it. The OpenSM process stays hung
> > in execution. Doing a "cat /proc/<pid>/wchan" shows the OpenSM
> > process waiting in "ib_unregister_mad_agent".
>
> Sounds like a lock or mutex is being held in this case which is blocking
> the unregister. Not sure what it could be but this is a separate issue.
>
When I run "ibdiagnet" I get the following terminal output, and then the
terminal stops responding:
[jatoba] (ib) mthca> ibdiagnet
Loading IBDIAGNET from: /opt/ofed/lib64/ibdiagnet1.0
Loading IBDM from: /opt/ofed/lib64/ibdm1.0
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
-I- Discovering the subnet ... 1 nodes (0 Switches & 1 CA-s) discovered.
-E- Discovery at local link failed: smNodeInfoMad getByDr 1 - failed 4 consecutive times.
Exiting.
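(For reference, the firmware version quoted earlier came from ibstat; the
basic local port checks on each node would be just:

    # local HCA and port status
    ibstat
    ibv_devinfo | grep -iE 'fw_ver|state|lid'
)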
The "ps axjf" command (from another terminal) shows the following process
tree:
4456 5025 5025 5025 ?           -1 Ss       0   0:00  \_ in.telnetd: albertpc.usnetwork.lan
5025 5026 5026 5026 ?           -1 Ss       0   0:00  |   \_ login -- ib
5026 5027 5027 5027 pts/0     5489 Ss     500   0:00  |       \_ -bash
5027 5095 5095 5027 pts/0     5489 S        0   0:00  |           \_ su
5095 5097 5097 5027 pts/0     5489 S        0   0:00  |               \_ bash
5097 5489 5489 5027 pts/0     5489 D+       0   0:00  |                   \_ ibis
5489 5522 5489 5027 pts/0     5489 S+       0   0:00  |                       \_ ibis
5522 5523 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
5522 5524 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
5522 5525 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
The PID 5489 is in an "uninterruptible sleep" state. Doing a "cat
/proc/5489/wchan" shows "ib_unregister_mad_agent". I also did "echo t
> /proc/sysrq-trigger" to get stack traces. Here is the entry for PID
5489:
ibis          D 0000000000000003     0  5489   5097          5522 (NOTLB)
 ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640
 ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418
 ffff8100788c6000 ffff8100788c7cb8
Call Trace: <ffffffff803c1b65>{_spin_lock_irqsave+14}
       <ffffffff801350ce>{lock_timer_base+27}
       <ffffffff880c4a0d>{:ib_mthca:mthca_table_put+65}
       <ffffffff803c1c20>{_spin_unlock_irq+9}
       <ffffffff803bfd5f>{wait_for_completion+179}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
       <ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
       <ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564}
       <ffffffff80140025>{autoremove_wake_function+0}
       <ffffffff80180d4d>{do_ioctl+45}
       <ffffffff80181034>{vfs_ioctl+658}
       <ffffffff8018948e>{mntput_no_expire+28}
       <ffffffff80181083>{sys_ioctl+60}
       <ffffffff8010aa52>{system_call+126}
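The other ibis threads from the ps output can be checked the same way; a
quick loop over their PIDs prints each one's wchan:

    # wchan for the whole ibis group (PIDs taken from the ps output above)
    for pid in 5489 5522 5523 5524 5525; do
        printf '%s: ' $pid; cat /proc/$pid/wchan; echo
    done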
I think I agree that this is a lock or mutex problem. How can I
determine which lock or mutex is being held, resulting in the hang?
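One crude thing I can try is to search the rest of the SysRq 't' dump for
any other task sitting in the ib_mad or ib_umad paths, on the theory that
whatever is holding things up should show up there:

    # dump every task's stack, then pull out anything in the same modules
    echo t > /proc/sysrq-trigger
    dmesg | grep -B 2 -A 20 'ib_mad\|ib_umad'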
-Don Albert-