[openfabrics-ewg] How do I use "madeye" to diagnose a problem?
Don.Albert at Bull.com
Thu Jun 22 13:48:41 PDT 2006
Hal,
> Don,
>
> On Wed, 2006-06-21 at 15:46, Don.Albert at Bull.com wrote:
> >
> > I installed the madeye module on both of the EM64T systems and checked
> > for packets with dmesg.
> >
> > On the "good" system ("koa"), it shows lots of packets where the SM
> > is trying to survey the network. On the "bad" system ("jatoba"), no
> > packets at all are captured. It seems that nothing is being received
> > on the jatoba side.
> >
> > > We may have gone through this (I don't remember) but can you try:
> > > 1. Is the firmware version on the node which is not working, the
> > > same as the one which does ?
> >
> > The firmware version on both MT25204 cards is reported as 1.0.800 by
> > the ibstat command.
> >
> > > 2. Are you sure the cable is plugged in properly ? Do you have
> > > another cable to try ?
> >
> > We have tried switching cables, cabling the HCAs to a switch instead
> > of back-to-back, and this morning we tried swapping the HCA cards
> > between the two machines. The problem stays on the "jatoba"
> > machine. There is a difference in the hardware location on the two
> > machines, in that the PCI bus configuration is different.
>
> Other than that are the 2 machines identical ? Does that include the
> BIOS version ?
I don't know the BIOS versions, but the machines are not identical:
koa (the working one) has an Intel SE7520BD2 motherboard (7520 chipset).
jatoba (the bad one) has an Intel SE7525GP2 motherboard (7525 chipset).
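(I can pull the BIOS strings with dmidecode the next time I am on the
machines; assuming the installed dmidecode is new enough to take -s, it
would be something like:

    # BIOS vendor/version/date, for comparing koa and jatoba
    dmidecode -s bios-vendor
    dmidecode -s bios-version
    dmidecode -s bios-release-date
)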
>
> > The "koa" side has the HCA on PCI "06:00.0", and the "jatoba" side
> > has the HCA on "03:00.0".
>
> The problem is down at the driver level or lower. Can you enable mthca
> debug_level module parameter (the driver would need to have been built
> with CONFIG_INFINIBAND_MTHCA_DEBUG enabled) and see if anything
> interesting shows up ?
>
I finally got a version of mthca built that allows me to turn on the
debug_level parameter. I loaded it on both machines and looked at the
dmesg output. The traces are virtually identical between the two, except
for the PCI address and the memory addresses of the regions being mapped.
When I run a test that hangs, there are no additional traces from
mthca.
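(For anyone following along: with CONFIG_INFINIBAND_MTHCA_DEBUG compiled
in, the tracing is controlled by the ib_mthca "debug_level" module
parameter, roughly:

    # load the driver with tracing enabled
    modprobe ib_mthca debug_level=1
    # or, if the module is already loaded and exposes the parameter as writable
    echo 1 > /sys/module/ib_mthca/parameters/debug_level
    # the traces land in the kernel log
    dmesg | grep mthca
)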
>
> > When I try to stop SM on jatoba with the "/etc/init.d/opensmd stop"
> > script, the script hangs (keeps printing out dots and never
> > terminates) until I break out of it. The OpenSM process stays hung
> > in execution. Doing a "cat /proc/<pid>/wchan" shows the OpenSM
> > process waiting in "ib_unregister_mad_agent".
>
> Sounds like a lock or mutex is being held in this case which is blocking
> the unregister. Not sure what it could be but this is a separate issue.
>
When I run "ibdiagnet" I get the following terminal output, and then the
terminal stops responding:
[jatoba] (ib) mthca> ibdiagnet
Loading IBDIAGNET from: /opt/ofed/lib64/ibdiagnet1.0
Loading IBDM from: /opt/ofed/lib64/ibdm1.0
-W- Topology file is not specified.
Reports regarding cluster links will use direct routes.
-I- Discovering the subnet ... 1 nodes (0 Switches & 1 CA-s) discovered.
-E- Discovery at local link failed: smNodeInfoMad getByDr 1 - failed 4 consecutive times.
Exiting.
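(For reference, the firmware version quoted earlier came from ibstat; the
basic local port checks on each node would be just:

    # local HCA and port status
    ibstat
    ibv_devinfo | grep -iE 'fw_ver|state|lid'
)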
The "ps axjf" command (from another terminal) shows the following process
tree:
4456 5025 5025 5025 ?           -1 Ss       0   0:00  \_ in.telnetd: albertpc.usnetwork.lan
5025 5026 5026 5026 ?           -1 Ss       0   0:00  |   \_ login -- ib
5026 5027 5027 5027 pts/0     5489 Ss     500   0:00  |       \_ -bash
5027 5095 5095 5027 pts/0     5489 S        0   0:00  |           \_ su
5095 5097 5097 5027 pts/0     5489 S        0   0:00  |               \_ bash
5097 5489 5489 5027 pts/0     5489 D+       0   0:00  |                   \_ ibis
5489 5522 5489 5027 pts/0     5489 S+       0   0:00  |                       \_ ibis
5522 5523 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
5522 5524 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
5522 5525 5489 5027 pts/0     5489 S+       0   0:00  |                           \_ ibis
The PID 5489 is in an "uninterruptible sleep" state. Doing a "cat
/proc/5489/wchan" shows "ib_unregister_mad_agent". I also did "echo t
> /proc/sysrq-trigger" to get stack traces. Here is the entry for PID
5489:
ibis          D 0000000000000003     0  5489   5097          5522 (NOTLB)
 ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640
 ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418
 ffff8100788c6000 ffff8100788c7cb8
Call Trace: <ffffffff803c1b65>{_spin_lock_irqsave+14}
       <ffffffff801350ce>{lock_timer_base+27}
       <ffffffff880c4a0d>{:ib_mthca:mthca_table_put+65}
       <ffffffff803c1c20>{_spin_unlock_irq+9}
       <ffffffff803bfd5f>{wait_for_completion+179}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff80127468>{default_wake_function+0}
       <ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
       <ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
       <ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564}
       <ffffffff80140025>{autoremove_wake_function+0}
       <ffffffff80180d4d>{do_ioctl+45}
       <ffffffff80181034>{vfs_ioctl+658}
       <ffffffff8018948e>{mntput_no_expire+28}
       <ffffffff80181083>{sys_ioctl+60}
       <ffffffff8010aa52>{system_call+126}
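The other ibis threads from the ps output can be checked the same way; a
quick loop over their PIDs prints each one's wchan:

    # wchan for the whole ibis group (PIDs taken from the ps output above)
    for pid in 5489 5522 5523 5524 5525; do
        printf '%s: ' $pid; cat /proc/$pid/wchan; echo
    done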
I think I agree that this is a lock or mutex problem. How can I
determine which lock or mutex is being held, resulting in the hang?
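One crude thing I can try is to search the rest of the SysRq 't' dump for
any other task sitting in the ib_mad or ib_umad paths, on the theory that
whatever is holding things up should show up there:

    # dump every task's stack, then pull out anything in the same modules
    echo t > /proc/sysrq-trigger
    dmesg | grep -B 2 -A 20 'ib_mad\|ib_umad'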
-Don Albert-