[openfabrics-ewg] How do I use "madeye" to diagnose a problem?

Hal Rosenstock halr at voltaire.com
Wed Jun 21 13:37:14 PDT 2006


Don,

On Wed, 2006-06-21 at 15:46, Don.Albert at Bull.com wrote:
> Hal,
> 
> Hal Rosenstock <halr at voltaire.com> wrote on 06/21/2006 03:26:45 AM:
> 
> > You need to modprobe ib_madeye
> > 
> > The madeye module has 5 module parameters:
> > 
> > MODULE_PARM_DESC(smp, "Display all SMPs (default=1)");
> > MODULE_PARM_DESC(gmp, "Display all GMPs (default=1)");
> > MODULE_PARM_DESC(mgmt_class, "Display all MADs of specified class (default=0)");
> > MODULE_PARM_DESC(attr_id, "Display add MADs of specified attribute ID (default=0)");
> > MODULE_PARM_DESC(data, "Display data area of MADs (default=0)");
> > 
> > Given your symptoms, the default settings should be fine except I would
> > change the data one to 1 so the data is displayed. I doubt the node is
> > even seeing the incoming SMPs for some unknown reason.
> > 
> > So:
> > /sbin/modprobe ib_madeye data=1
> > 
> > We may narrow it down from there. You can see the output in
> > /var/log/messages or with dmesg.
> 
> I installed the module on both of the EM64T systems and checked for
> packets with dmesg.
> 
> On the "good" system ("koa"),  it shows lots of packets where the SM
> is trying to survey the network.   On the "bad" system ("jatoba"),  no
> packets at all are captured.  It seems that nothing is being received
> on the jatoba side.
> 
> > We may have gone through this (I don't remember) but can you try:
> > 1. Is the firmware version on the node which is not working, the same as
> > the one which does ?
> 
> The firmware version on both MT25204 cards is reported as 1.0.800 by
> the ibstat command.
> 
> > 2. Are you sure the cable is plugged in properly ? Do you have another
> > cable to try ?
> 
> We have tried switching cables,  cabling the HCAs to a switch instead
> of back-to-back,  and this morning we tried swapping the HCA cards
> between the two machines.  The problem stays on the "jatoba"
> machine.    There is a difference in the hardware location on the two
> machines, in that the PCI bus configuration is different.

Other than that are the 2 machines identical ? Does that include the
BIOS version ?

>    The "koa" side has the HCA on PCI "06:00.0",  and the "jatoba" side
> has the HCA on "03:00.0".

The problem is down at the driver level or lower. Can you enable mthca
debug_level module parameter (the driver would need to have been built
with CONFIG_INFINIBAND_MTHCA_DEBUG enabled) and see if anything
interesting shows up ?
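
In case it helps, here is a rough sketch of turning the debug level on at runtime. It assumes the driver is loaded as ib_mthca and that a debug build exposes debug_level under /sys/module; the helper name set_mthca_debug and its optional path argument are made up for illustration, so adjust to whatever your system actually shows.

```shell
# Sketch only. Assumptions: the module is named ib_mthca and, when built with
# CONFIG_INFINIBAND_MTHCA_DEBUG, exposes a writable debug_level in sysfs.
# set_mthca_debug is a hypothetical helper, not part of any package.
set_mthca_debug() {
    level="$1"
    # The second argument overrides the sysfs path (useful for a dry run).
    param="${2:-/sys/module/ib_mthca/parameters/debug_level}"
    if [ ! -w "$param" ]; then
        echo "cannot write $param (module not loaded, or built without debug?)" >&2
        return 1
    fi
    echo "$level" > "$param"
}
```

Run as root, "set_mthca_debug 1" and then check dmesg again. If the parameter is not there, reloading the driver with "/sbin/modprobe ib_mthca debug_level=1" is the other way to set it, again assuming a debug build.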

> > 3. Can you reverse the SM and non SM roles and see how this behaves ?
> 
> I brought down the SM on the "koa" side and started it on the "jatoba"
> side with script "/etc/init.d/opensmd start".   The dmesg output shows
> that madeye captured exactly 10 packets,  then no more,  even after
> many minutes.
>    I have attached a file with the captured packets to this email.

I didn't see any attachment but I don't think it matters given what you
have reported.

> When I try to stop SM on jatoba with the "/etc/init.d/opensmd stop"
> script,  the script hangs (keeps printing out dots and never
> terminates) until I break out of it.   The OpenSM process stays hung
> in execution.   Doing a "cat /proc/<pid>/wchan"  shows the OpenSM
> process waiting in "ib_unregister_mad_agent".

Sounds like a lock or mutex is being held in this case which is blocking
the unregister. Not sure what it could be but this is a separate issue.
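
If you want to confirm later that it is still stuck in the same place, something along these lines repeats the wchan check you did. The helper name show_wchan and its optional proc-root argument are made up for illustration; only /proc/<pid>/wchan itself comes from your report.

```shell
# Sketch only. show_wchan is a hypothetical helper; the optional second
# argument (an alternate proc root) is there only so it can be tried
# without a live process.
show_wchan() {
    pid="$1"
    procroot="${2:-/proc}"
    if [ ! -r "$procroot/$pid/wchan" ]; then
        echo "no wchan for pid $pid" >&2
        return 1
    fi
    # wchan has no trailing newline, so add one for readable output.
    printf '%s\n' "$(cat "$procroot/$pid/wchan")"
}
```

For example, "show_wchan <pid>" with the PID of the hung OpenSM process should print the kernel function it is sleeping in.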

> If I try to do other tests on "jatoba" like "ibdiagnet"  they also
> hang.

It won't work, since SMPs are not being received and ibdiagnet also relies
on SMPs. I'm not sure it shouldn't hang though.

-- Hal

>    The only thing in /var/log/osm.log is:
> 
> Jun 21 08:48:47 345379 [66A1FCA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
> Jun 21 08:48:47 351128 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
>
> Jun 21 09:45:53 464987 [18E18CA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
> Jun 21 09:45:53 472677 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
>
> Jun 21 09:45:53 489398 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Jun 21 09:45:53 489461 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
> Jun 21 09:45:53 491919 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
> Jun 21 09:45:53 493583 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
> Jun 21 12:21:34 420066 [0000] -> Exiting SM
> 
> Thanks for taking a look at this.
> 
>         -Don Albert-




