[openfabrics-ewg] How do I use "madeye" to diagnose a problem?

Don.Albert at Bull.com
Wed Jun 21 12:46:02 PDT 2006


Hal,

Hal Rosenstock <halr at voltaire.com> wrote on 06/21/2006 03:26:45 AM:

> You need to modprobe ib_madeye
> 
> The madeye module has 5 module parameters:
> 
> MODULE_PARM_DESC(smp, "Display all SMPs (default=1)");
> MODULE_PARM_DESC(gmp, "Display all GMPs (default=1)");
> MODULE_PARM_DESC(mgmt_class, "Display all MADs of specified class (default=0)");
> MODULE_PARM_DESC(attr_id, "Display all MADs of specified attribute ID (default=0)");
> MODULE_PARM_DESC(data, "Display data area of MADs (default=0)");
> 
> Given your symptoms, the default settings should be fine except I would
> change the data one to 1 so the data is displayed. I doubt the node is
> even seeing the incoming SMPs for some unknown reason.
> 
> So:
> /sbin/modprobe ib_madeye data=1
> 
> We may narrow it down from there. You can see the output in
> /var/log/messages or with dmesg.

I installed the module on both of the EM64T systems and checked for 
packets with dmesg.

On the "good" system ("koa"), it shows many packets from the SM trying to 
survey the network. On the "bad" system ("jatoba"), no packets are 
captured at all. It seems that nothing is being received on the jatoba 
side.
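
For anyone repeating this comparison, a minimal sketch of how the check could be made less error-prone than eyeballing dmesg. The helper name count_tagged is hypothetical, and the search string "madeye" is an assumption about what the module prints; check a known-good capture first and adjust the tag if needed.

```shell
#!/bin/sh
# Count the lines on stdin that mention a given tag, case-insensitively.
# ASSUMPTION: madeye's kernel messages contain the word "madeye";
# confirm this against real output on a working node before relying on it.
count_tagged() {
    # grep -c exits non-zero when nothing matches, but still prints 0;
    # "|| true" keeps the function's exit status at 0 in that case.
    grep -ci -- "$1" || true
}

# Example (run on each node after loading the module):
#   dmesg | count_tagged madeye
```

Under that assumption, a non-zero count would be expected on koa and 0 on jatoba, matching what is described above.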

> We may have gone through this (I don't remember) but can you try:
> 1. Is the firmware version on the node which is not working, the same as
> the one which does ?

The firmware version on both MT25204 cards is reported as 1.0.800 by the 
ibstat command.

> 2. Are you sure the cable is plugged in properly ? Do you have another
> cable to try ?

We have tried switching cables, cabling the HCAs to a switch instead of 
back-to-back, and this morning we tried swapping the HCA cards between 
the two machines. The problem stays on the "jatoba" machine. The two 
machines do differ in the hardware location of the HCA, because their PCI 
bus configurations are different: the "koa" side has the HCA at PCI 
"06:00.0", and the "jatoba" side has it at "03:00.0".

> 3. Can you reverse the SM and non SM roles and see how this behaves ?

I brought down the SM on the "koa" side and started it on the "jatoba" 
side with the script "/etc/init.d/opensmd start". The dmesg output shows 
that madeye captured exactly 10 packets, then no more, even after many 
minutes. I have attached a file with the captured packets to this email.

When I try to stop the SM on jatoba with the "/etc/init.d/opensmd stop" 
script, the script hangs (it keeps printing dots and never terminates) 
until I break out of it. The OpenSM process stays hung in execution. 
Doing a "cat /proc/<pid>/wchan" shows the OpenSM process waiting in 
"ib_unregister_mad_agent".
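
The wchan check above can be wrapped in a small helper so the PID does not have to be looked up by hand. This is only a sketch: the function name show_wchan and the PROC_ROOT override are hypothetical, the process name "opensm" is an assumption, and it relies on /proc/<pid>/comm, which only newer kernels provide.

```shell
#!/bin/sh
# Print "<pid> <wait channel>" for every process whose name matches $1.
# PROC_ROOT is a hypothetical override so the function can be tried
# against a fake directory tree; it defaults to the real /proc.
show_wchan() {
    name=$1
    proc_root=${PROC_ROOT:-/proc}
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries without a readable comm file (or an unmatched glob).
        [ -r "$dir/comm" ] || continue
        if [ "$(cat "$dir/comm")" = "$name" ]; then
            # wchan names the kernel function the process is sleeping in.
            printf '%s %s\n' "${dir##*/}" "$(cat "$dir/wchan" 2>/dev/null)"
        fi
    done
}

# Example (ASSUMPTION: the SM process is named "opensm"):
#   show_wchan opensm
```

Running it a few times in a row would show whether the process ever leaves the same wait channel.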

If I try to run other tests on "jatoba", such as "ibdiagnet", they also 
hang. The only thing in /var/log/osm.log is:

Jun 21 08:48:47 345379 [66A1FCA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 08:48:47 351128 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision

Jun 21 09:45:53 464987 [18E18CA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 09:45:53 472677 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision

Jun 21 09:45:53 489398 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Jun 21 09:45:53 489461 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Jun 21 09:45:53 491919 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
Jun 21 09:45:53 493583 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
Jun 21 12:21:34 420066 [0000] -> Exiting SM

Thanks for taking a look at this.

        -Don Albert-