[openfabrics-ewg] How do I use "madeye" to diagnose a problem?
Don.Albert at Bull.com
Wed Jun 21 12:46:02 PDT 2006
Hal,
Hal Rosenstock <halr at voltaire.com> wrote on 06/21/2006 03:26:45 AM:
> You need to modprobe ib_madeye
>
> The madeye module has 5 module parameters:
>
> MODULE_PARM_DESC(smp, "Display all SMPs (default=1)");
> MODULE_PARM_DESC(gmp, "Display all GMPs (default=1)");
> MODULE_PARM_DESC(mgmt_class, "Display all MADs of specified class (default=0)");
> MODULE_PARM_DESC(attr_id, "Display add MADs of specified attribute ID (default=0)");
> MODULE_PARM_DESC(data, "Display data area of MADs (default=0)");
>
> Given your symptoms, the default settings should be fine except I would
> change the data one to 1 so the data is displayed. I doubt the node is
> even seeing the incoming SMPs for some unknown reason.
>
> So:
> /sbin/modprobe ib_madeye data=1
>
> We may narrow it down from there. You can see the output in
> /var/log/messages or with dmesg.
I installed the module on both of the EM64T systems and checked for
packets with dmesg.
On the "good" system ("koa"), it shows lots of packets where the SM is
trying to survey the network. On the "bad" system ("jatoba"), no
packets at all are captured. It seems that nothing is being received on
the jatoba side.
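
To put a number on the koa versus jatoba comparison, the dmesg check can be scripted. This is only a minimal sketch: it assumes madeye tags its kernel log lines with a "Madeye:" prefix, which should be checked against the actual output, and the function names are hypothetical helpers, not part of any OpenFabrics tool.

```python
import subprocess


def count_madeye_lines(log_text, marker="Madeye:"):
    """Count kernel log lines containing the marker.

    The default marker is an assumption about how madeye tags its
    output; pass a different string if the real log lines differ.
    """
    return sum(1 for line in log_text.splitlines() if marker in line)


def read_dmesg():
    """Return the kernel ring buffer as text (may require root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running count_madeye_lines(read_dmesg()) on each host after loading the module gives one figure per machine; a count of zero on one side while the SM is sweeping matches the "nothing received" symptom described above.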
> We may have gone through this (I don't remember) but can you try:
> 1. Is the firmware version on the node which is not working, the same as
> the one which does ?
The firmware version on both MT25204 cards is reported as 1.0.800 by the
ibstat command.
> 2. Are you sure the cable is plugged in properly ? Do you have another
> cable to try ?
We have tried switching cables, cabling the HCAs to a switch instead of
back-to-back, and this morning we tried swapping the HCA cards between
the two machines. The problem stays on the "jatoba" machine. There is
a difference in the hardware location on the two machines, in that the PCI
bus configuration is different. The "koa" side has the HCA on PCI
"06:00.0", and the "jatoba" side has the HCA on "03:00.0".
> 3. Can you reverse the SM and non SM roles and see how this behaves ?
I brought down the SM on the "koa" side and started it on the "jatoba"
side with script "/etc/init.d/opensmd start". The dmesg output shows
that madeye captured exactly 10 packets, then no more, even after many
minutes. I have attached a file with the captured packets to this email.
When I try to stop SM on jatoba with the "/etc/init.d/opensmd stop"
script, the script hangs (keeps printing out dots and never terminates)
until I break out of it. The OpenSM process stays hung in execution.
Doing a "cat /proc/<pid>/wchan" shows the OpenSM process waiting in
"ib_unregister_mad_agent".
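
The wchan check can also be scripted so it is easy to repeat for any hung process. This is a rough sketch with hypothetical helper names; it assumes a kernel that exposes /proc/<pid>/comm (newer kernels do, older ones may not), and the process name "opensm" in the usage note is an assumption.

```python
import os


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel symbol the process is sleeping in."""
    with open(os.path.join(proc_root, str(pid), "wchan")) as f:
        return f.read().strip()


def pids_by_name(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals name."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return sorted(pids)
```

For example, calling read_wchan(pid) for each pid in pids_by_name("opensm") should report the same wait channel that the manual cat shows.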
If I try to do other tests on "jatoba" like "ibdiagnet" they also hang.
The only thing in /var/log/osm.log is:
Jun 21 08:48:47 345379 [66A1FCA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 08:48:47 351128 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 09:45:53 464987 [18E18CA0] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 09:45:53 472677 [0000] -> OpenSM Rev:openib-1.2.1 OpenIB svn Exported revision
Jun 21 09:45:53 489398 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Jun 21 09:45:53 489461 [18E18CA0] -> osm_report_notice: Reporting Generic Notice type:3 num:66 from LID:0x0000 GID:0xfe80000000000000,0x0000000000000000
Jun 21 09:45:53 491919 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
Jun 21 09:45:53 493583 [18E18CA0] -> osm_vendor_bind: Binding to port 0x2c90200216dc5
Jun 21 12:21:34 420066 [0000] -> Exiting SM
Thanks for taking a look at this.
-Don Albert-