[openfabrics-ewg] How do I use "madeye" to diagnose a problem?

Hal Rosenstock halr at voltaire.com
Wed Jun 21 03:26:45 PDT 2006


Hi Don,

On Wed, 2006-06-21 at 00:12, Don.Albert at Bull.com wrote:
> Hal,
> 
> > 
> > On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote:
> > > > What next, coach?
> > > 
> > > Can you turn on madeye on the remote node and see what packets are
> > > received and sent ? Let me know if you need help with that. I think
> > > you said you were running OFED, right ?
> >
> 
> The above text was in an earlier email where you suggested using
> "madeye" to try to dump MAD packets to see what was being received and
> sent on a node where the link goes into an "initializing" state but
> will not go "active".  To summarize the problem:
> 
> For the past several weeks, off and on, I have been trying to get a
> small two-node testbed system to run with the OFED release (first RC5,
> now the 1.0 release). These nodes are EM64T machines running RHEL4 U3
> with the 2.6.16 kernel. The HCAs are Mellanox MT25204, 4x DDR,
> connected back to back.
> 
> This back to back setup was working originally with a backported
> 2.6.11-34 kernel and I believe it was revision 6500 from the OpenIB
> svn trunk at that time.  The problems started when I tried to move to
> the OFED release, with the 2.6.16 kernel.   One machine comes up and
> appears to work fine,  but the other will not bring the link up.   The
> one that is working is running the OpenSM Subnet Manager,  and when it
> tries to probe the other system, it gets no response.
> 
> We did try cabling the two systems through a switch to have the SM in
> the switch try to bring up the links, and the "good" system's link
> comes up but the other does not.
> 
> Returning to the suggestion to use madeye:   I located the madeye
> source on the OpenIB svn repository, and I was able to build a kernel
> module,  but I have no information on what the module does, or how to
> use it to capture the MAD packets on the machine with the problem.  
> Can you provide or point me to a description of how to use madeye?

You need to load the module with modprobe ib_madeye.

The madeye module has 5 module parameters:

MODULE_PARM_DESC(smp, "Display all SMPs (default=1)");
MODULE_PARM_DESC(gmp, "Display all GMPs (default=1)");
MODULE_PARM_DESC(mgmt_class, "Display all MADs of specified class (default=0)");
MODULE_PARM_DESC(attr_id, "Display add MADs of specified attribute ID (default=0)");
MODULE_PARM_DESC(data, "Display data area of MADs (default=0)");

Given your symptoms, the default settings should be fine, except that I
would set data to 1 so the data area of each MAD is displayed. I suspect
the node is not even seeing the incoming SMPs, for some unknown reason.

So:
/sbin/modprobe ib_madeye data=1

We may narrow it down from there. You can see the output in
/var/log/messages or with dmesg.
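
Put together, a capture session could look roughly like the sketch below.
It is only an illustration: the run and capture_mads helpers, the DRY_RUN
switch, and the 30 second wait are my own assumptions, not part of madeye.
By default it only prints the commands; set DRY_RUN=0 and run it as root
on the problem node to execute them.

```shell
#!/bin/sh
# Sketch: capture MADs with ib_madeye on the node whose link will not go active.
# DRY_RUN=1 (the default here) only prints each command instead of running it.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

capture_mads() {
    # smp and gmp keep their defaults (1); data=1 also dumps the MAD data area.
    run /sbin/modprobe ib_madeye data=1
    # Give the SM on the other node time to probe (30 s is an arbitrary guess).
    run sleep 30
    # The output goes to the kernel log, so dmesg (or /var/log/messages) shows it.
    run dmesg
    # Unload the module afterwards so it stops logging.
    run /sbin/modprobe -r ib_madeye
}

capture_mads
```

If the log turns out to be too noisy, the smp, gmp, mgmt_class and attr_id
parameters listed above can be passed on the modprobe line in the same way
as data.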

We may have gone through this (I don't remember), but can you check:
1. Is the firmware version on the node which is not working the same as
on the one which is?
2. Are you sure the cable is plugged in properly? Do you have another
cable to try?
3. Can you reverse the SM and non-SM roles and see how this behaves?
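
For item 1, one way to compare the two nodes is to pull the firmware
version and port state out of ibv_devinfo on each of them. The sketch below
assumes ibv_devinfo (from libibverbs) is installed and that its output has
"fw_ver:" and "state:" lines; the fw_ver and port_state helper names are
made up for this example.

```shell
#!/bin/sh
# Sketch: extract firmware version and port state from ibv_devinfo output.
# Both helpers read ibv_devinfo text on stdin (assumed layout: "fw_ver: X",
# "state: PORT_xxx (n)") and print the value from the first matching line.

fw_ver() {
    awk '$1 == "fw_ver:" { print $2; exit }'
}

port_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Only query the HCA if the tool is actually present on this machine.
if command -v ibv_devinfo >/dev/null 2>&1; then
    info=$(ibv_devinfo 2>/dev/null)
    echo "firmware:   $(printf '%s\n' "$info" | fw_ver)"
    echo "port state: $(printf '%s\n' "$info" | port_state)"
fi
```

Run it on both machines and compare the firmware lines.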

-- Hal

>         -Don Albert-




