[ofa-general] Re: Issues with combined routing in smpquery

Ira Weiny weiny2 at llnl.gov
Wed Apr 29 14:53:55 PDT 2009


I have traced this down a bit more.

The drslid and drdlid are being encoded into the MAD in reversed order!

This has happened somewhere between version 1.5.0 and 1.5.1.

Applying the following patch allows combined routing to work, but I don't know
where the real bug is.

diff --git a/libibmad/src/mad.c b/libibmad/src/mad.c
index 3f04da0..6f34e02 100644
--- a/libibmad/src/mad.c
+++ b/libibmad/src/mad.c
@@ -101,9 +101,9 @@ void *mad_encode(void *buf, ib_rpc_t * rpc, ib_dr_path_t * drpath, void *data)
        if (rpc->mgtclass == IB_SMI_DIRECT_CLASS) {
                /* word 9 */
                mad_set_field(buf, 0, IB_DRSMP_DRDLID_F,
-                             drpath->drdlid ? drpath->drdlid : 0xffff);
-               mad_set_field(buf, 0, IB_DRSMP_DRSLID_F,
                              drpath->drslid ? drpath->drslid : 0xffff);
+               mad_set_field(buf, 0, IB_DRSMP_DRSLID_F,
+                             drpath->drdlid ? drpath->drdlid : 0xffff);
 
                /* bytes 128 - 256 - by default should be zero due to memset */
                if (is_resp)

I don't see any differences between 1.5.0 and 1.5.1 which would cause this.
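
To make the symptom concrete, here is a minimal, self-contained sketch of what "encoded reversed" means. It is not the libibmad code: the names (dr_path, set_field16, encode_dr_lids, and so on) are hypothetical, and the layout (DrSLID in the first two bytes of the word, DrDLID in the last two, big-endian) is an assumption made for illustration. The only parts taken from the patch above are the two fields and the fallback to 0xffff when a LID is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for this sketch; these are NOT libibmad definitions. */
#define PERMISSIVE_LID 0xffffu

enum dr_field { DR_SLID_FIELD, DR_DLID_FIELD };

struct dr_path {
    uint16_t drslid; /* directed-route source LID */
    uint16_t drdlid; /* directed-route destination LID */
};

/* Assumed layout: SLID in bytes 0-1 of the word, DLID in bytes 2-3. */
unsigned field_offset(enum dr_field f)
{
    return f == DR_SLID_FIELD ? 0u : 2u;
}

/* Store a 16-bit value big-endian at the offset of the given field. */
void set_field16(uint8_t *word, enum dr_field f, uint16_t val)
{
    unsigned off;

    assert(word != NULL);
    off = field_offset(f);
    word[off] = (uint8_t)(val >> 8);
    word[off + 1] = (uint8_t)(val & 0xff);
}

/* Read back the 16-bit big-endian value of the given field. */
uint16_t get_field16(const uint8_t *word, enum dr_field f)
{
    unsigned off;

    assert(word != NULL);
    off = field_offset(f);
    return (uint16_t)((word[off] << 8) | word[off + 1]);
}

/* Same fallback as in the patch: a zero LID is sent as 0xffff. */
uint16_t lid_or_permissive(uint16_t lid)
{
    return lid ? lid : (uint16_t)PERMISSIVE_LID;
}

/* Intended encoding: each LID lands in its own field. */
void encode_dr_lids(uint8_t *word, const struct dr_path *p)
{
    set_field16(word, DR_SLID_FIELD, lid_or_permissive(p->drslid));
    set_field16(word, DR_DLID_FIELD, lid_or_permissive(p->drdlid));
}

/* The symptom: the two values end up in each other's field. */
void encode_dr_lids_swapped(uint8_t *word, const struct dr_path *p)
{
    set_field16(word, DR_SLID_FIELD, lid_or_permissive(p->drdlid));
    set_field16(word, DR_DLID_FIELD, lid_or_permissive(p->drslid));
}
```

With drslid = 148 and drdlid = 0 (the "slid 148; dlid 65535" case from the failing query), encode_dr_lids() leaves 148 in the SLID field and 0xffff in the DLID field, while encode_dr_lids_swapped() leaves 0xffff in the SLID field and 148 in the DLID field. The sketch only illustrates the effect; it says nothing about where in the real code the swap is introduced.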

Any ideas?
Ira

On Tue, 28 Apr 2009 20:55:25 -0700
Ira Weiny <weiny2 at llnl.gov> wrote:

> On Tue, 28 Apr 2009 20:27:36 -0700
> Ira Weiny <weiny2 at llnl.gov> wrote:
> 
> > Sasha, Hal,
> > 
> > I have some hardware on which the following query does not work.
> > 
> >    18:40:54 > ./smpquery -c nodeinfo 243 0,1
> >    ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1)
> >    ./smpquery: iberror: failed: operation nodeinfo: node info query failed
> > 
> > from the node I am running on.
> > 
> >    20:08:46 > ibstat
> >    CA 'mlx4_0'
> >         CA type: MT25418
> >         Number of ports: 2
> >         Firmware version: 2.6.0
> >         Hardware version: a0
> >         Node GUID: 0x0002c9020025feb4
> >         System image GUID: 0x0002c9020025feb7
> >         Port 1:
> >                   State: Active
> >                   Physical state: LinkUp
> >                   Rate: 10
> >                   Base lid: 148
> >                   LMC: 2
> >                   SM lid: 148
> >                   Capability mask: 0x0251086a
> >                   Port GUID: 0x0002c9020025feb5
> >    [snip]
> > 
> >    19:12:10 > hostname
> >    hype137
> > 
> > 
> > A query on the LID alone returns this.
> > 
> >    18:41:20 > ./smpquery nodeinfo 243 
> >    # Node info: Lid 243
> >    [snip]
> >    NodeType:........................Switch
> >    NumPorts:........................24
> >    SystemGuid:......................0x0008f10400400e69
> >    Guid:............................0x0008f10400400e69
> >    PortGuid:........................0x0008f10400400e69
> >    [snip]
> > 
> > And iblinkinfo is.
> > 
> >    18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69
> >    Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D:
> >       243    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     646   10[  ] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >    [snip]
> > 
> > 
> > It looks like combined routing is not working at all except for this one
> > query.  (LID 37 is the switch which is connected to the HCA I am running
> > on.)
> > 
> >    18:53:18 > ./smpquery -c portinfo 37 0,1
> >    # Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0
> >    Mkey:............................0x0000000000000000
> >    GidPrefix:.......................0xfe80000000000000
> >    Lid:.............................148
> >    SMLid:...........................148
> >    [snip]
> > 
> > All other combined routing queries I try fail.  And even the one above is
> > wrong.  It is returning the data for port 6, not port 1.  Look at the output
> > from the local switch.
> > 
> >    19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663
> >    Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies:
> >       37    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     108    1[  ] "hype132" (  )
> >       37    2[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     528    1[  ] "hype133" (  )
> >       37    3[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     296    1[  ] "hype134" (  )
> >       37    4[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>      92    1[  ] "hype135" (  )
> >       37    5[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     144    1[  ] "hype136" (  )
> > 
> > This is what is connected to LID 148...
> >       37    6[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     148    1[  ] "hype137" (  )
> > 
> >       37    7[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     540    1[  ] "hype138" (  )
> >       37    8[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     212    1[  ] "hype139" (  )
> >       37    9[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     532    1[  ] "hype140" (  )
> >       37   10[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>      60    1[  ] "hype141" (  )
> >       37   11[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     192    1[  ] "hype142" (  )
> >       37   12[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     312    1[  ] "hype143" (  )
> >       37   13[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     647   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   14[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     641   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   15[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     643   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   16[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     653   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   17[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     637   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   18[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     610   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   19[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     655   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   20[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     645   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   21[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     635   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   22[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     651   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   23[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     639   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> >       37   24[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     649   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
> > 
> > Any idea what is going on?  These were all run with a smpquery built from the
> > current master tree.
> > 
> > On my little test system this seems to work just fine...  But not on this
> > system.  Did some older hardware not support combined DR routing?
> 
> Actually I take this back.  It seems an older version of smpquery works but
> not this newer one.  So I don't think this is a hardware issue.  :-(
> 
>    20:54:47 > ./smpquery -c nodeinfo 14 0,10
>    ibwarn: [21947] _do_madrpc: send failed; Invalid argument
>    ibwarn: [21947] mad_rpc: _do_madrpc failed; dport (Lid 14 DR path slid 4; dlid 65535; 0,10)
>    ./smpquery: iberror: failed: operation nodeinfo: node info query failed
> 
>    20:54:52 > ./smpquery -V
>    ./smpquery BUILD VERSION: 1.5.1_76524e3_dirty Build date: Apr 28 2009 20:47:10
> 
>    20:54:55 > smpquery -c nodeinfo 14 0,10
>    # Node info: Lid 14 DR path 0,10
>    BaseVers:........................1
>    ClassVers:.......................1
>    NodeType:........................Switch
>    NumPorts:........................24
>    SystemGuid:......................0x0008f10400411b19
>    Guid:............................0x0008f10400411b18
>    PortGuid:........................0x0008f10400411b18
>    PartCap:.........................8
>    DevId:...........................0x5a30
>    Revision:........................0x000001a1
>    LocalPort:.......................24
>    VendorId:........................0x0008f1
> 
>    20:54:59 > smpquery -V
>    smpquery BUILD VERSION: 1.3.6 Build date: Oct 13 2008 12:20:42
> 
> Ira
> 


-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


