[ofa-general] Issues with combined routing in smpquery

Ira Weiny weiny2 at llnl.gov
Tue Apr 28 20:27:36 PDT 2009


Sasha, Hal,

I have some hardware on which the following query does not work.

   18:40:54 > ./smpquery -c nodeinfo 243 0,1
   ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1)
   ./smpquery: iberror: failed: operation nodeinfo: node info query failed

This is the node I am running the query from.

   20:08:46 > ibstat
   CA 'mlx4_0'
        CA type: MT25418
        Number of ports: 2
        Firmware version: 2.6.0
        Hardware version: a0
        Node GUID: 0x0002c9020025feb4
        System image GUID: 0x0002c9020025feb7
        Port 1:
                  State: Active
                  Physical state: LinkUp
                  Rate: 10
                  Base lid: 148
                  LMC: 2
                  SM lid: 148
                  Capability mask: 0x0251086a
                  Port GUID: 0x0002c9020025feb5
   [snip]

   19:12:10 > hostname
   hype137


A query on the LID alone returns this.

   18:41:20 > ./smpquery nodeinfo 243 
   # Node info: Lid 243
   [snip]
   NodeType:........................Switch
   NumPorts:........................24
   SystemGuid:......................0x0008f10400400e69
   Guid:............................0x0008f10400400e69
   PortGuid:........................0x0008f10400400e69
   [snip]

And iblinkinfo shows this.

   18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69
   Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D:
      243    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     646   10[  ] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
   [snip]


It looks like combined routing is not working at all except for this one
query.  (LID 37 is the switch which is connected to the HCA I am running
on.)

   18:53:18 > ./smpquery -c portinfo 37 0,1
   # Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0
   Mkey:............................0x0000000000000000
   GidPrefix:.......................0xfe80000000000000
   Lid:.............................148
   SMLid:...........................148
   [snip]

All other combined routing queries I try fail.  And even the one above is
wrong.  It is returning the data for port 6, not port 1: the reply carries
Lid 148, which hangs off port 6 of the local switch.  Look at the output from
the local switch.

   19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663
   Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies:
      37    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     108    1[  ] "hype132" (  )
      37    2[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     528    1[  ] "hype133" (  )
      37    3[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     296    1[  ] "hype134" (  )
      37    4[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>      92    1[  ] "hype135" (  )
      37    5[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     144    1[  ] "hype136" (  )

This is what is connected to LID 148...
      37    6[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     148    1[  ] "hype137" (  )

      37    7[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     540    1[  ] "hype138" (  )
      37    8[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     212    1[  ] "hype139" (  )
      37    9[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     532    1[  ] "hype140" (  )
      37   10[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>      60    1[  ] "hype141" (  )
      37   11[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     192    1[  ] "hype142" (  )
      37   12[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     312    1[  ] "hype143" (  )
      37   13[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     647   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   14[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     641   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   15[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     643   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   16[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     653   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   17[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     637   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   18[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     610   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   19[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     655   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   20[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     645   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   21[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     635   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   22[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     651   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   23[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     639   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
      37   24[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     649   13[12] "ISR9288/ISR9096 Voltaire sLB-24D" (  )
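
To check which switch port faces a given LID without reading the listing by
eye, something like the following could be used.  It is only a rough sketch:
the regular expression is guessed from the iblinkinfo.pl lines shown above, and
the names parse_links and ports_facing are made up for illustration, not part
of any OFED tool.

```python
import re

# Pattern inferred from the iblinkinfo.pl lines in this mail (an assumption,
# not a documented format):  LID  PORT[..]  ==( ... )==>  LID  PORT[..] "name"
LINK_RE = re.compile(
    r'^\s*(?P<lid>\d+)\s+(?P<port>\d+)\[[^\]]*\]\s+==\(.*?\)==>\s+'
    r'(?P<peer_lid>\d+)\s+(?P<peer_port>\d+)\[[^\]]*\]\s+"(?P<peer_name>[^"]*)"'
)

def parse_links(text):
    """Return (lid, port, peer_lid, peer_port, peer_name) for each link line."""
    links = []
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m:
            links.append((int(m['lid']), int(m['port']),
                          int(m['peer_lid']), int(m['peer_port']),
                          m['peer_name']))
    return links

def ports_facing(links, peer_lid):
    """List the local switch ports whose remote end has the given LID."""
    return [port for _, port, plid, _, _ in links if plid == peer_lid]

# Two lines copied from the listing above.
SAMPLE = '''\
      37    1[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     108    1[  ] "hype132" (  )
      37    6[  ]  ==( 4X 2.5 Gbps Active /   LinkUp)==>     148    1[  ] "hype137" (  )
'''

if __name__ == "__main__":
    print(ports_facing(parse_links(SAMPLE), 148))
```

Fed the full listing, this would let one confirm that LID 148 sits behind
port 6 of LID 37, while a DR path of 0,1 would be expected to leave through
port 1.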

Any idea what is going on?  These were all run with a smpquery built from the
current master tree.

On my little test system this seems to work just fine, but not on this
system.  Did some older hardware not support combined DR routing?

Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov


