[ofa-general] Issues with combined routing in smpquery
Ira Weiny
weiny2 at llnl.gov
Tue Apr 28 20:27:36 PDT 2009
Sasha, Hal,
I have some hardware on which the following query does not work.
18:40:54 > ./smpquery -c nodeinfo 243 0,1
ibwarn: [22072] mad_rpc: _do_madrpc failed; dport (Lid 243 DR path slid 148; dlid 65535; 0,1)
./smpquery: iberror: failed: operation nodeinfo: node info query failed
This is the node I am running it from:
20:08:46 > ibstat
CA 'mlx4_0'
CA type: MT25418
Number of ports: 2
Firmware version: 2.6.0
Hardware version: a0
Node GUID: 0x0002c9020025feb4
System image GUID: 0x0002c9020025feb7
Port 1:
State: Active
Physical state: LinkUp
Rate: 10
Base lid: 148
LMC: 2
SM lid: 148
Capability mask: 0x0251086a
Port GUID: 0x0002c9020025feb5
[snip]
19:12:10 > hostname
hype137
A query on the LID alone returns this.
18:41:20 > ./smpquery nodeinfo 243
# Node info: Lid 243
[snip]
NodeType:........................Switch
NumPorts:........................24
SystemGuid:......................0x0008f10400400e69
Guid:............................0x0008f10400400e69
PortGuid:........................0x0008f10400400e69
[snip]
And iblinkinfo shows this:
18:41:26 > iblinkinfo.pl -S 0x0008f10400400e69
Switch 0x0008f10400400e69 ISR9288 Voltaire sFB-12D:
243 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 646 10[ ] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
[snip]
It looks like combined routing is not working at all except for this one
query. (LID 37 is the switch which is connected to the HCA I am running
on.)
18:53:18 > ./smpquery -c portinfo 37 0,1
# Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................148
SMLid:...........................148
[snip]
All other combined routing queries I try fail, and even the one above is
wrong: it returns data for the node on port 6 (LID 148), not the one on
port 1. Look at the output from the local switch.
19:12:00 > iblinkinfo.pl -R -S 0x000b8cffff004663
Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies:
37 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 108 1[ ] "hype132" ( )
37 2[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 528 1[ ] "hype133" ( )
37 3[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 296 1[ ] "hype134" ( )
37 4[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 92 1[ ] "hype135" ( )
37 5[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 144 1[ ] "hype136" ( )
This is what is connected to LID 148...
37 6[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 148 1[ ] "hype137" ( )
37 7[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 540 1[ ] "hype138" ( )
37 8[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 212 1[ ] "hype139" ( )
37 9[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 532 1[ ] "hype140" ( )
37 10[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 60 1[ ] "hype141" ( )
37 11[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 192 1[ ] "hype142" ( )
37 12[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 312 1[ ] "hype143" ( )
37 13[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 647 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 14[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 641 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 15[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 643 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 16[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 653 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 17[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 637 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 18[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 610 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 19[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 655 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 20[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 645 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 21[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 635 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 22[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 651 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 23[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 639 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
37 24[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 649 13[12] "ISR9288/ISR9096 Voltaire sLB-24D" ( )
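
The mismatch described above can be cross-checked mechanically. The following is a minimal Python sketch, not part of infiniband-diags: the function names (parse_fields, parse_links, ports_for_lid) are made up for illustration, and the regular expressions assume the smpquery and iblinkinfo.pl output layouts exactly as they appear in this message. It reads the Lid field from the portinfo reply and looks up which port of the local switch that LID is attached to.

```python
import re

# "Key:.......value" lines as printed by smpquery (layout assumed from this message).
FIELD_RE = re.compile(r'^(\w+):\.+(\S+)\s*$')

# Link lines as printed by iblinkinfo.pl (layout assumed from this message):
#   <lid> <port>[..] ==( ... )==> <remote lid> <remote port>[..] "<name>" ( )
LINK_RE = re.compile(
    r'^\s*(\d+)\s+(\d+)\[[^\]]*\]\s*==\(.*?\)==>\s*'
    r'(\d+)\s+(\d+)\[[^\]]*\]\s*"([^"]*)"'
)


def parse_fields(text):
    """Return the dotted key/value fields of an smpquery reply as a dict."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            fields[m.group(1)] = m.group(2)
    return fields


def parse_links(text):
    """Map local switch port -> (remote LID, remote port, remote node name)."""
    links = {}
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m:
            _lid, port, rlid, rport, name = m.groups()
            links[int(port)] = (int(rlid), int(rport), name)
    return links


def ports_for_lid(links, lid):
    """Return the local switch ports whose remote end has the given LID."""
    return [port for port, (rlid, _rport, _name) in links.items() if rlid == lid]


# Excerpts copied from the output quoted in this message.
PORTINFO_REPLY = """\
# Port info: Lid 37 DR path slid 148; dlid 65535; 0,1 port 0
Mkey:............................0x0000000000000000
GidPrefix:.......................0xfe80000000000000
Lid:.............................148
SMLid:...........................148
"""

LINKINFO = """\
Switch 0x000b8cffff004663 MT47396 Infiniscale-III Mellanox Technologies:
37 1[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 108 1[ ] "hype132" ( )
37 6[ ] ==( 4X 2.5 Gbps Active / LinkUp)==> 148 1[ ] "hype137" ( )
"""

reply_lid = int(parse_fields(PORTINFO_REPLY)["Lid"])
links = parse_links(LINKINFO)

print("reply came from LID", reply_lid)
print("that LID is on switch port(s)", ports_for_lid(links, reply_lid))
print("switch port 1 actually leads to", links[1])
```

With the excerpts above, the reply's LID resolves to switch port 6 while port 1 leads to a different node, which matches the discrepancy reported here. In practice the two strings would be replaced with the full captured output of the commands.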
Any idea what is going on? These were all run with an smpquery built from the
current master tree.
On my little test system this seems to work just fine, but not on this
system. Does some older hardware not support combined DR routing?
Ira
--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
weiny2 at llnl.gov