[ofa-general] opensm: bad multicast forwarding table entries
akepner at sgi.com
Wed Nov 12 14:18:46 PST 2008
Here's a description of a problem we're seeing where multicast
forwarding tables are apparently getting set up incorrectly. I'd
appreciate any debug help from the opensm experts out there.
On large clusters (>1000 nodes or so) we often see hundreds of errors
from 'ibdiagnet -r' like the following (this is the simplest example
I could find):
-I- Multicast Group:0xC069 has:2 switches and:2 HCAs
-E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
-E- Disconnected HCA:r4i2n10/U1
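With hundreds of these errors it may help to tally them per MLID before digging into individual groups. The following is only a sketch: it assumes every report looks like the three lines above (an "-I- Multicast Group:" line followed by its "-E- Disconnected" lines), and the function name summarize_disconnects is made up for illustration.

```python
import re

# Patterns are inferred from the three sample lines only (an assumption);
# other ibdiagnet versions may format these messages differently.
GROUP_RE = re.compile(r"-I- Multicast Group:(0x[0-9A-Fa-f]+) has:(\d+) switches and:(\d+) HCAs")
DISC_RE = re.compile(r"-E- Disconnected (switch|HCA):(\S+)")


def summarize_disconnects(lines):
    """Map each MLID to the switches and HCAs reported as disconnected."""
    groups = {}
    current = None
    for line in lines:
        m = GROUP_RE.search(line)
        if m:
            # Start a new group; MLIDs are lowercased to match dump_mfts.sh.
            current = m.group(1).lower()
            groups.setdefault(current, {"switch": [], "HCA": []})
            continue
        m = DISC_RE.search(line)
        if m and current is not None:
            groups[current][m.group(1)].append(m.group(2))
    return groups


sample = [
    "-I- Multicast Group:0xC069 has:2 switches and:2 HCAs",
    "-E- Disconnected switch:S0800690000002e51/U1 in group:0xC069",
    "-E- Disconnected HCA:r4i2n10/U1",
]
report = summarize_disconnects(sample)
```

Sorting the result by the number of disconnected members would show whether a few MLIDs or a few switches account for most of the errors.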
These have invariably been multicast groups associated with IPv6
solicited-node multicast addresses. In this case, 'saquery -m' shows
only a single member, "r5lead":
MCMemberRecord member dump:
MGID....................0xff12601bffff0000 : 0x00000001ff26d289
Mlid....................0xC069
PortGid.................0xfe80000000000000 : 0x0002c9020026d289
ScopeState..............0x1
ProxyJoin...............0x0
NodeDescription.........r5lead HCA-1
ibdiagnet shows that "r5lead" is connected to the switch with lid
1609, port 24:
Switch 24 "S-0800690000002db4" # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1609 lmc 0
[24] "H-0002c9020026d288"[1](2c9020026d289) # "r5lead HCA-1" lid 1576 4xDDR
and the multicast forwarding table (from 'dump_mfts.sh') is consistent:
Multicast mlids [0xc000-0xc3ff] of switch Lid 1609 guid 0x0800690000002db4 (MT47396 Infiniscale-III Mellanox Technologies):
0 1 2
Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
MLid
....
0xc069 x
So far, so good. But we also have r4i2n10, connected to the switch with
lid 1533 port 7:
switchguid=0x800690000002e50(800690000002e50)
Switch 24 "S-0800690000002e50" # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
......
[7] "H-003048c2438a0000"[1](3048c2438a0001) # "r4i2n10 HCA-1" lid 771 4xDDR
with this mft entry:
Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
0 1 2
Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
MLid
.....
0xc069 x
Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001, would have an
mft entry for the multicast group with MGID ff12601bffff::1ff26d289?
Anyone else seen similar?
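To make the mismatch concrete, here is a rough sketch that computes the solicited-node MGID one would expect for a given PortGid. It rests on several assumptions: the node's IPv6 address shares its low 24 bits with the PortGid (as for a link-local address built from the port GUID), the MGID layout is the IPoIB one from RFC 4391 (0xff1, scope, signature 0x601b, P_Key, then the low 80 bits of the IPv6 group address), the P_Key is 0xffff, and the scope is link-local (2). The function names are made up for illustration.

```python
def solicited_node_mgid(port_gid: int, pkey: int = 0xFFFF, scope: int = 0x2) -> int:
    """Expected IPoIB MGID for the solicited-node group of a port (sketch)."""
    # Solicited-node address: ff02::1:ffXX:XXXX, where XX:XXXX are the low
    # 24 bits of the unicast address (assumed equal to those of the PortGid).
    low24 = port_gid & 0xFFFFFF
    ipv6_group = (0xFF02 << 112) | (0x1 << 32) | (0xFF << 24) | low24

    # Assumed RFC 4391 layout: ff1<scope> | 0x601b | P_Key | low 80 bits.
    low80 = ipv6_group & ((1 << 80) - 1)
    return ((0xFF10 | scope) << 112) | (0x601B << 96) | (pkey << 80) | low80


def format_mgid(mgid: int) -> str:
    """Print an MGID as two 64-bit halves, like the saquery dump."""
    return f"{mgid >> 64:#018x} : {mgid & (2**64 - 1):#018x}"


r5lead = 0xFE800000000000000002C9020026D289
r4i2n10 = 0xFE80000000000000003048C2438A0001

print(format_mgid(solicited_node_mgid(r5lead)))   # 0xff12601bffff0000 : 0x00000001ff26d289
print(format_mgid(solicited_node_mgid(r4i2n10)))  # 0xff12601bffff0000 : 0x00000001ff8a0001
```

Under those assumptions the value for r5lead matches the MGID in the saquery dump, while r4i2n10's own solicited-node group would have a different MGID (ending in ff8a0001), so r4i2n10 has no obvious reason to appear in MLID 0xC069.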
--
Arthur