[ofa-general] opensm: bad multicast forwarding table entries

akepner at sgi.com
Wed Nov 12 14:18:46 PST 2008


Here's a description of a problem we're seeing where multicast 
forwarding tables are apparently being set up incorrectly. I'd 
appreciate any debugging help from the opensm experts out there.

On large clusters (>1000 nodes or so) we often see hundreds of errors 
from 'ibdiagnet -r' like the following (this is the simplest example 
I could find):

-I- Multicast Group:0xC069 has:2 switches and:2 HCAs
-E- Disconnected switch:S0800690000002e51/U1 in group:0xC069
-E- Disconnected HCA:r4i2n10/U1

The affected groups have invariably been ones associated with IPv6 
solicited-node multicast addresses. In this case, 'saquery -m' shows 
only a single member, "r5lead":

MCMemberRecord member dump:
                MGID....................0xff12601bffff0000 : 0x00000001ff26d289
                Mlid....................0xC069
                PortGid.................0xfe80000000000000 : 0x0002c9020026d289
                ScopeState..............0x1
                ProxyJoin...............0x0
                NodeDescription.........r5lead HCA-1

ibdiagnet shows that "r5lead" is connected to the switch with lid 
1609, port 24:

Switch  24 "S-0800690000002db4"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1609 lmc 0
[24]    "H-0002c9020026d288"[1](2c9020026d289)          # "r5lead HCA-1" lid 1576 4xDDR

and the multicast forwarding table (from 'dump_mfts.sh') is consistent:

Multicast mlids [0xc000-0xc3ff] of switch Lid 1609 guid 0x0800690000002db4 (MT47396 Infiniscale-III Mellanox Technologies):
            0                   1                   2
     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
 MLid
....
0xc069                                                      x


So far, so good. But we also have r4i2n10, connected to the switch with 
lid 1533 port 7:

switchguid=0x800690000002e50(800690000002e50)
Switch  24 "S-0800690000002e50"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 1533 lmc 0
......
[7]     "H-003048c2438a0000"[1](3048c2438a0001)                 # "r4i2n10 HCA-1" lid 771 4xDDR

with this mft entry:

Multicast mlids [0xc000-0xc3ff] of switch Lid 1533 guid 0x0800690000002e50 (MT47396 Infiniscale-III Mellanox Technologies):
            0                   1                   2
     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
 MLid
.....
0xc069                    x
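
With hundreds of these to check, it may help to read the port columns 
mechanically. The following is a minimal sketch that assumes the dump 
keeps its column alignment (port N sits 2*N characters to the right of 
the first port digit on the "Ports:" line). The helper name 
marked_ports is hypothetical, and the sample rows are rebuilt with the 
mark in the column of the port named in this message rather than 
copied byte-for-byte.

```python
HEADER = "     Ports: 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4"


def marked_ports(header, row):
    """Return the port numbers whose column holds an 'x' in an MFT row."""
    # Column of port 0: just past "Ports: " on the header line.
    first_col = header.index("Ports:") + len("Ports: ")
    ports = []
    for col, ch in enumerate(row):
        offset = col - first_col
        # Skip the 'x' inside the "0x...." MLID and any misaligned marks.
        if ch == "x" and offset >= 0 and offset % 2 == 0:
            ports.append(offset // 2)
    return ports


# Rows rebuilt for illustration: mark under port 7 and under port 24.
row_lid_1533 = "0xc069" + " " * 20 + "x"
row_lid_1609 = "0xc069" + " " * 54 + "x"

print(marked_ports(HEADER, row_lid_1533))  # [7]
print(marked_ports(HEADER, row_lid_1609))  # [24]
```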

Any idea why "r4i2n10", with PortGid fe80::3048c2438a0001, would have an 
mft entry for the multicast group with MGID ff12601bffff::1ff26d289?

Anyone else seen similar?

-- 
Arthur



