[openib-general] IPoIB Multicast Connectivity
Hal Rosenstock
halr at voltaire.com
Fri Sep 2 13:52:54 PDT 2005
Hi Sean,
Here's my (somewhat long-winded) analysis of your osm.log:
First I see:
Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: Unable to register class 129 version 1.
Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: ]
Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ]
Sep 02 13:46:34 [AB43F140] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
and then OpenSM shuts down and is restarted 4 minutes later.
It does that again and then it is up and running.
Class 129 is 0x81, which is the directed route subnet management (SM) class. Was the ib_umad module loaded ?
What OpenIB svn version are you running ? What Linux kernel version ?
In terms of failures, I then see a join failure on 224.0.0.22:
MGID....................0xff12401bffff0000 : 0x0000000000000016
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Sep 02 14:01:56 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7.
That is repeated a number of times from this port and some other ports.
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
That may be OK as 224.0.0.22 is for IGMP and perhaps there are no IPmc routers on this
IPoIB subnet ? All the IPmc is subnet local, right ?
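To see what the two component masks in the ERR 1B11 line differ by, they can be unpacked bit by bit. This is only a sketch: the bit-to-field table below is the MCMemberRecord component mask layout as I recall it (an assumption to check against the IB spec or OpenSM's ib_types.h), and the function names are made up for illustration.

```python
# MCMemberRecord component mask bits, lowest bit first.
# Assumption: this ordering is recalled from the SA attribute layout;
# verify it against the spec before relying on it.
MCR_COMPMASK_BITS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
]


def decode_comp_mask(mask):
    """Return the names of the components selected by the mask."""
    return [name for bit, name in enumerate(MCR_COMPMASK_BITS)
            if mask & (1 << bit)]


def missing_components(got, expected):
    """Components present in the expected mask but absent from the request."""
    return decode_comp_mask(expected & ~got)


# Values taken from the ERR 1B11 line quoted above.
print(decode_comp_mask(0x10083))
print(missing_components(0x10083, 0x130c7))
```

If that table is right, the request carried only MGID, PortGID, P_Key and JoinState, and the SM additionally wanted Q_Key, TClass, SL and FlowLabel. That would be consistent with a plain join to a group that does not exist yet being checked against the mask needed to create it, but I have not confirmed that from the code.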
In terms of MC groups, I do see the IPv4 broadcast group being set up:
MGID....................0xff12401bffff0000 : 0x00000000ffffffff
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC000
and others too:
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x00000000000000fb
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC002
MGID....................0xff12601bffff0000 : 0x00000001ff03d269
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC003
MGID....................0xff12601bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC004
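For reading the MGIDs above, here is a small sketch that maps an IPv4 IPoIB MGID back to the IP multicast address. It assumes the usual IPoIB layout (0x401b signature for IPv4, 0x601b for IPv6, then the P_Key, with the low 28 bits of the IPv4 group address in the low bits of the MGID, and all ones in the low 32 bits for the broadcast group). The function name is hypothetical, and IPv6 groups are only identified, not decoded.

```python
import ipaddress


def decode_ipoib_mgid(hi, lo):
    """Decode an IPoIB MGID given as two 64-bit halves (as printed in osm.log).

    Returns (family, pkey, address). The address is None for anything
    this sketch does not decode.
    """
    signature = (hi >> 32) & 0xFFFF   # 0x401b = IPv4, 0x601b = IPv6
    pkey = (hi >> 16) & 0xFFFF
    if signature == 0x401B:
        if lo == 0xFFFFFFFF:
            # Broadcast group: low 32 bits are all ones.
            return ("IPv4", pkey, "255.255.255.255")
        # Multicast group: low 28 bits of the IPv4 address, under 224.0.0.0/4.
        addr = ipaddress.IPv4Address(0xE0000000 | (lo & 0x0FFFFFFF))
        return ("IPv4", pkey, str(addr))
    if signature == 0x601B:
        return ("IPv6", pkey, None)   # left undecoded in this sketch
    return ("unknown", pkey, None)


for lo in (0xFFFFFFFF, 0x1, 0xFB):
    print(decode_ipoib_mgid(0xFF12401BFFFF0000, lo))
```

With that mapping, the three IPv4 groups listed above come out as the broadcast group, 224.0.0.1 (all hosts) and 224.0.0.251.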
I then see the next node come up:
MGID....................0xff12401bffff0000 : 0x00000000ffffffff
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC000
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC001
and then the next one:
MGID....................0xff12401bffff0000 : 0x00000000ffffffff
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC000
MGID....................0xff12401bffff0000 : 0x00000000ffffffff
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC000
Perhaps the response didn't make it back, so the end node rerequested this.
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
...
Same thing (but worse) on this group...
Perhaps there is some problem with the HCA or the path to that HCA.
I then see that node rerequest the broadcast group and then 224.0.0.1.
Was it rebooted ? That node seems to be rerequesting quite a number of
times.
I think you are also a candidate to try out the new OpenSM when it is
available (I expect early next week) as the multicast handling by the
SM is much better. I'll be curious to see if this still occurs or not.
This is not to say there might not be other issues but these would be
the first ones to get squared away.
I'm not exactly sure what the SA client retry strategy is in IPoIB in
the end node but that may be germane to this as well.
I also see several of your IPmc addresses flow by in the log:
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC005
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC007
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC007
That looks like the SM set up a different MLID for the same group (1
port on one MLID and 2 other ports on the second MLID).
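A quick way to hunt for more cases like this is to group the records by MGID and flag any group that shows up with more than one MLID. The sketch below assumes the records appear as consecutive MGID / PortGid / Mlid lines exactly as quoted in this mail (the real osm.log may interleave other lines, in which case the pattern needs adjusting). The names are made up for illustration.

```python
import re
from collections import defaultdict

HEX = r"(0x[0-9a-fA-F]+)"
# One record = MGID line, PortGid line, Mlid line, in that order.
RECORD = re.compile(
    r"MGID\.+" + HEX + r" : " + HEX + r"\s+"
    r"PortGid\.+" + HEX + r" : " + HEX + r"\s+"
    r"Mlid\.+" + HEX
)


def mlids_by_group(log_text):
    """Map (mgid_hi, mgid_lo) -> {mlid -> set of port GUIDs seen with it}."""
    groups = defaultdict(lambda: defaultdict(set))
    for m in RECORD.finditer(log_text):
        mgid = (int(m.group(1), 16), int(m.group(2), 16))
        port_guid = int(m.group(4), 16)
        mlid = int(m.group(5), 16)
        groups[mgid][mlid].add(port_guid)
    return groups


def inconsistent_groups(log_text):
    """Return only the groups that were reported with more than one MLID."""
    return {mgid: dict(mlids)
            for mgid, mlids in mlids_by_group(log_text).items()
            if len(mlids) > 1}


# The three records quoted just above.
SAMPLE = """\
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC005
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC007
MGID....................0xff12401bffff0000 : 0x00000000000a0a15
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC007
"""

for mgid, mlids in inconsistent_groups(SAMPLE).items():
    for mlid, ports in sorted(mlids.items()):
        print(hex(mgid[1]), hex(mlid), sorted(hex(p) for p in ports))
```

Run over the whole log, this would show whether the split MLID is limited to this one group or happens elsewhere too.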
MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC009
MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
PortGid.................0xfe80000000000000 : 0x0005ad000003d269
Mlid....................0xC009
That one repeats a bunch of times. Several weirdnesses that need some
further investigation.
-- Hal