[openib-general] Re: IPoIB Multicast Connectivity

Sean Hubbell shubbell at dbresearch.net
Fri Sep 2 13:28:25 PDT 2005


Hal Rosenstock wrote:

>Hi Sean,
>
>Here's my (somewhat long-winded) analysis of your osm.log:
>
>First I see:
>Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: Unable to register class 129 version 1. 
>Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: ] 
>Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
>Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ]
>Sep 02 13:46:34 [AB43F140] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
>and then OpenSM shuts down and is restarted 4 minutes later.
>
>It does that again and then it is up and running.
>
>Class 129 is 0x81, which is the directed-route subnet management class
>(the one SubnGet uses). Was the ib_umad module loaded?
>
>What OpenIB svn version are you running? What Linux kernel version?
>
>In terms of failures, I then see a join failure on 224.0.0.22:
>                                MGID....................0xff12401bffff0000 : 0x0000000000000016
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>Sep 02 14:01:56 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7.
>
>That is repeated a number of times from this port and some other ports.
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
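To see what the ERR 1B11 line is complaining about, it helps to diff the two component masks. Below is a minimal sketch, not taken from OpenSM. The bit-to-field table is an assumption based on the MCMemberRecord field order as I understand it (MGID is bit 0, ProxyJoin is bit 17), and the helper names are made up for illustration.

```python
# Assumption: bit positions follow the MCMemberRecord component-mask order
# (bit 0 = MGID ... bit 17 = ProxyJoin). Check against the IBA spec before
# relying on the names; the bit arithmetic itself does not depend on them.
MCMEMBER_FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU", "TClass",
    "P_Key", "RateSelector", "Rate", "PacketLifeTimeSelector",
    "PacketLifeTime", "SL", "FlowLabel", "HopLimit", "Scope",
    "JoinState", "ProxyJoin",
]


def mask_fields(mask):
    """Names of the fields whose bits are set in a component mask."""
    return [name for bit, name in enumerate(MCMEMBER_FIELDS) if mask & (1 << bit)]


def missing_fields(received, expected):
    """Fields the SM expected but the request did not flag."""
    return mask_fields(expected & ~received)


received = 0x0000000000010083  # "component mask" from the log line
expected = 0x00000000000130C7  # "expected comp mask" from the log line

print("request sets:", mask_fields(received))
print("missing     :", missing_fields(received, expected))
```

If the table is right, the request only sets MGID, PortGID, P_Key and JoinState, while the SM also wants Q_Key, TClass, SL and FlowLabel. That would fit a plain join arriving for a group that does not exist yet, so the SM has nothing to create it from.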
>
>That may be OK, as 224.0.0.22 is used by IGMP and perhaps there are no IPmc routers on this
>IPoIB subnet. All the IPmc is subnet local, right?
>
>
>In terms of MC groups, I do see the IPv4 broadcast group being set up:
>                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC000
>and others too:
>
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC001
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000000fb
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC002
>
>                                MGID....................0xff12601bffff0000 : 0x00000001ff03d269
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC003
>
>                                MGID....................0xff12601bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC004
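For reading the MGIDs above, here is a rough sketch of how an IPv4 address maps onto an IPoIB MGID, as I understand the IPoIB mapping (the layout later written down in RFC 4391). The function name and the defaults are my own assumptions: P_Key 0xffff and link-local scope 0x2 are taken from the values visible in this log. The 0x601b groups are not covered here; that signature appears to be the IPv6 one.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPoIB signature for IPv4, as seen in the log MGIDs


def ipv4_to_mgid(addr, pkey=0xFFFF, scope=0x2):
    """Return (high 64 bits, low 64 bits) of the IPoIB MGID for an IPv4 address.

    Assumptions: transient flag nibble 0x1, and defaults matching this log.
    """
    ip = ipaddress.IPv4Address(addr)
    if int(ip) == 0xFFFFFFFF:
        group = 0xFFFFFFFF              # broadcast group keeps all 32 bits set
    elif ip.is_multicast:
        group = int(ip) & 0x0FFFFFFF    # low 28 bits of the multicast address
    else:
        raise ValueError(f"{addr} is neither multicast nor broadcast")
    high = (0xFF << 56) | (0x1 << 52) | (scope << 48) \
        | (IPV4_SIGNATURE << 32) | (pkey << 16)
    return high, group


for addr in ("255.255.255.255", "224.0.0.1", "224.0.0.22", "224.0.0.251"):
    high, low = ipv4_to_mgid(addr)
    print(f"{addr:>15}  0x{high:016x} : 0x{low:016x}")
```

With these assumptions, 224.0.0.1 and 224.0.0.251 come out as the 0x...0001 and 0x...00fb groups listed above, and 224.0.0.22 as the group whose join failed.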
>
>I then see the next node come up:
>
>                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
>                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
>                                Mlid....................0xC000
>
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
>                                Mlid....................0xC001
>
>and then the next one:
>
>                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC000
>                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC000
>
>Perhaps the response didn't make it back, so the end node re-requested the join.
>
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>                                MGID....................0xff12401bffff0000 : 0x0000000000000001
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC001
>...
>
>Same thing (but worse) on this group...
>
>Perhaps there is some problem with the HCA or the path to that HCA.
>
>I then see that node re-request the broadcast group and then 224.0.0.1.
>Was it rebooted? That node seems to be re-requesting quite a number of
>times.
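To put numbers on the re-requests, something like the following could count how often each port shows up for each group. It is only a sketch: it assumes every MGID/PortGid/Mlid triple in the dump corresponds to one join being handled, and the function names and the SAMPLE text are illustrative (the sample values are copied from the excerpts above).

```python
import re
from collections import Counter

# Matches the dump lines, e.g. "MGID....0x<hi> : 0x<lo>" or "Mlid....0xC001".
FIELD_RE = re.compile(
    r"(MGID|PortGid|Mlid)\.+(0x[0-9a-fA-F]+)(?:\s*:\s*(0x[0-9a-fA-F]+))?"
)


def parse_joins(text):
    """Return one (mgid, port_gid) pair per record that ends in an Mlid line."""
    joins, current = [], {}
    for line in text.splitlines():
        match = FIELD_RE.search(line)
        if match is None:
            continue
        name, first, second = match.groups()
        if name == "Mlid":
            if "MGID" in current and "PortGid" in current:
                joins.append((current["MGID"], current["PortGid"]))
            current = {}
        elif second is not None:
            current[name] = (int(first, 16), int(second, 16))
    return joins


def repeated_joins(text):
    """Map (mgid, port_gid) to its count, for pairs seen more than once."""
    counts = Counter(parse_joins(text))
    return {key: n for key, n in counts.items() if n > 1}


SAMPLE = """\
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Mlid....................0xC001
MGID....................0xff12401bffff0000 : 0x0000000000000001
PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
Mlid....................0xC001
"""

for (mgid, port), n in repeated_joins(SAMPLE).items():
    print(f"port 0x{port[1]:016x} joined group 0x{mgid[1]:016x} {n} times")
```

Running it over the whole osm.log instead of SAMPLE should make it obvious which port is doing the bulk of the re-requesting.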
>
>I think you are also a candidate to try out the new OpenSM when it is
>available (I expect early next week) as the multicast handling by the
>SM is much better. I'll be curious to see if this still occurs or not.
>
>This is not to say there might not be other issues, but these would be
>the first ones to get squared away.
>
>I'm not exactly sure what the SA client retry strategy is in IPoIB in
>the end node but that may be germane to this as well.
>
>I also see several of your IPmc addresses flow by in the log:
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15 
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
>                                Mlid....................0xC005
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15
>                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
>                                Mlid....................0xC007
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC007
>
>That looks like the SM set up a different MLID for the same group (1
>port on one MLID and 2 other ports on the second MLID).
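A quick way to look for more cases like this is to group the records by MGID and flag any group that was handed more than one MLID. This is a hedged sketch with a made-up helper name; the records are the three entries quoted just above, typed in by hand as (MGID, PortGid, MLID).

```python
from collections import defaultdict


def mlid_conflicts(records):
    """Return {mgid: sorted MLIDs} for every MGID seen with more than one MLID."""
    mlids = defaultdict(set)
    for mgid, _port_gid, mlid in records:
        mlids[mgid].add(mlid)
    return {mgid: sorted(seen) for mgid, seen in mlids.items() if len(seen) > 1}


GROUP = (0xFF12401BFFFF0000, 0x00000000000A0A15)
PREFIX = 0xFE80000000000000

RECORDS = [
    (GROUP, (PREFIX, 0x0005AD000003CFB9), 0xC005),
    (GROUP, (PREFIX, 0x0005AD0000047A81), 0xC007),
    (GROUP, (PREFIX, 0x0005AD000003D269), 0xC007),
]

for (high, low), seen in mlid_conflicts(RECORDS).items():
    pretty = ", ".join(f"0x{m:04X}" for m in seen)
    print(f"MGID 0x{high:016x} : 0x{low:016x} has MLIDs {pretty}")
```

A hit is not proof of an SM bug on its own. A group may get a new MLID if it was torn down and recreated between the two entries, so the timestamps around each record are worth checking too.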
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
>                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
>                                Mlid....................0xC009
>
>                                MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
>                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
>                                Mlid....................0xC009
>
>That one repeats a bunch of times. Several weirdnesses that need some
>further investigation.
>
>-- Hal
>
I'll get the new version next week, try it, and let you know the results.
If I still have problems, I'll send the version along, so we'll at least
know which version of OpenIB I have; I cannot find the version of my
current install.

On a side note, I could not ask for better assistance. Thanks, Hal.

Sean


