[openib-general] IPoIB Multicast Connectivity

Hal Rosenstock halr at voltaire.com
Fri Sep 2 13:52:54 PDT 2005


Hi Sean,

Here's my (somewhat long-winded) analysis of your osm.log:

First I see:
Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: Unable to register class 129 version 1. 
Sep 02 13:46:34 [AB43F140] -> osm_vendor_bind: ] 
Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind() failed.
Sep 02 13:46:34 [AB43F140] -> osm_sm_mad_ctrl_bind: ]
Sep 02 13:46:34 [AB43F140] -> osm_sm_bind: ERR 2E10: SM MAD Controller bind() failed (IB_ERROR).
and then OpenSM shuts down and is restarted 4 minutes later.

It does that again and then it is up and running.

Class 129 is 0x81, which is the directed-route subnet management class
(the one SubnGet SMPs go out on). Was the ib_umad module loaded ?

What OpenIB svn version are you running ? What Linux kernel version ?

In terms of failures, I then see a join failure on 224.0.0.22
                                MGID....................0xff12401bffff0000 : 0x0000000000000016
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
Sep 02 14:01:56 [42FFF960] -> osm_mcmr_rcv_join_mgrp: ERR 1B11: method = SubnAdmSet,scope_state = 0x1, component mask = 0x0000000000010083, expected comp mask = 0x00000000000130c7.

That is repeated a number of times from this port and some other ports.
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81

That may be OK as 224.0.0.22 is for IGMP and perhaps there are no IPmc routers on this
IPoIB subnet ? All the IPmc is subnet local, right ?
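The two component masks in the ERR 1B11 line can be decoded bit by bit to see what the join request left out. This is only a sketch: the field order below is my reading of the MCMemberRecord component mask (bit 0 = MGID up to bit 17 = ProxyJoin), and the helper names are made up for illustration.

```python
# MCMemberRecord component mask bits, lowest bit first. The ordering is an
# assumption based on the SA MCMemberRecord layout; check it against the
# spec or your headers before relying on it.
MCMEMBER_FIELDS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def decode_comp_mask(mask):
    """Return the names of the fields whose bits are set in mask."""
    return [name for bit, name in enumerate(MCMEMBER_FIELDS)
            if mask & (1 << bit)]


def missing_fields(got, expected):
    """Fields that are set in expected but not in got."""
    return decode_comp_mask(expected & ~got)


got = 0x0000000000010083       # from the ERR 1B11 line
expected = 0x00000000000130c7  # from the ERR 1B11 line

print(decode_comp_mask(got))
# ['MGID', 'PortGID', 'P_Key', 'JoinState']
print(missing_fields(got, expected))
# ['Q_Key', 'TClass', 'SL', 'FlowLabel']
```

If that decoding is right, the request carried only what is needed to join an existing group, not the extra fields the SM wants in order to create one, which would fit a join to a group that nobody has created yet.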


In terms of MC groups, I do see the IPv4 broadcast group being set up:
                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC000
and others too:

                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC001

                                MGID....................0xff12401bffff0000 : 0x00000000000000fb
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC002

                                MGID....................0xff12601bffff0000 : 0x00000001ff03d269
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC003

                                MGID....................0xff12601bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC004
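For reading these MGIDs, here is a small sketch of the IPv4 mapping as I understand it from the IPoIB spec (RFC 4391): 0xff, flags 1, scope 2, signature 0x401b, the P_Key, and the low 28 bits of the IP multicast address in the low end of the GID. Groups with signature 0x601b are the IPv6 ones and are not handled here. The function names and the default P_Key of 0xffff are assumptions for illustration.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # 0x601b would be IPv6 (not decoded here)


def ipv4_mcast_to_mgid(addr, pkey=0xFFFF, scope=0x2):
    """Return the (high, low) 64-bit halves of the MGID, as osm.log prints them."""
    ip = int(ipaddress.IPv4Address(addr))
    high = (0xFF << 56) | (0x1 << 52) | (scope << 48) \
        | (IPV4_SIGNATURE << 32) | (pkey << 16)
    if ip == 0xFFFFFFFF:
        low = 0xFFFFFFFF          # limited broadcast maps to the broadcast group
    else:
        low = ip & 0x0FFFFFFF     # low 28 bits of the class D address
    return high, low


def mgid_low_to_ipv4(low):
    """Invert the mapping for the low half of an IPv4 MGID."""
    if low == 0xFFFFFFFF:
        return "255.255.255.255"
    # The top four bits of any IPv4 multicast address are always 1110.
    return str(ipaddress.IPv4Address(0xE0000000 | (low & 0x0FFFFFFF)))


high, low = ipv4_mcast_to_mgid("224.0.0.22")
print(f"0x{high:016x} : 0x{low:016x}")
# 0xff12401bffff0000 : 0x0000000000000016

for value in (0x1, 0xFB, 0xFFFFFFFF):
    print(hex(value), "->", mgid_low_to_ipv4(value))
# 0x1 -> 224.0.0.1
# 0xfb -> 224.0.0.251
# 0xffffffff -> 255.255.255.255
```

So the three IPv4 groups listed for this port would be the broadcast group, 224.0.0.1 (all hosts), and 224.0.0.251.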

I then see the next node come up:

                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
                                Mlid....................0xC000

                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
                                Mlid....................0xC001

and then the next one:

                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC000
                                MGID....................0xff12401bffff0000 : 0x00000000ffffffff
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC000

Perhaps the response didn't make it back, so the end node rerequested this.

                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
                                MGID....................0xff12401bffff0000 : 0x0000000000000001
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC001
...

Same thing (but worse) on this group...

Perhaps there is some problem with the HCA or the path to that HCA.

I then see that node rerequest the broadcast group and then 224.0.0.1.
Was it rebooted ? That node seems to be rerequesting quite a number of
times.

I think you are also a candidate to try out the new OpenSM when it is
available (I expect early next week) as the multicast handling by the
SM is much better. I'll be curious to see if this still occurs or not.

This is not to say there might not be other issues but these would be
the first ones to get squared away.

I'm not exactly sure what the SA client retry strategy is in IPoIB in
the end node but that may be germane to this as well.

I also see several of your IPmc addresses flow by in the log:

                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15 
                                PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
                                Mlid....................0xC005

                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15
                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
                                Mlid....................0xC007

                                MGID....................0xff12401bffff0000 : 0x00000000000a0a15
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC007

That looks like the SM set up a different MLID for the same group (one
port on MLID 0xC005 and the two other ports on MLID 0xC007).
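A check like this can be run over the whole log rather than by eye. The sketch below assumes the records look exactly like the excerpts in this mail (an MGID line, a PortGid line, then an Mlid line); the function names are hypothetical and it is not part of OpenSM.

```python
import re
from collections import defaultdict

# Matches "MGID....0x... : 0x...", "PortGid....0x... : 0x..." and "Mlid....0x..."
FIELD = re.compile(
    r"^\s*(MGID|PortGid|Mlid)\.+(0x[0-9a-fA-F]+)(?:\s*:\s*(0x[0-9a-fA-F]+))?")


def mlids_per_group(lines):
    """Map each MGID to {mlid: set of port GUIDs} seen in the log lines."""
    groups = defaultdict(lambda: defaultdict(set))
    mgid = port = None
    for line in lines:
        m = FIELD.match(line)
        if not m:
            continue
        name, first, second = m.groups()
        if name == "MGID":
            mgid, port = (first.lower(), (second or "").lower()), None
        elif name == "PortGid":
            port = (second or first).lower()   # keep the GUID half
        elif name == "Mlid" and mgid is not None:
            groups[mgid][first.lower()].add(port)
    return groups


def split_groups(groups):
    """Keep only the MGIDs that were handed more than one MLID."""
    return {mgid: dict(mlids) for mgid, mlids in groups.items()
            if len(mlids) > 1}


SAMPLE = """\
    MGID....................0xff12401bffff0000 : 0x00000000000a0a15
    PortGid.................0xfe80000000000000 : 0x0005ad000003cfb9
    Mlid....................0xC005
    MGID....................0xff12401bffff0000 : 0x00000000000a0a15
    PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
    Mlid....................0xC007
    MGID....................0xff12401bffff0000 : 0x00000000000a0a15
    PortGid.................0xfe80000000000000 : 0x0005ad000003d269
    Mlid....................0xC007
"""

suspect = split_groups(mlids_per_group(SAMPLE.splitlines()))
for mgid, mlids in suspect.items():
    print(mgid, {mlid: sorted(ports) for mlid, ports in mlids.items()})
```

On the three records above it reports the one group with MLIDs 0xc005 and 0xc007; feeding it open("osm.log") instead of SAMPLE should work the same way as long as the record layout holds.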

                                MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
                                PortGid.................0xfe80000000000000 : 0x0005ad0000047a81
                                Mlid....................0xC009

                                MGID....................0xff12401bffff0000 : 0x00000000000a0a0a
                                PortGid.................0xfe80000000000000 : 0x0005ad000003d269
                                Mlid....................0xC009

That one repeats a bunch of times. Several weirdnesses that need some
further investigation.

-- Hal




