[openib-general] IPoIB still not working

Roland Dreier roland at topspin.com
Tue Dec 7 18:21:42 PST 2004


    Robert> In the failing case, ipoib sends 2 MCM messages that look
    Robert> similar with no errors reported. However, in the failing
    Robert> case ipoib continues to send MCM messages that opensm
    Robert> rejects. In the failing case there are a couple of
    Robert> differences, first the MGID lower 32-bits appear to be
    Robert> 0xffffffff in the passing case and something else when it
    Robert> fails. Second, it appears that perhaps the opensm is
    Robert> rejecting the messages because of a bug where the scope
    Robert> and join fields are reversed when extracted from the
    Robert> mad. In the passing case, since the lower 32 bits of the
    Robert> mgid are 0xffffffff, you never get to the code that
    Robert> checks the join member. Someone that understands opensm
    Robert> should look at this, but Sean I think it may be wrong.

I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6.

It looks like your 32 bit hosts do not have IPv6 support turned on, so
IPoIB only joins groups with MGIDs starting ff12:401b.  The 64 bit
host does have IPv6 and tries to join its solicited-node group
(messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and
the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in
osm-64bit.log).  Since no one has created this group yet, OpenSM looks
at the join state field.  As you say, there seems to be a bug in
OpenSM in how it interprets "ScopeState" (JoinState is the low nibble,
and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a
correct FullMember request).
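
To make the nibble split concrete, here is a rough sketch in Python.
The function name and the FULL_MEMBER constant are made up for
illustration and are not OpenSM code; the only things taken from the
discussion above are that JoinState is the low nibble of the byte and
that the byte OpenSM dumps is 0x01.

```python
FULL_MEMBER = 0x1  # assumed name for JoinState bit 0 (FullMember)


def split_scope_state(byte):
    """Split a ScopeState byte into (scope, join_state).

    Scope is taken from the high nibble and JoinState from the low nibble.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("ScopeState must fit in one byte")
    scope = (byte >> 4) & 0x0F  # high nibble
    join_state = byte & 0x0F    # low nibble
    return scope, join_state


scope, join_state = split_scope_state(0x01)
print(scope, join_state, bool(join_state & FULL_MEMBER))
```

If the two nibbles were extracted the other way around, the same 0x01
byte would read as JoinState 0, with no FullMember bit set, which would
fit the rejections in the log.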

The joins of the IPv4 broadcast group (ff12:401b:ffff:0:0:0:ffff:ffff)
and IPv4 all nodes group (ff12:401b:ffff:0:0:0:0:1) succeed because
presumably OpenSM has already created these groups.
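
A quick way to sort log entries into IPv4 and IPv6 joins is to look at
the second 16-bit word of the MGID (401b vs. 601b in the groups listed
above). The following is only a sketch: classify_mgid and the
SIGNATURES table are hypothetical helpers, not part of IPoIB or OpenSM.

```python
import ipaddress

# Second 16-bit word of the MGID, as seen in the groups listed above.
SIGNATURES = {0x401B: "IPv4", 0x601B: "IPv6"}


def classify_mgid(mgid):
    """Return "IPv4", "IPv6" or "unknown" for an MGID in IPv6 notation."""
    raw = ipaddress.IPv6Address(mgid).packed     # 16 bytes, network order
    signature = int.from_bytes(raw[2:4], "big")  # bytes 2-3 of the MGID
    return SIGNATURES.get(signature, "unknown")


for mgid in (
    "ff12:401b:ffff:0:0:0:ffff:ffff",   # IPv4 broadcast group
    "ff12:401b:ffff:0:0:0:0:1",         # IPv4 all nodes group
    "ff12:601b:ffff:0:0:1:ffd2:58f1",   # IPv6 solicited-node group
    "ff12:601b:ffff:0:0:0:0:1",         # IPv6 all nodes group
):
    print(mgid, "->", classify_mgid(mgid))
```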

    Robert> This however does not explain why in the failing case,
    Robert> ipoib continues to try to join the mcast group unless it
    Robert> is having difficulties after trying to join the group and
    Robert> decides to re-try, with the subsequent re-tries to join
    Robert> being failed by opensm.

IPoIB is dumb -- when it fails to join a multicast group, it just
keeps trying.

 - R.


