[openib-general] IPoIB still not working
Roland Dreier
roland at topspin.com
Tue Dec 7 18:21:42 PST 2004
Robert> In the failing case, ipoib sends 2 MCM messages that look
Robert> similar with no errors reported. However, in the failing
Robert> case ipoib continues to send MCM messages that opensm
Robert> rejects. In the failing case there are a couple of
Robert> differences, first the MGID lower 32-bits appear to be
Robert> 0xffffffff in the passing case and something else when it
Robert> fails. Second, it appears that perhaps the opensm is
Robert> rejecting the messages because of a bug where the scope
Robert> and join fields are reversed when extracted from the
Robert> mad. In the passing case, since the lower 32 bits of the
Robert> mgid are 0xffffffff, you never get to the code that
Robert> checks the join member. Someone that understands opensm
Robert> should look at this, but Sean I think it may be wrong.
I think the difference is not 32 bit vs. 64 bit but no IPv6 vs IPv6.
It looks like your 32 bit hosts do not have IPv6 support turned on, so
IPoIB only joins groups with MGIDs starting ff12:401b. The 64 bit
host does have IPv6 and tries to join its solicited-node group
(messages about ff12:601b:ffff:0:0:1:ffd2:58f1 in mcast-64.log) and
the IPv6 all nodes group (messages about ff12:601b:ffff:0:0:0:0:1 in
osm-64bit.log). Since no one has created this group yet, OpenSM looks
at the join state field. As you say, there seems to be a bug in
OpenSM in how it interprets "ScopeState" (JoinState is the low nibble,
and OpenSM dumps the byte as 0x01, so it seems OpenSM is receiving a
correct FullMember request).
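
To make the nibble layout concrete, here is a minimal sketch in Python of how the one-byte ScopeState field would be split. It assumes Scope occupies the high nibble (the text above only states that JoinState is the low nibble) and that FullMember is bit 0 of JoinState, which matches the 0x01 value OpenSM dumps. The function names are hypothetical and are not OpenSM code. The "reversed" variant only illustrates the suspected bug, not what OpenSM actually does.

```python
FULL_MEMBER = 0x1  # assumed: FullMember is bit 0 of JoinState


def split_scope_state(scope_state):
    """Return (scope, join_state) from the one-byte ScopeState field."""
    scope = (scope_state >> 4) & 0x0F  # assumed: Scope is the high nibble
    join_state = scope_state & 0x0F    # JoinState is the low nibble
    return scope, join_state


def split_scope_state_reversed(scope_state):
    """Deliberately wrong split, with the two nibbles swapped."""
    join_state = (scope_state >> 4) & 0x0F
    scope = scope_state & 0x0F
    return scope, join_state


def is_full_member(join_state):
    """True if the FullMember bit is set in JoinState."""
    return bool(join_state & FULL_MEMBER)


if __name__ == "__main__":
    for split in (split_scope_state, split_scope_state_reversed):
        scope, join_state = split(0x01)
        print(split.__name__, hex(scope), hex(join_state),
              is_full_member(join_state))
```

With the nibbles swapped, a ScopeState of 0x01 yields a JoinState of 0, so a check for the FullMember bit would fail even though the request on the wire is correct.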
The joins of the IPv4 broadcast group (ff12:401b:ffff:0:0:0:ffff:ffff)
and IPv4 all nodes group (ff12:401b:ffff:0:0:0:0:1) succeed because
presumably OpenSM has already created these groups.
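
For reading the logs, a small sketch like the following can sort MGIDs into IPv4 and IPv6 groups. It relies only on the prefixes quoted above (401b for IPv4, 601b for IPv6 in the second 16-bit word) and on the MGID being printed in IPv6-style colon notation. classify_mgid is a hypothetical helper, not part of IPoIB or OpenSM.

```python
import ipaddress

# Second 16-bit word of the MGID, as seen in the groups quoted above.
SIGNATURES = {0x401B: "IPv4", 0x601B: "IPv6"}


def classify_mgid(mgid_text):
    """Return "IPv4", "IPv6" or "unknown" for an IPoIB MGID string."""
    # The MGID is 128 bits and is logged like an IPv6 address, so the
    # standard library parser gives us its 16 raw bytes.
    raw = ipaddress.IPv6Address(mgid_text).packed
    signature = int.from_bytes(raw[2:4], "big")
    return SIGNATURES.get(signature, "unknown")


if __name__ == "__main__":
    for mgid in ("ff12:401b:ffff:0:0:0:ffff:ffff",
                 "ff12:401b:ffff:0:0:0:0:1",
                 "ff12:601b:ffff:0:0:1:ffd2:58f1",
                 "ff12:601b:ffff:0:0:0:0:1"):
        print(mgid, classify_mgid(mgid))
```

Running this over the MGIDs in the failing joins should show whether they are all IPv6 groups, which is what the explanation above predicts.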
Robert> This however does not explain why in the failing case,
Robert> ipoib continues to try to join the mcast group unless it
Robert> is having difficulties after trying to join the group and
Robert> decides to re-try, with the subsequent re-tries to join
Robert> being failed by opensm.
IPoIB is dumb -- when it fails to join a multicast group, it just
keeps trying.
- R.