[PATCH] opensm: Special Case the IPv6 Solicited Node Multicast address to use a single Mcast (WAS: Re: [ofa-general] IPoIB, OFED 1.2.5, and multicast groups.)

Hal Rosenstock hrosenstock at xsigo.com
Mon Jan 14 16:05:00 PST 2008


On Mon, 2008-01-14 at 15:35 -0800, Ira Weiny wrote:
> On Mon, 14 Jan 2008 12:23:34 -0800
> Hal Rosenstock <hrosenstock at xsigo.com> wrote:
> 
> > On Mon, 2008-01-14 at 10:51 -0800, Ira Weiny wrote:
> > > Hey Hal, thanks for the response.  Comments below.
> > > 
> > > On Mon, 14 Jan 2008 12:57:45 -0500
> > > "Hal Rosenstock" <hal.rosenstock at gmail.com> wrote:
> > > 
> > > > Hi Ira,
> > > > 
> > > > On 1/12/08, Ira Weiny <weiny2 at llnl.gov> wrote:
> > > > > And to further answer my question...[*]
> > > > >
> > > > > This seems to fix the problem for us, however I know that it could be better.
> > > > > > For example, it only takes care of partition 0xFFFF, and I think Jason's idea
> > > > > > of having, say, 16 Mcast Groups and hashing these MGIDs into them would be
> > > > > > nice.  But is this on the right track?  Am I missing some other place in the code?
> > > > 
> > > > This is a start.
> > > > 
> > > > Some initial comments on a quick scan of the approach used:
> > > > 
> > > > This assumes a homogeneous subnet (in terms of rates and MTUs). I
> > > > think that only groups which share the same rate and MTU can share the
> > > > same MLID.
> > > 
> > > Ah indeed this might be an issue.  This might not be the best place for the
> > > code.  :-(
> > > 
> > > > 
> > > > Also, MLIDs will now need to be use counted and only removed when all
> > > > the groups sharing that MLID are removed.
> > > 
> > > > I don't quite understand what you mean here.  There is still a 1:1 mapping of
> > > > MLIDs to MGIDs.
> > 
> > Didn't you just change that, so that many MGIDs now go to one MLID?
> 
> Ah, this is where the confusion has been.  No, this is _not_ what I did...  I
> see now; that is what was proposed in the thread a year ago, however, I don't
> think mapping many MGIDs to 1 MLID will work well.

Why not ?

It appears to be what you did: multiple MGIDs are mapped onto one MLID
(0xC002 in the case below). Am I mistaken?
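
As I read the patch, the match now reduces to roughly the following. This
is a standalone sketch rather than OpenSM code: the two helper names are
made up for illustration, only the two constants come from the patch, and
the values are assumed to already be in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants copied from the patch. Everything else here is illustrative. */
#define SPEC_PREFIX 0xff12601bffff0000ULL
#define INT_ID_MASK 0x00000001ff000000ULL

/* Hypothetical helper: true if the MGID has the form
 * 0xff12601bffff0000 : 0x00000001ffXXXXXX, i.e. an IPoIB IPv6
 * solicited-node group on pkey 0xFFFF. */
static bool is_solicited_node_mgid(uint64_t prefix, uint64_t interface_id)
{
	return prefix == SPEC_PREFIX &&
	       (interface_id & INT_ID_MASK) == INT_ID_MASK;
}

/* Hypothetical helper: true if a received MGID would be matched to an
 * existing group, either because the MGIDs are identical or because both
 * are solicited-node MGIDs (the special case added by the patch). */
static bool mgid_matches_group(uint64_t g_prefix, uint64_t g_id,
			       uint64_t rcv_prefix, uint64_t rcv_id)
{
	if (g_prefix == rcv_prefix && g_id == rcv_id)
		return true;
	return is_solicited_node_mgid(rcv_prefix, rcv_id) &&
	       is_solicited_node_mgid(g_prefix, g_id);
}
```

With this, a join for 0x00000001ff2265ed matches an existing group created
for 0x00000001ff0021e9, while the 0x0000000000000001 group under the same
prefix is left alone because it does not carry the 0x1ff bits.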

> What I did was to allow the first IPv6 request to create the group and then all
> other requests were added to this group.

You are using the word group loosely here and that is the source of the
confusion IMO. I think by group you mean MLID.

>   This sends all the neighbor discovery messages to all nodes on the network.

All nodes that are part of that MLID tree.

>   This might seem inefficient but should work.  (... and seems to.)

Sure; the hosts will filter based on MGID. The tradeoff is MLID
utilization versus fabric utilization.
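
A middle ground would be something like Jason's idea of hashing into a
small fixed number of shared MLIDs. A minimal sketch, assuming 16 buckets
and a plain modulo on the low 24 bits of the interface ID (the only bits
that differ between solicited-node MGIDs). The function name and the bucket
count are hypothetical and are not part of the patch:

```c
#include <stdint.h>

/* "Say 16" from the earlier mail; assumed here, would need to be tunable. */
#define NUM_BUCKETS 16u

/* Hypothetical helper: pick which of NUM_BUCKETS shared MLIDs a
 * solicited-node MGID would land on. Only the low 24 bits of the
 * interface ID vary between these MGIDs, so only they are hashed. */
static unsigned int solicited_node_bucket(uint64_t interface_id)
{
	uint32_t low24 = (uint32_t)(interface_id & 0xffffffULL);

	return low24 % NUM_BUCKETS;
}
```

More buckets means more MLIDs consumed but fewer uninterested hosts
receiving each neighbor discovery packet; one bucket is what the patch does.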

> > >  All of the requests for this type of MGRP join are routed to
> > > one group.  Therefore, I thought the same rules for deleting the group would
> > > apply; when all the members are gone it is removed?
> > 
> > Yes, the group may go but not the underlying MLID as there are other
> > groups which are sharing this. That's not what happens now.
> 
> No, since there is only 1 group in this implementation it should work like
> others.  The first node of this "mgid type" will create the group.  Others will
> join it and will continue to use it even if the creator leaves.

Are you saying all these groups appear as 1 "group" to OpenSM (as the
real groups are masked to the same value) ?
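
If several real groups did end up sharing one MLID, the use counting I
mentioned would look roughly like this. A minimal sketch with hypothetical
names (struct mlid_use, mlid_get, mlid_put); it is not existing OpenSM code:

```c
#include <stdint.h>

/* Hypothetical bookkeeping: how many groups currently sit on one MLID. */
struct mlid_use {
	uint16_t mlid;       /* e.g. 0xC002 */
	unsigned int groups; /* number of groups sharing this MLID */
};

/* A group starts using the MLID; returns the new use count. */
static unsigned int mlid_get(struct mlid_use *u)
{
	return ++u->groups;
}

/* A group stops using the MLID; returns the remaining use count.
 * The caller would release the MLID only when this returns 0. */
static unsigned int mlid_put(struct mlid_use *u)
{
	if (u->groups > 0)
		u->groups--;
	return u->groups;
}
```

The point is only that removing one group must not tear down the MLID
while other groups still use it.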

-- Hal

> Does this make more sense?
> 
> Ira
> 
> > 
> > >   Just to be clear, after
> > > this patch the mgroups are:
> > > 
> > > 09:36:40 > saquery -g
> > > MCMemberRecord group dump:
> > >                 MGID....................0xff12401bffff0000 : 0x00000000ffffffff
> > >                 Mlid....................0xC000
> > >                 Mtu.....................0x84
> > >                 pkey....................0xFFFF
> > >                 Rate....................0x83
> > > MCMemberRecord group dump:
> > >                 MGID....................0xff12401bffff0000 : 0x0000000000000001
> > >                 Mlid....................0xC001
> > >                 Mtu.....................0x84
> > >                 pkey....................0xFFFF
> > >                 Rate....................0x83
> > > MCMemberRecord group dump:
> > >                 MGID....................0xff12601bffff0000 : 0x00000001ff0021e9
> > >                 Mlid....................0xC002
> > >                 Mtu.....................0x84
> > >                 pkey....................0xFFFF
> > >                 Rate....................0x83
> > > MCMemberRecord group dump:
> > >                 MGID....................0xff12601bffff0000 : 0x0000000000000001
> > >                 Mlid....................0xC003
> > >                 Mtu.....................0x84
> > >                 pkey....................0xFFFF
> > >                 Rate....................0x83
> > > 
> > > All of these requests are added to the
> > >    MGID....................0xff12601bffff0000 : 0x00000001ff0021e9
> > >    Mlid....................0xC002
> > > group.  But as you say, how do we determine that the pkey, mtu, and rate are
> > > valid?  :-/
> > > 
> > > But here is a question:
> > > 
> > > What happens if someone with an incorrect MTU tries to join the
> > >    MGID....................0xff12401bffff0000 : 0x0000000000000001
> > > group?  Wouldn't this code return this mgrp pointer and the subsequent MTU and
> > > rate checks fail?  I seem to recall a thread discussing this before.  I don't
> > > remember what the outcome was.  I seem to remember the question was if OpenSM
> > > should create/modify a group to the "lowest common" MTU/Rate, and succeed all
> > > the joins, vs enforcing the faster MTU/Rate and failing the joins.
> > 
> > Yes, the join would fail, but I don't think that's what we would want.
> > The alternative with the patch is to make it the lowest rate but there
> > is a minimum MTU which might not be right.
> > 
> > > > I think this is a policy and rather than this always being the case,
> > > > there should be a policy parameter added to OpenSM for this. IMO
> > > > default should be to not do this.
> > > 
> > > Yes, for sure there needs to be some options to control the behavior.
> > > 
> > > > 
> > > > Maybe more later...
> > > 
> > > Thanks again,
> > > Ira
> > > 
> > > > 
> > > > -- Hal
> > > > 
> > > > > Thanks,
> > > > > Ira
> > > > >
> > > > > [*] Again I apologize for the spam but we were in a bit of a panic as we only
> > > > > have the big system for the weekend and IB was not part of the test...  ;-)
> > > > >
> > > > > > From 35e35a9534bd49147886ac93ab1601acadcdbe26 Mon Sep 17 00:00:00 2001
> > > > > From: Ira K. Weiny <weiny2 at llnl.gov>
> > > > > Date: Fri, 11 Jan 2008 22:58:19 -0800
> > > > > Subject: [PATCH] Special Case the IPv6 Solicited Node Multicast address to use a single Mcast
> > > > > Group.
> > > > >
> > > > > Signed-off-by: root <weiny2 at llnl.gov>
> > > > > ---
> > > > >  opensm/opensm/osm_sa_mcmember_record.c |   30 +++++++++++++++++++++++++++++-
> > > > >  opensm/opensm/osm_sa_path_record.c     |   31 ++++++++++++++++++++++++++++++-
> > > > >  2 files changed, 59 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/opensm/opensm/osm_sa_mcmember_record.c b/opensm/opensm/osm_sa_mcmember_record.c
> > > > > index 8eb97ad..6bcc124 100644
> > > > > --- a/opensm/opensm/osm_sa_mcmember_record.c
> > > > > +++ b/opensm/opensm/osm_sa_mcmember_record.c
> > > > > @@ -124,9 +124,37 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context)
> > > > >        /* compare entire MGID so different scope will not sneak in for
> > > > >           the same MGID */
> > > > >        if (memcmp(&p_mgrp->mcmember_rec.mgid,
> > > > > -                  &p_recvd_mcmember_rec->mgid, sizeof(ib_gid_t)))
> > > > > +                  &p_recvd_mcmember_rec->mgid, sizeof(ib_gid_t))) {
> > > > > +
> > > > > +               /* Special Case IPV6 Multicast Loopback addresses */
> > > > > +               /* 0xff12601bffff0000 : 0x00000001ffXXXXXX */
> > > > > +#define SPEC_PREFIX (0xff12601bffff0000)
> > > > > +#define INT_ID_MASK (0x00000001ff000000)
> > > > > +               uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> > > > > +               uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> > > > > +               uint64_t rcv_prefix = cl_ntoh64(p_recvd_mcmember_rec->mgid.unicast.prefix);
> > > > > +               uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mcmember_rec->mgid.unicast.interface_id);
> > > > > +
> > > > > +               if (rcv_prefix == SPEC_PREFIX
> > > > > +                       &&
> > > > > +                       (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK) {
> > > > > +
> > > > > +                       if ((g_prefix == rcv_prefix)
> > > > > +                               &&
> > > > > +                               (g_interface_id & INT_ID_MASK) ==
> > > > > +                                       (rcv_interface_id & INT_ID_MASK)
> > > > > +                               ) {
> > > > > +                               osm_log(sa->p_log, OSM_LOG_INFO,
> > > > > +                                       "Special Case Mcast Join for MGID "
> > > > > +                                       " MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n",
> > > > > +                                       rcv_prefix, rcv_interface_id);
> > > > > +                               goto match;
> > > > > +                       }
> > > > > +               }
> > > > >                return;
> > > > > +       }
> > > > >
> > > > > +match:
> > > > >        if (p_ctxt->p_mgrp) {
> > > > >                osm_log(sa->p_log, OSM_LOG_ERROR,
> > > > >                        "__search_mgrp_by_mgid: ERR 1B03: "
> > > > > diff --git a/opensm/opensm/osm_sa_path_record.c b/opensm/opensm/osm_sa_path_record.c
> > > > > index 749a936..469773a 100644
> > > > > --- a/opensm/opensm/osm_sa_path_record.c
> > > > > +++ b/opensm/opensm/osm_sa_path_record.c
> > > > > @@ -1536,8 +1536,37 @@ __search_mgrp_by_mgid(IN cl_map_item_t * const p_map_item, IN void *context)
> > > > >
> > > > >        /* compare entire MGID so different scope will not sneak in for
> > > > >           the same MGID */
> > > > > -       if (memcmp(&p_mgrp->mcmember_rec.mgid, p_recvd_mgid, sizeof(ib_gid_t)))
> > > > > +       if (memcmp(&p_mgrp->mcmember_rec.mgid, p_recvd_mgid, sizeof(ib_gid_t))) {
> > > > > +
> > > > > +               /* Special Case IPV6 Multicast Loopback addresses */
> > > > > +               /* 0xff12601bffff0000 : 0x00000001ffXXXXXX */
> > > > > +#define SPEC_PREFIX (0xff12601bffff0000)
> > > > > +#define INT_ID_MASK (0x00000001ff000000)
> > > > > +               uint64_t g_prefix = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.prefix);
> > > > > +               uint64_t g_interface_id = cl_ntoh64(p_mgrp->mcmember_rec.mgid.unicast.interface_id);
> > > > > +               uint64_t rcv_prefix = cl_ntoh64(p_recvd_mgid->unicast.prefix);
> > > > > +               uint64_t rcv_interface_id = cl_ntoh64(p_recvd_mgid->unicast.interface_id);
> > > > > +
> > > > > +               if (rcv_prefix == SPEC_PREFIX
> > > > > +                       &&
> > > > > +                       (rcv_interface_id & INT_ID_MASK) == INT_ID_MASK) {
> > > > > +
> > > > > +                       if ((g_prefix == rcv_prefix)
> > > > > +                               &&
> > > > > +                               (g_interface_id & INT_ID_MASK) ==
> > > > > +                                       (rcv_interface_id & INT_ID_MASK)
> > > > > +                               ) {
> > > > > +                               osm_log(sa->p_log, OSM_LOG_INFO,
> > > > > +                                       "Special Case Mcast Join for MGID "
> > > > > +                                       " MGID 0x%016"PRIx64" : 0x%016"PRIx64"\n",
> > > > > +                                       rcv_prefix, rcv_interface_id);
> > > > > +                               goto match;
> > > > > +                       }
> > > > > +               }
> > > > >                return;
> > > > > +       }
> > > > > +
> > > > > +match:
> > > > >
> > > > >  #if 0
> > > > >        for (i = 0;
> > > > > --
> > > > > 1.5.1
> > > > >
> > > > >
> > > > >
> > > > > On Fri, 11 Jan 2008 22:04:56 -0800
> > > > > Ira Weiny <weiny2 at llnl.gov> wrote:
> > > > >
> > > > > > Ok,
> > > > > >
> > > > > > I found my own answer.  Sorry for the spam.
> > > > > >
> > > > > > http://lists.openfabrics.org/pipermail/general/2006-November/029617.html
> > > > > >
> > > > > > Sorry,
> > > > > > Ira
> > > > > >
> > > > > >
> > > > > > On Fri, 11 Jan 2008 19:36:57 -0800
> > > > > > Ira Weiny <weiny2 at llnl.gov> wrote:
> > > > > >
> > > > > > > > I don't really understand the inner workings of IPoIB so forgive me if this is a
> > > > > > > really stupid question but:
> > > > > > >
> > > > > > >    Is it a bug that there is a Multicast group created for every node in our
> > > > > > >    clusters?
> > > > > > >
> > > > > > > If not a bug why is this done?  We just tried to boot on a 1151 node cluster
> > > > > > > and opensm is complaining there are not enough multicast groups.
> > > > > > >
> > > > > > >    Jan 11 18:30:42 728984 [40C05960] -> __get_new_mlid: ERR 1B23: All available:1024 mlids are taken
> > > > > > >    Jan 11 18:30:42 729050 [40C05960] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> > > > > > >    Jan 11 18:30:42 730647 [40401960] -> __get_new_mlid: ERR 1B23: All available:1024 mlids are taken
> > > > > > >    Jan 11 18:30:42 730691 [40401960] -> osm_mcmr_rcv_create_new_mgrp: ERR 1B19: __get_new_mlid failed
> > > > > > >
> > > > > > >
> > > > > > > Here is the output from my small test cluster:  (ibnodesinmcast uses saquery a
> > > > > > > couple of times to print this nice report.)
> > > > > > >
> > > > > > >
> > > > > > >    19:17:24 > whatsup
> > > > > > >    up:   9: wopr[0-7],wopri
> > > > > > >    down: 0:
> > > > > > >    root at wopri:/tftpboot/images
> > > > > > >    19:25:03 > ibnodesinmcast -g
> > > > > > >    0xC000 (0xff12401bffff0000 : 0x00000000ffffffff)
> > > > > > >       In  9: wopr[0-7],wopri
> > > > > > >       Out 0: 0
> > > > > > >    0xC001 (0xff12401bffff0000 : 0x0000000000000001)
> > > > > > >       In  9: wopr[0-7],wopri
> > > > > > >       Out 0: 0
> > > > > > >    0xC002 (0xff12601bffff0000 : 0x00000001ff2265ed)
> > > > > > >       In  1: wopr3
> > > > > > >       Out 8: wopr[0-2,4-7],wopri
> > > > > > >    0xC003 (0xff12601bffff0000 : 0x0000000000000001)
> > > > > > >       In  9: wopr[0-7],wopri
> > > > > > >       Out 0: 0
> > > > > > >    0xC004 (0xff12601bffff0000 : 0x00000001ff222729)
> > > > > > >       In  1: wopr4
> > > > > > >       Out 8: wopr[0-3,5-7],wopri
> > > > > > >    0xC005 (0xff12601bffff0000 : 0x00000001ff219e65)
> > > > > > >       In  1: wopri
> > > > > > >       Out 8: wopr[0-7]
> > > > > > >    0xC006 (0xff12601bffff0000 : 0x00000001ff00232d)
> > > > > > >       In  1: wopr6
> > > > > > >       Out 8: wopr[0-5,7],wopri
> > > > > > >    0xC007 (0xff12601bffff0000 : 0x00000001ff002325)
> > > > > > >       In  1: wopr7
> > > > > > >       Out 8: wopr[0-6],wopri
> > > > > > >    0xC008 (0xff12601bffff0000 : 0x00000001ff228d35)
> > > > > > >       In  1: wopr1
> > > > > > >       Out 8: wopr[0,2-7],wopri
> > > > > > >    0xC009 (0xff12601bffff0000 : 0x00000001ff2227f1)
> > > > > > >       In  1: wopr2
> > > > > > >       Out 8: wopr[0-1,3-7],wopri
> > > > > > >    0xC00A (0xff12601bffff0000 : 0x00000001ff219ef1)
> > > > > > >       In  1: wopr0
> > > > > > >       Out 8: wopr[1-7],wopri
> > > > > > >    0xC00B (0xff12601bffff0000 : 0x00000001ff0021e9)
> > > > > > >       In  1: wopr5
> > > > > > >       Out 8: wopr[0-4,6-7],wopri
> > > > > > >
> > > > > > >
> > > > > > > > Each of these MGIDs with the prefix (0xff12601bffff0000) has just one node in
> > > > > > > > it and represents an IPv6 address.  Could you turn off IPv6 with the latest
> > > > > > > > IPoIB?
> > > > > > >
> > > > > > > In a bind,
> > > > > > > Ira
> > > > > > > _______________________________________________
> > > > > > > general mailing list
> > > > > > > general at lists.openfabrics.org
> > > > > > > http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> > > > > > >
> > > > > > > To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
> > > > >
> > > > >
> > > > >


