[ofa-general] Limited number of multicasts groups that can be joined?

Hal Rosenstock hal.rosenstock at gmail.com
Thu Jul 19 11:14:12 PDT 2007


Andrew,

On 7/19/07, Andrew Friedley <afriedle at open-mpi.org> wrote:
>
> Hal Rosenstock wrote:
> > I'm not quite parsing what is the same with what is different in the
> > results
> > (and I presume the only variable is SM).
>
> Yes; this is confusing, I'll try to summarize the various behaviors I'm
> getting.
>
> First, there are two machines.  One has 8 nodes and runs a Topspin
> switch with the Cisco SM on it.  The other is 128 nodes and runs a
> Mellanox switch with Open SM on a compute node.  OFED v1.2 is used on
> both.  Below is how many groups I can join using my test program
> (described elsewhere in the thread)
>
> On the 8 node machine:
> 8 procs (one per node) -- 14 groups.
> 16 procs (two per node) -- 4 groups.
>
> On the 128 node machine:
> 8 procs (one per node, 8 nodes used) -- 14 groups.
> 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.
>
> Some peculiarities complicate this.  On either machine, I've noticed
> that if I haven't been doing anything using IB multicast in say a day
> (haven't tried to figure out exactly how long), in any run scenario
> listed above, I can join 4 groups.  I do a couple runs where I hit
> errors after 4 groups, and then I consistently get the group counts
> above for the rest of the work day.
>
> Second, in the cases in which I am able to join 14 groups, if I run my
> test program twice simultaneously on the same nodes, I am able to join a
> maximum of 14 groups total between the two running tests (as opposed to
> 14 per test run).  Running the test twice simultaneously using a
> disjoint set of nodes is not an issue.


Thanks. I can only comment on the OpenSM configuration and on SMs in
general, so I'm still not sure which limits you are hitting; there may be
more than one at play. Based on a previous email, some seemed to be end
node (HCA) related.

> >> This makes me think the switch is involved, is this correct?
> >
> > I doubt it. It is either end station, SM, or a combination of the two.
>
> OK.
>
> >> OK, this makes sense, but I still don't see where all the time is going.
> >> Should the fact that the switches haven't been reprogrammed since
> >> leaving the groups really affect how long it takes to do a subsequent
> >> join?  I'm not convinced.
> >
> >
> > It takes time for the SM to recalculate the multicast tree. While leaves
> > can
> > be lazy, I forget whether joins are synchronous or not.
>
> Is the algorithm for recalculating the tree documented at all?  Or,
> where is the code for it (assuming I have access)?  I feel like I'm
> missing something here that explains why it's so costly.


I'm afraid it is just the code AFAIK :-(

-- Hal

>
> >
> >> Is this time being consumed by the switches when they are asked to
> >> reprogram their tables (I assume some sort of routing table is used
> >> internally)?
> >
> >
> > This is relatively quick compared to the policy for the SM rerouting of
> > multicast based on joins/leaves/group creation/deletion.
>
> OK.  Thanks for the insight.
>
> Andrew
>