[ofa-general] Limited number of multicast groups that can be joined?

Andrew Friedley afriedle at open-mpi.org
Thu Jul 19 10:58:42 PDT 2007


Hal Rosenstock wrote:
> I'm not quite parsing what is the same with what is different in the
> results (and I presume the only variable is SM).

Yes, this is confusing.  I'll try to summarize the various behaviors I'm 
getting.

First, there are two machines.  One has 8 nodes and runs a Topspin 
switch with the Cisco SM on it.  The other is 128 nodes and runs a 
Mellanox switch with Open SM on a compute node.  OFED v1.2 is used on 
both.  Below is how many groups I can join using my test program 
(described elsewhere in the thread):

On the 8 node machine:
8 procs (one per node) -- 14 groups.
16 procs (two per node) -- 4 groups.

On the 128 node machine:
8 procs (one per node, 8 nodes used) -- 14 groups.
16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.

Some peculiarities complicate this.  On either machine, I've noticed 
that if I haven't done anything with IB multicast for, say, a day 
(I haven't tried to figure out exactly how long), then in any run 
scenario listed above I can join only 4 groups.  I do a couple of runs 
where I hit errors after 4 groups, and then I consistently get the group 
counts above for the rest of the work day.

Second, in the cases in which I am able to join 14 groups, if I run my 
test program twice simultaneously on the same nodes, I am able to join a 
maximum of 14 groups total between the two running tests (as opposed to 
14 per test run).  Running the test twice simultaneously using a 
disjoint set of nodes is not an issue.

>> This makes me think the switch is involved, is this correct?
> 
> 
> I doubt it. It is either end station, SM, or a combination of the two.

OK.

>> OK this makes sense, but I still don't see where all the time is going.
>>   Should the fact that the switches haven't been reprogrammed since
>> leaving the groups really affect how long it takes to do a subsequent
>> join?  I'm not convinced.
> 
> 
> It takes time for the SM to recalculate the multicast tree. While leaves
> can be lazy, I forget whether joins are synchronous or not.

Is the algorithm for recalculating the tree documented at all?  Or, 
where is the code for it (assuming I have access)?  I feel like I'm 
missing something here that explains why it's so costly.
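
To make the question concrete, here is a minimal sketch of what 
"recalculating the multicast tree" could involve.  It is an illustration 
only: it assumes the SM builds a breadth-first spanning tree over the 
switch graph and prunes it to the switches that have group members.  
Nothing in this thread confirms that OpenSM or the Cisco SM works this 
way, and the names multicast_tree, links, and members are hypothetical.

```python
from collections import deque


def multicast_tree(links, members):
    """Return the links of a BFS tree that connects all member switches.

    links:   dict mapping a switch name to a list of neighbouring switches
    members: set of switches that have at least one port in the group
    (Hypothetical model; not taken from any real SM implementation.)
    """
    if not members:
        return set()

    # Pick an arbitrary but deterministic root among the member switches.
    root = min(members)

    # Breadth-first search over the whole fabric, recording each parent.
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in links[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    # Prune: keep only links on the path from each member back to the root.
    tree = set()
    for member in members:
        node = member
        while parent.get(node) is not None:
            tree.add(frozenset((node, parent[node])))
            node = parent[node]
    return tree


# Three switches: s1 is the spine, s2 and s3 hang off it.
links = {"s1": ["s2", "s3"], "s2": ["s1"], "s3": ["s1"]}
before_join = multicast_tree(links, {"s2"})
after_join = multicast_tree(links, {"s2", "s3"})
```

In this model every join or leave changes the member set, so the whole 
tree is recomputed and every switch on it would need its multicast 
forwarding entries rewritten.  That is one plausible way for the cost to 
scale with fabric size rather than with the single port that joined.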

Andrew

> 
>> Is this time being consumed by the switches when they are asked to
>> reprogram their tables (I assume some sort of routing table is used
>> internally)?
> 
> 
> This is relatively quick compared to the policy for the SM rerouting of
> multicast based on joins/leaves/group creation/deletion.

OK.  Thanks for the insight.

Andrew


