[ofa-general] Limited number of multicast groups that can be joined?

Andrew Friedley afriedle at open-mpi.org
Thu Jul 19 11:14:00 PDT 2007



Andrew Friedley wrote:
> Hal Rosenstock wrote:
>> I'm not quite parsing what is the same and what is different in the
>> results (and I presume the only variable is the SM).
> 
> Yes; this is confusing.  I'll try to summarize the various behaviors I'm
> getting.
> 
> First, there are two machines.  One has 8 nodes and runs a Topspin
> switch with the Cisco SM on it.  The other has 128 nodes and runs a
> Mellanox switch with OpenSM on a compute node.  OFED v1.2 is used on
> both.  Below is how many groups I can join using my test program
> (described elsewhere in the thread):
> 
> On the 8 node machine:
> 8 procs (one per node) -- 14 groups.
> 16 procs (two per node) -- 4 groups.
> 
> On the 128 node machine:
> 8 procs (one per node, 8 nodes used) -- 14 groups.
> 16 procs (two per node, 8 nodes used) -- unlimited? I stopped past 750.
> 
> Some peculiarities complicate this.  On either machine, I've noticed 
> that if I haven't been doing anything using IB multicast in, say, a day 
> (haven't tried to figure out exactly how long), in any run scenario 
> listed above, I can only join 4 groups.  I do a couple of runs where I hit 
> errors after 4 groups, and then I consistently get the group counts 
> above for the rest of the work day.
> 
> Second, in the cases in which I am able to join 14 groups, if I run my 
> test program twice simultaneously on the same nodes, I am able to join a 
> maximum of 14 groups total between the two running tests (as opposed to 
> 14 per test run).  Running the test twice simultaneously using a 
> disjoint set of nodes is not an issue.

So I sent that last email before I meant to :)  Need to eat..  I've 
managed to confuse myself a little here too -- it looks like changing 
from the Cisco SM to OpenSM did not change the behavior on the 8 node 
machine.  At least, I'm still getting the same results as above now that 
it's back on the Cisco SM.
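
For anyone who didn't see the earlier message, the test is essentially a
loop that joins one more multicast group per iteration through the RDMA
CM.  A rough sketch of that kind of loop (hypothetical, not the actual
test program -- the group addresses are made up, no QP is attached, and
error handling is minimal; builds against librdmacm with -lrdmacm):

/* Rough sketch of a join loop, not the real test program.  Each
 * iteration creates an rdma_cm_id, resolves a multicast IP, and joins
 * the group, counting successes until a join fails. */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

static int join_group(struct rdma_event_channel *ch, const char *ip)
{
    struct rdma_cm_id *id;
    struct rdma_cm_event *ev;
    struct sockaddr_in mc;

    memset(&mc, 0, sizeof mc);
    mc.sin_family = AF_INET;
    inet_pton(AF_INET, ip, &mc.sin_addr);

    if (rdma_create_id(ch, &id, NULL, RDMA_PS_UDP))
        return -1;
    /* Bind to a local device; a NULL source lets the CM pick one. */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *) &mc, 2000) ||
        rdma_get_cm_event(ch, &ev) ||
        ev->event != RDMA_CM_EVENT_ADDR_RESOLVED)
        return -1;
    rdma_ack_cm_event(ev);

    /* No QP attached here; the SA join still happens. */
    if (rdma_join_multicast(id, (struct sockaddr *) &mc, NULL) ||
        rdma_get_cm_event(ch, &ev))
        return -1;
    int ok = (ev->event == RDMA_CM_EVENT_MULTICAST_JOIN);
    if (!ok)
        fprintf(stderr, "join of %s failed: event %d, status %d\n",
                ip, ev->event, ev->status);
    rdma_ack_cm_event(ev);
    return ok ? 0 : -1;
}

int main(void)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    char ip[32];
    int i;

    for (i = 0; i < 1000; i++) {
        /* One made-up group address per iteration. */
        snprintf(ip, sizeof ip, "224.1.%d.%d", i / 256 + 1, i % 256);
        if (join_group(ch, ip))
            break;
    }
    printf("joined %d groups\n", i);
    return 0;
}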

Also, some newer results.  I had a long run going on the 128 node machine 
to see how many groups I really could join, and it just errored out 
after joining 892 groups successfully.  Specifically, I got an 
RDMA_CM_EVENT_MULTICAST_ERROR event containing status -22 ('Unknown 
error' according to strerror()).  errno is still set to 'Success'.  I 
don't have time to go look at the code to see where this came from right 
now, but does anyone know what it means?
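
One guess, and only a guess: in some paths rdma_cm seems to hand back the
status as a negative errno, in which case -22 would be EINVAL ('Invalid
argument') rather than an unknown error -- strerror() wants the positive
value.  Something along these lines (a sketch, assuming the status really
is a negated errno) in the event handler would show it:

#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

/* Sketch: assuming the status is a negated errno, negate it before
 * handing it to strerror(); -22 then reads as EINVAL. */
static void print_mcast_error(const struct rdma_cm_event *event)
{
    if (event->event == RDMA_CM_EVENT_MULTICAST_ERROR)
        fprintf(stderr, "multicast error: status %d (%s)\n",
                event->status, strerror(-event->status));
}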

Andrew

> 
>>> This makes me think the switch is involved, is this correct?
>>
>>
>> I doubt it. It is either the end station, the SM, or a combination of the two.
> 
> OK.
> 
>>> OK this makes sense, but I still don't see where all the time is going.
>>>   Should the fact that the switches haven't been reprogrammed since
>>> leaving the groups really affect how long it takes to do a subsequent
>>> join?  I'm not convinced.
>>
>>
>> It takes time for the SM to recalculate the multicast tree. While 
>> leaves can
>> be lazy, I forget whether joins are synchronous or not.
> 
> Is the algorithm for recalculating the tree documented at all?  Or, 
> where is the code for it (assuming I have access)?  I feel like I'm 
> missing something here that explains why it's so costly.
> 
> Andrew
> 
>>
>>> Is this time being consumed by the switches when they are asked to
>>> reprogram their tables (I assume some sort of routing table is used
>>> internally)?
>>
>>
>> This is relatively quick compared to the policy for the SM rerouting of
>> multicast based on joins/leaves/group creation/deletion.
> 
> OK.  Thanks for the insight.
> 
> Andrew


