[ofa-general] Limited number of multicasts groups that can be joined?

Andrew Friedley afriedle at open-mpi.org
Thu Jul 19 10:13:15 PDT 2007


I was finally able to have the SM switched over from the Cisco SM on the 
switch to OpenSM running on a node.  Responses inline below.

Sean Hefty wrote:
>> Now the more interesting part.  I'm now able to run on a 128 node 
>> machine using open SM running on a node (before, I was running on an 8 
>> node machine which I'm told is running the Cisco SM on a Topspin 
>> switch).  On this machine, if I run my benchmark with two processes 
>> per node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able 
>> to join  > 750 groups simultaneously from one QP on each process.  To 
>> make this stranger, I can join only 4 groups running the same thing on 
>> the 8-node machine.
> 
> Are the switches and HCAs in the two setups the same?  If you run the 
> same SM on both clusters, do you see the same results?

The switches are different.  The 8 node machine uses a Topspin switch, 
while the 128 node machine uses a Mellanox switch.  Looking at `ibstat`, 
the HCAs appear to be the same (MT23108), though the HCAs on the 128 
node machine have firmware 3.2.0, while the 8 node machine has 3.5.0.  
Does this matter?

Running OpenSM now, I still do not see the same results.  Behavior is 
now the same as on the 128 node machine, except when running two 
processes per node: in that case I can join as many groups as I like on 
the 128 node machine, but on the 8 node machine I am still limited to 4 
groups.  This makes me think the switch is involved; is that correct?

> 
>> While doing so I noticed that the time from calling 
>> rdma_join_multicast() to the event arrival stayed fairly constant (in 
>> the .001sec range), while the time from the join call to actually 
>> receiving messages on the group steadily increased from around .1 secs 
>> to around 2.7 secs with 750+ groups.  Furthermore, this time does not 
>> drop back to .1 secs if I stop the benchmark and run it (or any of my 
>> other multicast code) again.  This is understandable within a single 
>> program run, but the fact that behavior persists across runs concerns 
>> me -- feels like a bug, but I don't have much concrete here.
> 
> Even after all nodes leave all multicast groups, I don't believe that 
> there's a requirement for the SA to reprogram the switches immediately. 
>  So if the switches or the configuration of the switches are part of the 
> problem, I can imagine seeing issues between runs.
> 
> When rdma_join_multicast() reports the join event, it means either: the 
> SA has been notified of the join request, or, if the port has already 
> joined the group, that a reference count on the group has been 
> incremented.  The SA may still require time to program the switch 
> forwarding tables.

OK, this makes sense, but I still don't see where all the time is 
going.  Should the fact that the switches haven't been reprogrammed 
since leaving the groups really affect how long it takes to do a 
subsequent join?  I'm not convinced.
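The timing I'm describing amounts to something like the sketch below.  
It assumes the setup from the earlier sketch, with receive buffers 
already posted on the QP and another process sending to each group (as 
in the benchmark); the function name and parameters are illustrative 
only.

#include <stdio.h>
#include <sys/time.h>
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

static double elapsed(const struct timeval *a, const struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) + (b->tv_usec - a->tv_usec) / 1e6;
}

/* Join one group and report both latencies: join call to CM event, and
 * join call to the first receive completion on the group. */
void time_one_join(struct rdma_cm_id *id, struct ibv_cq *cq,
                   struct sockaddr *mcast)
{
    struct rdma_cm_event *event;
    struct ibv_wc wc;
    struct timeval t_join, t_event, t_data;

    gettimeofday(&t_join, NULL);
    rdma_join_multicast(id, mcast, NULL);

    /* Join event arrival: the ~.001 sec interval, roughly constant. */
    rdma_get_cm_event(id->channel, &event);
    gettimeofday(&t_event, NULL);
    rdma_ack_cm_event(event);

    /* First message actually delivered on the group: the interval that
     * grows from ~.1 sec to ~2.7 sec as more groups are joined. */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
    gettimeofday(&t_data, NULL);

    printf("join->event %.6fs  join->first-recv %.6fs\n",
           elapsed(&t_join, &t_event), elapsed(&t_join, &t_data));
}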

Is this time being consumed by the switches when they are asked to 
reprogram their tables (I assume some sort of routing table is used 
internally)?  What could they be doing that takes so long?  Is it 
something that a firmware change on the switch could alleviate?

Andrew


