[ofa-general] Limited number of multicasts groups that can be joined?
Sean Hefty
mshefty at ichips.intel.com
Thu Jun 28 14:23:02 PDT 2007
> Now the more interesting part. I'm now able to run on a 128 node
> machine using open SM running on a node (before, I was running on an 8
> node machine which I'm told is running the Cisco SM on a Topspin
> switch). On this machine, if I run my benchmark with two processes per
> node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able to join
> more than 750 groups simultaneously from one QP on each process. To make this
> stranger, I can join only 4 groups running the same thing on the 8-node
> machine.
Are the switches and HCAs in the two setups the same? If you run the
same SM on both clusters, do you see the same results?
> While doing so I noticed that the time from calling
> rdma_join_multicast() to the event arrival stayed fairly constant (in
> the .001sec range), while the time from the join call to actually
> receiving messages on the group steadily increased from around .1 secs
> to around 2.7 secs with 750+ groups. Furthermore, this time does not
> drop back to .1 secs if I stop the benchmark and run it (or any of my
> other multicast code) again. This is understandable within a single
> program run, but the fact that behavior persists across runs concerns me
> -- feels like a bug, but I don't have much concrete here.
Even after all nodes leave all multicast groups, I don't believe that
there's a requirement for the SA to reprogram the switches immediately.
So if the switches or their configuration are part of the problem, I
can imagine seeing issues persist between runs.
When rdma_join_multicast() reports the join event, it means one of two
things: either the SA has been notified of the join request, or, if the
port had already joined the group, a reference count on the group has
been incremented. In either case, the SA may still need time to program
the switch forwarding tables before traffic actually flows.
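To make the distinction concrete, here is a minimal sketch of the join
flow with librdmacm (link with -lrdmacm). It assumes the rdma_cm_id has
already been created on an event channel and the multicast address has
been resolved; QP setup and error details are omitted. The comment marks
exactly what the join event does and does not guarantee:

```c
/* Hedged sketch, not a complete program: assumes "id" was created with
 * rdma_create_id() on an event channel and bound/resolved, and that
 * "mcast_addr" already holds the multicast group address. */
#include <rdma/rdma_cma.h>
#include <stdio.h>

static int join_group(struct rdma_cm_id *id, struct sockaddr *mcast_addr)
{
	struct rdma_cm_event *event;

	if (rdma_join_multicast(id, mcast_addr, NULL))
		return -1;

	if (rdma_get_cm_event(id->channel, &event))
		return -1;

	if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
		rdma_ack_cm_event(event);
		return -1;
	}

	/* At this point the SA has accepted the request (or a local
	 * reference count was bumped).  It does NOT mean the switch
	 * forwarding tables are programmed yet, so messages sent to the
	 * group immediately after this event may still be dropped. */
	printf("joined: qpn 0x%x qkey 0x%x\n",
	       event->param.ud.qp_num, event->param.ud.qkey);
	rdma_ack_cm_event(event);
	return 0;
}
```

The gap you measured between the join event and the first received
message corresponds to the window noted in the comment above.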
- Sean