[ofa-general] Limited number of multicasts groups that can be joined?
Andrew Friedley
afriedle at open-mpi.org
Thu Jun 28 08:46:30 PDT 2007
Some updates on this problem.
The code I'm using to test/produce this behavior is an MPI program. MPI
is used for convenience of job startup and collection of results; the
actual test/benchmark uses straight RDMA CM & ibverbs. What I'm doing
is timing how long it takes to join and bring up a multicast group with
a varying number of processes and existing groups. One rank joins with
a '0' address to get a real address and MPI_Bcasts that address to the
other ranks, which then join the group. Meanwhile the root rank is
repeatedly sending a small ping message to the group. Every other rank
times from when it calls rdma_join_multicast() to the arrival of the
join event, and to when it first receives a message on that group. Once
completed, the process repeats N times, leaving all the groups joined.
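Concretely, each non-root rank does something like the following (a
minimal sketch, assuming a bound rdma_cm_id with an attached QP; the
function name and error handling are illustrative, not lifted from my
actual code):

#include <stdio.h>
#include <time.h>
#include <rdma/rdma_cma.h>

/* Join 'mcast_addr' on 'id' and return seconds from the join call to
 * the RDMA_CM_EVENT_MULTICAST_JOIN event, or -1.0 on error. */
static double join_and_time(struct rdma_cm_id *id,
                            struct sockaddr *mcast_addr)
{
    struct rdma_cm_event *event;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (rdma_join_multicast(id, mcast_addr, NULL)) {
        perror("rdma_join_multicast");
        return -1.0;
    }

    /* Block on the CM event channel until the join completes. */
    if (rdma_get_cm_event(id->channel, &event))
        return -1.0;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        fprintf(stderr, "unexpected event %d\n", event->event);
        rdma_ack_cm_event(event);
        return -1.0;
    }
    rdma_ack_cm_event(event);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

The second timestamp (first message received on the group) is taken the
same way, once a receive completion shows up on the QP's CQ.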
I'm now running OFED v1.2, and the behavior has not changed, though
I've noticed some other cases. First -- if I have not used multicast on
the network for a while, I'm able to join a total of only 4 groups with
my benchmark. After that, running it any number of times, I can join 14
groups as described below.
Now the more interesting part. I'm now able to run on a 128-node
machine using OpenSM running on a node (before, I was running on an
8-node machine which I'm told is running the Cisco SM on a Topspin
switch). On this machine, if I run my benchmark with two processes per
node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able to
join more than 750 groups simultaneously from one QP in each process.
To make this stranger, I can join only 4 groups running the same thing
on the 8-node machine.
While doing so I noticed that the time from calling
rdma_join_multicast() to the event arrival stayed fairly constant (in
the 0.001 sec range), while the time from the join call to actually
receiving messages on the group steadily increased from around 0.1 sec
to around 2.7 sec with 750+ groups. Furthermore, this time does not
drop back to 0.1 sec if I stop the benchmark and run it (or any of my
other multicast code) again. That would be understandable within a
single program run, but the fact that the behavior persists across runs
concerns me -- it feels like a bug, but I don't have much concrete here.
Sorry for the long email -- I'm trying to provide as much detail as
possible so this can get fixed. I'm really not sure where to start
looking on my own, so even some hints on where the problem(s) might lie
would be useful.
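For reference, the '0'-address join mentioned in my original message
(quoted below) looks roughly like this -- again a sketch, assuming the
group address is carried in a zeroed AF_INET6 sockaddr and reusing the
'id' from the sketch above; my real code may differ in the details:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

static int join_zero_addr(struct rdma_cm_id *id)
{
    struct sockaddr_in6 zero_addr;

    /* An all-zero address asks the SM to allocate an unused group. */
    memset(&zero_addr, 0, sizeof(zero_addr));
    zero_addr.sin6_family = AF_INET6;

    if (rdma_join_multicast(id, (struct sockaddr *)&zero_addr, NULL)) {
        /* The 15th call fails here: errno 99, EADDRNOTAVAIL. */
        perror("rdma_join_multicast");
        return -1;
    }
    return 0;
}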
Andrew
Andrew Friedley wrote:
> I've run into a problem where it appears that I cannot join more than 14
> multicast groups from a single HCA. I'm using the RDMA CM UD/multicast
> interface from an OFED v1.2 nightly build, and using a '0' address when
> joining to have the SM allocate an unused address. The first 14
> rdma_join_multicast() calls succeed, a MULTICAST_JOIN event comes
> through for each of them and everything works. But the 15th call to
> rdma_join_multicast() returns -1 and sets errno to 99 (EADDRNOTAVAIL),
> 'Cannot assign requested address'.
>
> Note that I'm using a single QP per process to do all the joins. Things
> get weirder if I run two instances of my program on the same node -- as
> soon as the total between the two instances reaches 14, neither
> instance can
> join any more groups. Also, right now my code hangs when this happens
> -- if I kill off one of the two instances and run a third instance
> (while leaving the other hung, holding some number of groups), the third
> instance is not able to join ANY groups. The behavior resets when I
> kill all instances.
>
> Two instances running on separate nodes (on the same network) do not
> appear to interfere with each other as described above; they do still
> error out on the 15th join.
>
> This feels like a bug to me; regardless, this limit is WAY too
> low. Any ideas what might be going on, or how I can work around it?
>
> Andrew