[ofa-general] Limited number of multicasts groups that can be joined?
Andrew Friedley
afriedle at open-mpi.org
Thu Jun 28 08:46:30 PDT 2007
Some updates on this problem.
The code I'm using to test/produce this behavior is an MPI program. MPI
is used for convenience of job startup and collection of results; the
actual test/benchmark uses straight RDMA CM & ibverbs. What I'm doing
is timing how long it takes to join and bring up a multicast group with
a varying number of processes and existing groups. One rank joins with
a '0' address to get a real address and MPI_Bcasts that address to the
other ranks, which then join the group. Meanwhile the root rank is
repeatedly sending a small ping message to the group. Every other rank
times from when it calls rdma_join_multicast() to the arrival of the
join event, and to when it first receives a message on that group. Once
completed, the process repeats N times, leaving all the groups joined.
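Concretely, each non-root rank does something like the following (a
minimal sketch, assuming a bound rdma_cm_id with an attached QP; the
function name and error handling are illustrative, not lifted from my
actual code):

#include <stdio.h>
#include <time.h>
#include <rdma/rdma_cma.h>

/* Join 'mcast_addr' on 'id' and return seconds from the join call to
 * the RDMA_CM_EVENT_MULTICAST_JOIN event, or -1.0 on error. */
static double join_and_time(struct rdma_cm_id *id,
                            struct sockaddr *mcast_addr)
{
    struct rdma_cm_event *event;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (rdma_join_multicast(id, mcast_addr, NULL)) {
        perror("rdma_join_multicast");
        return -1.0;
    }

    /* Block on the CM event channel until the join completes. */
    if (rdma_get_cm_event(id->channel, &event))
        return -1.0;
    if (event->event != RDMA_CM_EVENT_MULTICAST_JOIN) {
        fprintf(stderr, "unexpected event %d\n", event->event);
        rdma_ack_cm_event(event);
        return -1.0;
    }
    rdma_ack_cm_event(event);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

The second timestamp (first message received on the group) is taken the
same way, once a receive completion shows up on the QP's CQ.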
I'm now running OFED v1.2, and the behavior has not changed, though
I've noticed some other cases. First -- if I have not used multicast on
the network for a while, I'm able to join a total of only 4 groups with
my benchmark. After that, running it any number of times, I can join 14
groups as described below.
Now the more interesting part. I'm now able to run on a 128-node
machine using OpenSM running on a node (before, I was running on an
8-node machine which I'm told is running the Cisco SM on a Topspin
switch). On this machine, if I run my benchmark with two processes per
node (instead of one, i.e. mpirun -np 16 with 8 nodes), I'm able to
join more than 750 groups simultaneously from one QP in each process.
To make this stranger, I can join only 4 groups running the same thing
on the 8-node machine.
While doing so I noticed that the time from calling
rdma_join_multicast() to the event arrival stayed fairly constant (in
the 0.001 sec range), while the time from the join call to actually
receiving messages on the group steadily increased from around 0.1 sec
to around 2.7 sec with 750+ groups. Furthermore, this time does not
drop back to 0.1 sec if I stop the benchmark and run it (or any of my
other multicast code) again. That would be understandable within a
single program run, but the fact that the behavior persists across runs
concerns me -- it feels like a bug, but I don't have much concrete here.
Sorry for the long email -- I'm trying to provide as much detail as
possible so this can get fixed. I'm really not sure where to start
looking on my own, so even some hints on where the problem(s) might lie
would be useful.
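For reference, the '0'-address join mentioned in my original message
(quoted below) looks roughly like this -- again a sketch, assuming the
group address is carried in a zeroed AF_INET6 sockaddr and reusing the
'id' from the sketch above; my real code may differ in the details:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

static int join_zero_addr(struct rdma_cm_id *id)
{
    struct sockaddr_in6 zero_addr;

    /* An all-zero address asks the SM to allocate an unused group. */
    memset(&zero_addr, 0, sizeof(zero_addr));
    zero_addr.sin6_family = AF_INET6;

    if (rdma_join_multicast(id, (struct sockaddr *)&zero_addr, NULL)) {
        /* The 15th call fails here: errno 99, EADDRNOTAVAIL. */
        perror("rdma_join_multicast");
        return -1;
    }
    return 0;
}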
Andrew
Andrew Friedley wrote:
> I've run into a problem where it appears that I cannot join more than 14
> multicast groups from a single HCA. I'm using the RDMA CM UD/multicast
> interface from an OFED v1.2 nightly build, and using a '0' address when
> joining to have the SM allocate an unused address. The first 14
> rdma_join_multicast() calls succeed, a MULTICAST_JOIN event comes
> through for each of them and everything works. But the 15th call to
> rdma_join_multicast() returns -1 and sets errno to 99 (EADDRNOTAVAIL),
> 'Cannot assign requested address'.
>
> Note that I'm using a single QP per process to do all the joins. Things
> get weirder if I run two instances of my program on the same node -- as
> soon as the total between the two instances reaches 14, neither
> instance can
> join any more groups. Also, right now my code hangs when this happens
> -- if I kill off one of the two instances and run a third instance
> (while leaving the other hung, holding some number of groups), the third
> instance is not able to join ANY groups. The behavior resets when I
> kill all instances.
>
> Two instances running on separate nodes (on the same network) do not
> appear to interfere with each other as described above; they do still
> error out on the 15th join.
>
> This feels like a bug to me; regardless, this limit is WAY too
> low. Any ideas what might be going on, or how I can work around it?
>
> Andrew