[ofa-general] Limited number of multicasts groups that can be joined?
Andrew Friedley
afriedle at open-mpi.org
Fri Jun 8 11:05:33 PDT 2007
I've run into a problem where it appears that I cannot join more than 14
multicast groups from a single HCA. I'm using the RDMA CM UD/multicast
interface from an OFED v1.2 nightly build, and using a '0' address when
joining to have the SM allocate an unused address. The first 14
rdma_join_multicast() calls succeed, a MULTICAST_JOIN event comes
through for each of them and everything works. But the 15th call to
rdma_join_multicast() returns -1 and sets errno to 99 (EADDRNOTAVAIL),
'Cannot assign requested address'.
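For anyone trying to reproduce this, here is a minimal sketch of a join loop that stops at the first failed join and records errno rather than hanging. It is not the code from my program: join_fn, join_until_failure() and fake_join() are hypothetical names made up for illustration. In a real test the callback would wrap rdma_join_multicast() and wait for the join event; fake_join() only imitates the failure described above (-1 with errno set to EADDRNOTAVAIL, which is errno 99 on most Linux architectures) so the loop can be tried without an HCA.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback type: performs one multicast join and returns 0 on
 * success, or -1 with errno set, mirroring how rdma_join_multicast() reports
 * failure. A real callback would wrap rdma_join_multicast() and wait for the
 * corresponding join event. */
typedef int (*join_fn)(void *ctx, int index);

/* Attempt up to max_groups joins and stop at the first failure.
 * Returns the number of successful joins; *err_out receives the errno of the
 * failed join, or 0 if every join succeeded. */
static int join_until_failure(join_fn join, void *ctx, int max_groups,
                              int *err_out)
{
    int joined = 0;

    *err_out = 0;
    for (int i = 0; i < max_groups; i++) {
        errno = 0;
        if (join(ctx, i) != 0) {
            int saved = errno;  /* save before any other library call */
            *err_out = saved;
            fprintf(stderr, "join %d failed: errno %d (%s)\n",
                    i + 1, saved, strerror(saved));
            break;
        }
        joined++;
    }
    return joined;
}

/* Stand-in for a real join, for testing without hardware (an assumption,
 * not real RDMA CM behavior): succeeds for the first *limit joins, then
 * fails with EADDRNOTAVAIL. */
static int fake_join(void *ctx, int index)
{
    int limit = *(int *)ctx;

    if (index >= limit) {
        errno = EADDRNOTAVAIL;
        return -1;
    }
    return 0;
}
```

Calling join_until_failure(fake_join, &limit, 32, &err) with limit set to 14 returns 14 and leaves EADDRNOTAVAIL in err, which matches the pattern I see on a single node.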
Note that I'm using a single QP per process to do all the joins. Things
get weirder if I run two instances of my program on the same node -- as
soon as the total across the two instances reaches 14, neither instance
can join any more groups. Also, right now my code hangs when this happens
-- if I kill off one of the two instances and run a third instance
(while leaving the other hung, holding some number of groups), the third
instance is not able to join ANY groups. The behavior resets when I
kill all instances.
Two instances running on separate nodes (on the same network) do not
appear to interfere with each other as described above, but each of them
still errors out on its 15th join.
This feels like a bug to me; regardless, a limit of 14 groups is WAY too
low. Any ideas what might be going on, or how I can work around it?
Andrew