[ofa-general] multicast group join limits -- test code
afriedle at open-mpi.org
Tue Aug 14 11:12:41 PDT 2007
I've attached a simple test program that should demonstrate the
limitations I'm seeing when joining multiple multicast groups; the idea
being to allow others to see the weirdness I'm seeing and make some
An MPI is needed to compile/run the test. No arguments are needed; the
test repeatedly joins groups (without leaving them) until an error
occurs, then intentionally hangs.
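For readers without the attachment, the join-until-error loop described above can be sketched roughly as follows. This is not the attached test code: `join_fn`, `join_until_error`, `fake_fabric`, and `fake_join` are hypothetical names, and the fake fabric merely stands in for whatever multicast join call a real test would make.

```c
/* Hypothetical join callback: returns 0 on success, nonzero on failure.
 * In a real test this would wrap the actual multicast join call. */
typedef int (*join_fn)(unsigned group_index, void *ctx);

/* Join groups 0, 1, 2, ... without leaving any of them, stopping at the
 * first failure or after max_groups attempts. Returns the number of
 * successful joins. */
unsigned join_until_error(join_fn join, void *ctx, unsigned max_groups)
{
    unsigned joined = 0;
    while (joined < max_groups && join(joined, ctx) == 0)
        joined++;
    return joined;
}

/* Stand-in for a fabric (assumption): refuses joins once `limit`
 * groups are in use. */
struct fake_fabric {
    unsigned limit;
    unsigned in_use;
};

int fake_join(unsigned group_index, void *ctx)
{
    struct fake_fabric *f = ctx;
    (void)group_index;          /* the fake ignores which group is asked for */
    if (f->in_use >= f->limit)
        return -1;              /* simulated join failure */
    f->in_use++;
    return 0;
}
```

Calling `join_until_error(fake_join, &fabric, 10000)` with `fabric.limit` set to some value returns the count at which the first join fails, which is the number the results below report for the real fabric.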
Here are some of the different behaviors I see with this test (OFED v1.2
is used throughout):
mpirun -np 1 ./jointest
On my 128-node machine 'odin' running OpenSM, I was able to join 891
groups quite a few times in a row. Then suddenly running the same test
again I was able to join only 5 groups. This behavior persists on this
node. I can go to another node on the same machine, and again be able
to join 891 groups. If I run the test separately on two different nodes
(that can still join 891 each), I am able to join a total of 891 groups
between both nodes before both tests error. If I run on one node that
errors after 5 groups and another that errors at 891 groups, the first
node joins 5 groups and the second joins 886 groups.
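The two-node numbers above are at least arithmetically consistent with a single pool of 891 groups shared across the fabric plus a separate per-node ceiling. The toy model below only illustrates that arithmetic; it is an assumption, not a diagnosis, and `shared_pool_joins` and its parameters are hypothetical names.

```c
/* Result of the toy model: groups joined by node A and node B. */
struct join_counts {
    unsigned a;
    unsigned b;
};

/* Two nodes take turns joining groups. Each join consumes one slot
 * from a shared pool (assumed), and each node also has its own
 * ceiling cap_a / cap_b (assumed). Stops when neither node can join. */
struct join_counts shared_pool_joins(unsigned cap_a, unsigned cap_b,
                                     unsigned pool)
{
    struct join_counts n = { 0, 0 };
    int progressed = 1;

    while (progressed) {
        progressed = 0;
        if (n.a < cap_a && n.a + n.b < pool) {
            n.a++;
            progressed = 1;
        }
        if (n.b < cap_b && n.a + n.b < pool) {
            n.b++;
            progressed = 1;
        }
    }
    return n;
}
```

With ceilings of 5 and 891 and a pool of 891, the model gives 5 and 886 joins, matching the observation; with both ceilings at 891, the two counts sum to 891. It does not explain why a node's ceiling would drop from 891 to 5.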
On a separate 8-node machine 'thor' running Cisco's SM on a Topspin
switch, I can join 14 groups.
mpirun -np 2 ./jointest (one node)
On odin I can join 892 groups; on thor I can join 5 groups.
mpirun -np 2 ./jointest (two nodes)
Odin was able to join 4 groups for the first 3 runs, then was able to
join 14 groups repeatedly. Thor is able to join 5 groups consistently.
None of these results seems to match any of the hardcoded limits
people have mentioned to me. I really need to figure out the cause of
this strange behavior, as most cases severely limit the usability of IB
multicast in MPI. Is my test code correct? Does anybody know what is
causing this, or where I could look/test to try and nail it down? I've
gotten suggestions that the problem lies in the SM, though I haven't
found anything blatantly wrong when reading relevant parts of the OpenSM
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 5907 bytes
Desc: not available