[ofa-general] multicast group join limits -- test code

Andrew Friedley afriedle at open-mpi.org
Tue Aug 14 11:12:41 PDT 2007


I've attached a simple test program that should demonstrate the 
limitations I'm seeing when joining multiple multicast groups, so that 
others can reproduce the odd behavior and help make some progress.

An MPI implementation is needed to compile and run the test.  It takes 
no arguments; it repeatedly joins groups (without leaving them) until an 
error occurs, then intentionally hangs.
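
For readers without the attachment, the control flow described above 
amounts to roughly the following.  This is only a behavioral sketch in 
Python, not the attached source: `join_until_error`, `try_join`, and 
`fake_join` are hypothetical names, and `try_join` stands in for 
whatever call performs the real multicast join.

```python
def join_until_error(try_join, max_attempts=100000):
    """Join groups 0, 1, 2, ... without leaving any; stop at the first failure.

    try_join(index) is a hypothetical callback that returns True when the
    join for group `index` succeeds and False when it fails.
    Returns the number of groups joined before the first error.
    """
    joined = 0
    for index in range(max_attempts):
        if not try_join(index):
            break  # first failed join; the real test hangs here on purpose
        joined += 1
    return joined


# Illustrative stand-in for a fabric that refuses every join past `limit`.
def fake_join(index, limit=891):
    return index < limit
```

Calling `join_until_error(fake_join)` counts how many joins the stand-in 
accepts; the number reported in each scenario below is the equivalent 
count from the real test.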

Here are some of the different behaviors I see with this test (OFED 
v1.2 is used throughout):

mpirun -np 1 ./jointest

On my 128 node machine 'odin' running OpenSM, I was able to join 891 
groups quite a few times in a row.  Then suddenly running the same test 
again I was able to join only 5 groups.  This behavior persists on this 
node.  I can go to another node on the same machine, and again be able 
to join 891 groups.  If I run the test separately on two different nodes 
(that can still join 891 each), I am able to join a total of 891 groups 
between both nodes before both tests error.  If I run on one node that 
errors after 5 groups and another that errors at 891 groups, the first 
node joins 5 groups and the second joins 886 groups.
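
The two-node numbers above (891 in total, and 5 + 886) are at least 
consistent with a single budget shared across the fabric, plus a 
separate lower ceiling on the affected node.  The sketch below checks 
only that arithmetic.  It is a toy model under that assumption, not a 
description of what the SM or HCA actually does, and 
`simulate_shared_limit` is a hypothetical helper.

```python
def simulate_shared_limit(total_budget, per_node_caps):
    """Round-robin joins across nodes that draw from one shared budget.

    per_node_caps holds an optional local ceiling per node (None = no
    local ceiling).  Returns how many groups each node joined before its
    next join would have failed.
    """
    joined = [0] * len(per_node_caps)
    active = [True] * len(per_node_caps)
    remaining = total_budget
    while any(active):
        for i, cap in enumerate(per_node_caps):
            if not active[i]:
                continue
            if remaining == 0 or (cap is not None and joined[i] >= cap):
                active[i] = False  # this node's next join errors out
                continue
            joined[i] += 1
            remaining -= 1
    return joined


# One node stuck at 5 groups, one unconstrained, assumed shared budget of 891.
mixed = simulate_shared_limit(891, [5, None])
# Two unconstrained nodes sharing the same assumed budget.
healthy = simulate_shared_limit(891, [None, None])
```

Under this model `mixed` comes out as `[5, 886]` and the entries of 
`healthy` sum to 891, matching the observations.  The model says nothing 
about why one node would end up with a ceiling of 5.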

On a separate 8 node machine 'thor' running Cisco's SM on a Topspin 
switch, I can join 14 groups.

mpirun -np 2 ./jointest   (one node)

On odin I can join 892 groups; on thor I can join 5 groups.

mpirun -np 2 ./jointest   (two nodes)

Odin was able to join 4 groups for the first 3 runs, then was able to 
join 14 groups repeatedly.  Thor is able to join 5 groups consistently.


None of these results seems to match any of the hardcoded limits 
people have mentioned to me.  I really need to figure out the cause of 
this strange behavior, as most cases severely limit the usability of IB 
multicast in MPI.  Is my test code correct?  Does anybody know what is 
causing this, or where I could look/test to try and nail it down?  I've 
gotten suggestions that the problem lies in the SM, though I haven't 
found anything blatantly wrong when reading relevant parts of the OpenSM 
code.

Andrew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: jointest.tar.gz
Type: application/x-gzip
Size: 5907 bytes
Desc: not available
URL: <http://lists.openfabrics.org/pipermail/general/attachments/20070814/685b6403/attachment.bin>

