[openib-general] mvapich2 pmi scalability problems
Don.Dhondt at Bull.com
Fri Jul 21 10:04:16 PDT 2006
We have been working with LLNL trying to debug a problem using SLURM as
our resource manager, MVAPICH2 as our MPI choice, and OFED 1.0 as our
InfiniBand stack. The MVAPICH2 version is mvapich2-0.9.3.
The problem arises when we try to scale a simple MPI job: we cannot go
much above 128 tasks before socket connections start timing out on the
PMI exchanges. Can anyone at OSU comment?
The PMI call counts we measured at each job size are below:
Processes   PMI_KVS_Put   PMI_KVS_Get   PMI_KVS_Commit   Num Procs ratio   Calls ratio
n32                1024          1248             1024                 1             1
n64                4096          4544             4096                 2             4
n96                9216          9888             9216                 3             9
n128              16384         17280            16384                 4            16
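
The counts are consistent with roughly one PMI_KVS_Put (with a matching
Commit and Get) per process pair: 32^2 = 1024, 64^2 = 4096, 128^2 = 16384.
The sketch below shows what such a per-peer startup exchange could look
like. It is illustrative only, not the actual MVAPICH2 0.9.3 source, and
make_qp_for()/connect_qp() are hypothetical stand-ins for the real
queue-pair setup.

/* Sketch of a per-peer PMI exchange (illustrative, not MVAPICH2 source).
 * Each rank publishes one KVS entry per peer, so a job of n processes
 * issues O(n^2) Puts/Commits/Gets in total, matching the table above. */
#include <stdio.h>
#include <pmi.h>

/* Hypothetical stand-ins for the real InfiniBand queue-pair setup. */
static int  make_qp_for(int peer)                { return 1000 + peer; }
static void connect_qp(int peer, const char *qp) { (void)peer; (void)qp; }

void exchange_endpoints(void)
{
    int spawned, rank, size;
    char kvs[256], key[64], val[64];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvs, sizeof(kvs));

    /* One Put per peer, n-1 per rank; the measured counts suggest a
     * Commit is issued after each Put as well. */
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        snprintf(key, sizeof(key), "qp-%d-%d", rank, peer);
        snprintf(val, sizeof(val), "%d", make_qp_for(peer));
        PMI_KVS_Put(kvs, key, val);
        PMI_KVS_Commit(kvs);
    }
    PMI_Barrier();

    /* Matching n-1 Gets per rank to read each peer's entry for us. */
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        snprintf(key, sizeof(key), "qp-%d-%d", peer, rank);
        PMI_KVS_Get(kvs, key, val, sizeof(val));
        connect_qp(peer, val);
    }
}

With this pattern the PMI server handles O(n^2) requests across the job,
which would explain socket timeouts appearing past 128 tasks.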
Comment from LLNL:
------------------
That is interesting! The ratio for MPICH2 is constant, so clearly
MVAPICH2 is doing something unusual (and unexpected, to me anyway).
What will MVAPICH2 do with really large parallel jobs? We regularly
run jobs with thousands to tens of thousands of tasks. If you have
located an MVAPICH2 expert, this would definitely be worth asking
about. Its use of PMI appears to be non-scalable.
----------------------
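
For contrast, here is a sketch of a constant-per-rank exchange, roughly
the "business card" pattern MPICH2 uses: one Put and one Commit per rank
regardless of job size. Again illustrative, reusing the hypothetical
helpers defined above.

/* Sketch of a constant-per-rank exchange (roughly the MPICH2 "business
 * card" pattern): one Put and one Commit per rank, so that part of the
 * PMI traffic grows linearly rather than quadratically. */
void exchange_business_cards(void)
{
    int spawned, rank, size;
    char kvs[256], key[64], val[64];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvs, sizeof(kvs));

    /* Exactly one entry per rank, whatever the job size. */
    snprintf(key, sizeof(key), "card-%d", rank);
    snprintf(val, sizeof(val), "%d", make_qp_for(rank));
    PMI_KVS_Put(kvs, key, val);
    PMI_KVS_Commit(kvs);
    PMI_Barrier();

    /* Gets can be done lazily, only for peers a rank actually
     * contacts, rather than eagerly for the whole job. */
    for (int peer = 0; peer < size; peer++) {
        if (peer == rank) continue;
        snprintf(key, sizeof(key), "card-%d", peer);
        PMI_KVS_Get(kvs, key, val, sizeof(val));
        connect_qp(peer, val);
    }
}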
Any help is appreciated.
Regards,
Don Dhondt
GCOS 8 Communications Solutions Project Manager
Bull HN Information Systems Inc.
13430 N. Black Canyon Hwy., Phoenix, AZ 85029
Work (602) 862-5245 Fax (602) 862-4290