[ofa-general] problem with mvapich2 over iwarp

Steve Wise swise at opengridcomputing.com
Tue Jun 12 09:51:03 PDT 2007


Pete Wyckoff wrote:
> swise at opengridcomputing.com wrote on Fri, 01 Jun 2007 11:00 -0500:
>> I'm helping a customer who is trying to run mvapich2 over Chelsio's 
>> RNIC.  They're running a simple program that does an MPI init, 1000 
>> barriers, then a finalize.  They're using ofed-1.2-rc3, mpiexec-0.82, 
>> and mvapich2-0.9.8-p2 (not the mvapich2 from the OFED kit).  Also, they 
>> aren't using MPD to start things up.  They're using PMI, I guess (I'm not 
>> sure what PMI is, but their mpiexec has -comm=pmi).  BTW: I can run the 
>> same program fine on my 8-node cluster using MPD and the OFA mvapich2 code.
> 
> Hey Steve.  The "customer" contacted me about helping with the
> mpiexec aspects of things, assuming we're talking about the same
> people.  mpiexec is just an alternative to the MPD startup program,
> but it uses the same PMI mechanisms under the hood.  It's also a
> much better way to launch parallel jobs, though I'm biased since I
> wrote it.  :)
> 
> The hang in rdma_destroy_id() that you describe: does it happen with
> both MPD and mpiexec startup?
> 
> I doubt that mpiexec itself is the issue, but I frequently tell
> people to try it with straight mpirun just to make sure.  The PMI
> protocol under the hood is just a way for processes to exchange
> data; mpiexec doesn't know anything about MPI itself or iWARP, it
> just moves the information around.  So we generally don't see any
> problems with starting up MPICH2 programs on all sorts of weird
> hardware.
> 
> Happy to help if you have any more information.  I've asked them
> to send me debug logs of the MPD and mpiexec startups, but I don't
> have an account on their machine yet.
> 
> 		-- Pete

Thanks Pete.

I've been out of town until today.  I think they have it working.  I 
believe the bug they saw was in an older version of mvapich2, one that 
Sundeep fixed a while back.  After rebuilding and reinstalling, they 
don't seem to hit it anymore.  The symptoms definitely matched that 
earlier bug.
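
For reference, the test they were running is essentially just init, 1000
barriers, finalize.  Here's a minimal sketch of it (my reconstruction,
not their actual source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1000 barriers, per the original report */
    for (i = 0; i < 1000; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("done\n");

    MPI_Finalize();
    return 0;
}

Build with mpicc and launch however you normally do; their launch was
along the lines of "mpiexec -comm=pmi -n <nprocs> ./a.out" (the
-comm=pmi flag is straight from their setup).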

Anyway, thanks for helping and for explaining mpiexec.  I'll holler if 
anything else comes up.
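
While it's fresh, a note for my own records: the PMI exchange you
describe looks, from the process side, roughly like each rank putting
its connection info into a shared key-value space, committing, hitting
a barrier, then reading everyone else's entries.  A rough sketch against
the PMI-1 API (the key names and payload are made up for illustration;
mvapich2's real keys differ, and this links against whatever libpmi the
launcher provides):

#include <stdio.h>
#include <pmi.h>

int main(void)
{
    int spawned, rank, size, i;
    char kvsname[256], key[64], val[256];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

    /* publish a stand-in for this rank's connection info */
    snprintf(key, sizeof(key), "addr-%d", rank);
    snprintf(val, sizeof(val), "conn-info-for-rank-%d", rank);
    PMI_KVS_Put(kvsname, key, val);
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();

    /* read what every other rank published */
    for (i = 0; i < size; i++) {
        snprintf(key, sizeof(key), "addr-%d", i);
        PMI_KVS_Get(kvsname, key, val, sizeof(val));
        if (rank == 0)
            printf("rank %d published: %s\n", i, val);
    }

    PMI_Finalize();
    return 0;
}

Which, as you say, is just data movement; nothing in there knows
anything about iWARP or the RNIC.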

Steve.
