[ofa-general] problem with mvapich2 over iwarp

Steve Wise swise at opengridcomputing.com
Tue Jun 12 10:21:45 PDT 2007


Steve Wise wrote:
> Pete Wyckoff wrote:
>> swise at opengridcomputing.com wrote on Fri, 01 Jun 2007 11:00 -0500:
>>> I'm helping a customer who is trying to run mvapich2 over chelsio's 
>>> rnic.  They're running a simple program that does an mpi init, 1000 
>>> barriers, then a finalize.  They're using ofed-1.2-rc3, mpiexec-0.82, 
>>> and mvapich2-0.9.8-p2 (not the mvapich2 from the ofed kit).  Also, 
>>> they aren't using mpd to start things up.  They're using pmi, I guess 
>>> (I'm not sure what pmi is, but their mpiexec has -comm=pmi).  BTW: I 
>>> can run the same program fine on my 8-node cluster using mpd and the 
>>> ofa mvapich2 code.
>>
>> Hey Steve.  The "customer" contacted me about helping with the
>> mpiexec aspects of things, assuming we're talking about the same
>> people.  It's just an alternative to the MPD startup program, but
>> uses the same PMI mechanisms under the hood as does MPD.  And it's a
>> much better way to launch parallel jobs, but I'm biased since I
>> wrote it.  :)
>>
>> The hang in rdma_destroy_id() that you describe, does it happen with
>> both mpd and mpiexec startup?
>>
>> I doubt that using mpiexec would matter, but I frequently tell
>> people to try it with straight mpirun just to make sure.  The PMI
>> protocol under the hood is just a way for processes to exchange
>> data: mpiexec doesn't know anything about MPI itself or iwarp; it
>> just moves the information around.  So we generally don't see any
>> problems starting up mpich2 programs on all sorts of weird
>> hardware.
>>
>> Happy to help if you have any more information.  I've asked them
>> to send me debug logs of the mpd and mpiexec startups, but I don't
>> have an account on their machine yet.
>>
>>         -- Pete
> 
> Thanks Pete.
> 
> I've been out of town until today.  I think they have it working.  I 
> believe the bug they saw was in an older version of mvapich2 that 
> Sundeep fixed a while back.  After rebuilding and re-installing, they 
> don't seem to hit it anymore.  The symptoms definitely seemed like the 
> previous bug he fixed.
> 
> Anyway, thanks for helping and explaining mpiexec.  I'll holler if 
> anything else comes up.
> 
> Steve.

Ignore that last reply.  I hadn't caught up on my email on this issue, 
and I think there may still be problems with all of this.
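
For reference, the barrier test they're running boils down to something 
like the program below.  This is just my sketch from the description 
above (init, 1000 barriers, finalize); I don't have their actual source.

/*
 * Minimal reconstruction of the reported test case: MPI_Init,
 * 1000 MPI_Barrier calls, then MPI_Finalize.  Not their exact code.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* the reported hang shows up during/after this barrier loop */
    for (i = 0; i < 1000; i++)
        MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("completed 1000 barriers\n");

    MPI_Finalize();
    return 0;
}

They launch it with mpiexec -comm=pmi rather than through mpd, which is 
where the startup question came in.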

Steve.


