[ofa-general] problem with mvapich2 over iwarp
Steve Wise
swise at opengridcomputing.com
Tue Jun 12 10:21:45 PDT 2007
Steve Wise wrote:
> Pete Wyckoff wrote:
>> swise at opengridcomputing.com wrote on Fri, 01 Jun 2007 11:00 -0500:
>>> I'm helping a customer who is trying to run mvapich2 over Chelsio's
>>> RNIC. They're running a simple program that does an MPI init, 1000
>>> barriers, then a finalize. They're using ofed-1.2-rc3, mpiexec-0.82,
>>> and mvapich2-0.9.8-p2 (not the mvapich2 from the OFED kit). Also,
>>> they aren't using mpd to start things up; they're using PMI, I guess
>>> (I'm not sure what PMI is, but their mpiexec has -comm=pmi). BTW, I
>>> can run the same program fine on my 8-node cluster using mpd and the
>>> OFA mvapich2 code.
>>
>> Hey Steve. The "customer" contacted me about helping with the
>> mpiexec side of things, assuming we're talking about the same
>> people. mpiexec is just an alternative to the MPD startup program,
>> but it uses the same PMI mechanisms under the hood as MPD does.
>> It's also a much better way to launch parallel jobs, though I'm
>> biased since I wrote it. :)
>>
>> Does the hang in rdma_destroy_id() that you describe happen with
>> both MPD and mpiexec startup?
>>
>> I doubt that mpiexec is the issue, but I frequently tell people to
>> try it with straight mpirun just to make sure. The PMI protocol
>> under the hood is just a way for processes to exchange data;
>> mpiexec doesn't know anything about MPI itself or iWARP, it just
>> moves the information around. So we generally don't see any
>> problems starting up MPICH2 programs on all sorts of weird
>> hardware.
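(To make that "exchange data" point concrete, here is a rough sketch of how
an MPI library might use the PMI-1 interface, as shipped with MPICH2/mvapich2,
to publish and look up per-process connection information. The key and value
names are made up for illustration; only the PMI_* calls themselves come from
the PMI API.)

    #include <stdio.h>
    #include <string.h>
    #include "pmi.h"   /* PMI-1 interface from MPICH2/mvapich2 */

    int main(void)
    {
        int spawned, rank, size;
        char kvsname[256], key[64], val[256];

        PMI_Init(&spawned);               /* launched by MPD or mpiexec */
        PMI_Get_rank(&rank);
        PMI_Get_size(&size);
        PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));

        /* publish this rank's (hypothetical) connection address */
        snprintf(key, sizeof(key), "addr-%d", rank);
        snprintf(val, sizeof(val), "qp-info-for-rank-%d", rank);
        PMI_KVS_Put(kvsname, key, val);
        PMI_KVS_Commit(kvsname);
        PMI_Barrier();                    /* everyone has published */

        /* look up a peer's address, e.g. the next rank's */
        snprintf(key, sizeof(key), "addr-%d", (rank + 1) % size);
        PMI_KVS_Get(kvsname, key, val, sizeof(val));
        printf("rank %d sees peer info: %s\n", rank, val);

        PMI_Finalize();
        return 0;
    }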
>>
>> I'm happy to help if you have any more information. I've asked them
>> to send me debug logs of the MPD and mpiexec startups, but I don't
>> have an account on their machine yet.
>>
>> -- Pete
>
> Thanks, Pete.
>
> I was out of town until today. I think they have it working now. I
> believe the bug they saw was in an older version of mvapich2 that
> Sundeep fixed a while back. After rebuilding and reinstalling, they
> don't seem to hit it anymore. The symptoms certainly looked like the
> bug he fixed previously.
>
> Anyway, thanks for the help and for explaining mpiexec. I'll holler
> if anything else comes up.
>
> Steve.
Ignore this last reply. I hadn't caught up on my email about this issue,
and I think there may still be problems with all of this.
Steve.