[ofa-general] Re: problem with mvapich2 over iwarp

Sundeep Narravula narravul at cse.ohio-state.edu
Fri Jun 1 10:05:32 PDT 2007


Steve,
  We have not seen this hang before and are not sure what is happening at
this point. I will look through the code for this behavior.

btw, mvapich2-0.9.8-p2 and the ofa mvapich2 code are identical at this
point.

  --Sundeep.

On Fri, 1 Jun 2007, Steve Wise wrote:

> Sundeep/Sean,
>
> I'm helping a customer who is trying to run mvapich2 over chelsio's
> rnic.  They're running a simple program that does an mpi init, 1000
> barriers, then a finalize.  They're using ofed-1.2-rc3, mpiexec-0.82,
> and mvapich2-0.9.8-p2 (not the mvapich2 from the ofed kit).  Also they
> aren't using mpd to start up stuff.  They're using pmi, I guess (I'm not
> sure what pmi is, but the mpiexec has -comm=pmi).  BTW: I can run the
> same program fine on my 8 node cluster using mpd and the ofa mvapich2 code.
>
> On their cluster a 4 node/4 process job hangs in finalize almost always.
> When it hangs, one process is always stuck in rdma_destroy_id().
>
> Here's the stack:
>
> (gdb) bt
> #0 0x0000003c7cf0ae2b in __lll_mutex_lock_wait () from /lib64/tls/libpthread.so.0
> #1 0x000000000068db20 in ?? ()
> #2 0x0000000060040a0a in ?? ()
> #3 0x0000003c7cf08800 in pthread_cond_destroy@@GLIBC_2.3.2 () from /lib64/tls/libpthread.so.0
> #4 0x0000002a9579a09c in ucma_destroy_kern_id (fd=0, handle=6871424) at src/cma.c:403
> #5 0x0000002a9579a163 in rdma_destroy_id (id=0x68d980) at src/cma.c:425
> #6 0x0000000000423ef9 in ib_finalize_rdma_cm ()
> #7 0x00000000004183f6 in MPIDI_CH3I_CM_Finalize ()
> #8 0x000000000044b03b in MPIDI_CH3_Finalize ()
> #9 0x000000000043169e in MPID_Finalize ()
> #10 0x000000000040c3ef in PMPI_Finalize ()
> #11 0x0000000000403af4 in main ()
> (gdb)
>
> I'm not sure I believe this stack trace fully, because
> ucma_destroy_kern_id() doesn't call pthread_cond_destroy().  However,
> rdma_destroy_id() does.  So I'm thinking that ucma_destroy_kern_id() has
> already been executed and rdma_destroy_id() is freeing the cm_id, and we
> get stuck in pthread_cond_destroy() destroying the pthread condition object.
>
> I'm wondering if y'all have ever seen this kind of hang?  I can kill the
> process and it exits, so I don't think we're stuck down in the
> kernel IWCM or anything.
>
> Any thoughts?
>
> Thanks,
>
> Steve.
>
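
For anyone reading the trace above: POSIX leaves it undefined to destroy a
condition variable while other threads are still blocked on it, and some
implementations may block inside pthread_cond_destroy() in that case. The
sketch below is only an illustration of a teardown order that avoids this
(wake the waiter, join it, then destroy). The names demo_id, demo_waiter and
demo_teardown are made up for this example; this is not the librdmacm code
and it is not a diagnosis of the hang reported here.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical object for illustration; not a librdmacm structure. */
struct demo_id {
    pthread_mutex_t mut;
    pthread_cond_t  cond;
    bool            done;   /* predicate the waiter sleeps on */
};

/* Waiter thread: blocks on the condition variable until done is set. */
static void *demo_waiter(void *arg)
{
    struct demo_id *id = arg;

    pthread_mutex_lock(&id->mut);
    while (!id->done)
        pthread_cond_wait(&id->cond, &id->mut);
    pthread_mutex_unlock(&id->mut);
    return NULL;
}

/* Returns 0 if setup, wake-up, join and destroy all succeed, else -1. */
int demo_teardown(void)
{
    struct demo_id id;
    pthread_t thread;
    int ret;

    id.done = false;
    if (pthread_mutex_init(&id.mut, NULL))
        return -1;
    if (pthread_cond_init(&id.cond, NULL)) {
        pthread_mutex_destroy(&id.mut);
        return -1;
    }
    if (pthread_create(&thread, NULL, demo_waiter, &id)) {
        pthread_cond_destroy(&id.cond);
        pthread_mutex_destroy(&id.mut);
        return -1;
    }

    /* 1. Set the predicate and wake every waiter. */
    pthread_mutex_lock(&id.mut);
    id.done = true;
    pthread_cond_broadcast(&id.cond);
    pthread_mutex_unlock(&id.mut);

    /* 2. Wait until the waiter has actually left pthread_cond_wait(). */
    ret = pthread_join(thread, NULL);

    /* 3. Only now is it safe to destroy the condition variable and mutex. */
    if (pthread_cond_destroy(&id.cond))
        ret = -1;
    if (pthread_mutex_destroy(&id.mut))
        ret = -1;
    return ret ? -1 : 0;
}
```

Calling demo_teardown() from a main() and checking for a 0 return is enough
to try it; depending on the toolchain it may need to be built with -pthread.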




