[ofa-general] problem with mvapich2 over iwarp
Steve Wise
swise at opengridcomputing.com
Fri Jun 1 09:00:46 PDT 2007
Sundeep/Sean,
I'm helping a customer who is trying to run mvapich2 over chelsio's
rnic. They're running a simple program that does an mpi init, 1000
barriers, then a finalize. They're using ofed-1.2-rc3, mpiexec-0.82,
and mvapich2-0.9.8-p2 (not the mvapich2 from the ofed kit). Also they
aren't using mpd to start things up; they're using pmi, I guess (I'm
not sure what pmi is, but their mpiexec is run with -comm=pmi). BTW: I
can run the same program fine on my 8-node cluster using mpd and the
ofa mvapich2 code.
On their cluster a 4 node/4 process job hangs in finalize almost always.
When it hangs, one process is always stuck in rdma_destroy_id().
Here's the stack:
(gdb) bt
#0 0x0000003c7cf0ae2b in __lll_mutex_lock_wait () from
/lib64/tls/libpthread.so.0
#1 0x000000000068db20 in ?? ()
#2 0x0000000060040a0a in ?? ()
#3 0x0000003c7cf08800 in pthread_cond_destroy@@GLIBC_2.3.2 () from
/lib64/tls/libpthread.so.0
#4 0x0000002a9579a09c in ucma_destroy_kern_id (fd=0, handle=6871424) at
src/cma.c:403
#5 0x0000002a9579a163 in rdma_destroy_id (id=0x68d980) at src/cma.c:425
#6 0x0000000000423ef9 in ib_finalize_rdma_cm ()
#7 0x00000000004183f6 in MPIDI_CH3I_CM_Finalize ()
#8 0x000000000044b03b in MPIDI_CH3_Finalize ()
#9 0x000000000043169e in MPID_Finalize ()
#10 0x000000000040c3ef in PMPI_Finalize ()
#11 0x0000000000403af4 in main ()
(gdb)
I'm not sure I believe this stack trace fully, because
ucma_destroy_kern_id() doesn't call pthread_cond_destroy(). However,
rdma_destroy_id() does. So I'm thinking that ucma_destroy_kern_id() has
already returned, rdma_destroy_id() is tearing down the cm_id, and we
get stuck in pthread_cond_destroy() while destroying the cm_id's
pthread condition object.
I'm wondering if y'all have ever seen this kind of hang? I can kill the
process and it exits, so I don't think we're stuck down in the
kernel IWCM or anything.
Any thoughts?
Thanks,
Steve.