[ofa-general] RE: problem with mvapich2 over iwarp

Fri Jun 1 09:17:23 PDT 2007

>(gdb) bt
>#0 0x0000003c7cf0ae2b in __lll_mutex_lock_wait () from
>/lib64/tls/libpthread.so.0
>#1 0x000000000068db20 in ?? ()
>#2 0x0000000060040a0a in ?? ()
>#3 0x0000003c7cf08800 in pthread_cond_destroy@@GLIBC_2.3.2 () from
>/lib64/tls/libpthread.so.0
>#4 0x0000002a9579a09c in ucma_destroy_kern_id (fd=0, handle=6871424) at
>src/cma.c:403
>#5 0x0000002a9579a163 in rdma_destroy_id (id=0x68d980) at src/cma.c:425
>#6 0x0000000000423ef9 in ib_finalize_rdma_cm ()
>#7 0x00000000004183f6 in MPIDI_CH3I_CM_Finalize ()
>#8 0x000000000044b03b in MPIDI_CH3_Finalize ()
>#9 0x000000000043169e in MPID_Finalize ()
>#10 0x000000000040c3ef in PMPI_Finalize ()
>#11 0x0000000000403af4 in main ()
>(gdb)
>
>I'm not sure I fully believe this stack trace, because
>ucma_destroy_kern_id() doesn't call pthread_cond_destroy().  However,
>rdma_destroy_id() does.  So I'm thinking that ucma_destroy_kern_id() has
>already been executed, rdma_destroy_id() is freeing the cm_id, and we
>get stuck in pthread_cond_destroy() destroying the pthread condition object.
>
>I'm wondering if y'all have ever seen this kind of hang?  I can kill the
>process and it exits, so I don't think we're stuck down in the
>kernel IWCM or anything.
>
>Any thoughts?

I haven't seen any hangs like this, but I will perform a code inspection to see
if any issues can be found.
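
One case that may be worth ruling out on the application side is the same id
being destroyed twice, since destroying a condition variable that lives in
already-released memory is undefined and could plausibly show up as a stuck
lock. This is only a guess, not a diagnosis. A minimal sketch of a guard
follows; every name in it (struct demo_id, DEMO_ID_MAGIC, demo_id_init,
demo_id_destroy) is hypothetical and is not part of librdmacm or mvapich2.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical marker value; any nonzero constant would do. */
#define DEMO_ID_MAGIC 0x434d4944u

/* Hypothetical stand-in for an object that owns a mutex and a condition. */
struct demo_id {
    unsigned int    magic;  /* DEMO_ID_MAGIC while the object is live */
    pthread_mutex_t mut;
    pthread_cond_t  cond;
};

/* Initialize the sync objects; returns 0 or a pthread error code. */
int demo_id_init(struct demo_id *id)
{
    int ret;

    assert(id != NULL);
    ret = pthread_mutex_init(&id->mut, NULL);
    if (ret)
        return ret;
    ret = pthread_cond_init(&id->cond, NULL);
    if (ret) {
        pthread_mutex_destroy(&id->mut);
        return ret;
    }
    id->magic = DEMO_ID_MAGIC;
    return 0;
}

/*
 * Destroy the sync objects exactly once. A second call on the same
 * (still allocated) object returns EINVAL instead of touching the
 * already destroyed condition variable. The magic check is not itself
 * thread-safe; it only catches sequential double destroys.
 */
int demo_id_destroy(struct demo_id *id)
{
    int ret;

    assert(id != NULL);
    if (id->magic != DEMO_ID_MAGIC)
        return EINVAL;
    id->magic = 0;

    ret = pthread_cond_destroy(&id->cond);
    if (ret)
        return ret;
    return pthread_mutex_destroy(&id->mut);
}
```

The caller owns the memory in this sketch, so the guard only helps if the
struct is still allocated when the second destroy arrives; it would not catch
a destroy on memory that has already been freed and reused.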

- Sean