[ofa-general] Re: problem with mvapich2 over iwarp

Fri Jun 1 10:29:49 PDT 2007

Sean Hefty wrote:
>> (gdb) bt
>> #0 0x0000003c7cf0ae2b in __lll_mutex_lock_wait () from
>> /lib64/tls/libpthread.so.0
>> #1 0x000000000068db20 in ?? ()
>> #2 0x0000000060040a0a in ?? ()
>> #3 0x0000003c7cf08800 in pthread_cond_destroy@@GLIBC_2.3.2 () from
>> /lib64/tls/libpthread.so.0
>> #4 0x0000002a9579a09c in ucma_destroy_kern_id (fd=0, handle=6871424) at
>> src/cma.c:403
>> #5 0x0000002a9579a163 in rdma_destroy_id (id=0x68d980) at src/cma.c:425
>> #6 0x0000000000423ef9 in ib_finalize_rdma_cm ()
>> #7 0x00000000004183f6 in MPIDI_CH3I_CM_Finalize ()
>> #8 0x000000000044b03b in MPIDI_CH3_Finalize ()
>> #9 0x000000000043169e in MPID_Finalize ()
>> #10 0x000000000040c3ef in PMPI_Finalize ()
>> #11 0x0000000000403af4 in main ()
>> (gdb)
>>
>> I'm not sure I believe this stack trace fully, because
>> ucma_destroy_kern_id() doesn't call pthread_cond_destroy().  However
>> rdma_destroy_id() does.  So I'm thinking that ucma_destroy_kern_id() has
>> already returned and rdma_destroy_id() is freeing the cm_id, and we
>> get stuck in pthread_cond_destroy() destroying the pthread condition object.
>>
>> I'm wondering if y'all have ever seen this kind of hang?  I can kill the
>> process and it exits, so I don't think we're stuck down in the
>> kernel IWCM or anything.
>>
>> Any thoughts?
> 
> I haven't seen any hangs like this, but I will perform a code inspection to see
> if any issues can be found.
> 
> - Sean

Thanks,

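Perhaps something is destroying the cond object twice.  Calling
pthread_cond_destroy() on an already-destroyed condition variable is
undefined behavior, so a hang is one possible outcome...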
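
One way to check for a double destroy, as a minimal sketch: wrap the
mutex/cond pair in an object that remembers whether it has been torn down.
The struct guarded_id and its two functions below are hypothetical names
made up for illustration; they are not the librdmacm structures or API.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cm_id-like object (not the librdmacm struct). */
struct guarded_id {
    pthread_mutex_t mut;
    pthread_cond_t  cond;
    bool            destroyed;   /* set once teardown has run */
};

/* Initialize the mutex and condition variable; returns 0 or an errno value. */
int guarded_id_init(struct guarded_id *id)
{
    int ret;

    assert(id != NULL);
    ret = pthread_mutex_init(&id->mut, NULL);
    if (ret)
        return ret;
    ret = pthread_cond_init(&id->cond, NULL);
    if (ret) {
        pthread_mutex_destroy(&id->mut);
        return ret;
    }
    id->destroyed = false;
    return 0;
}

/*
 * Tear the object down at most once. A second call returns EALREADY
 * instead of calling pthread_cond_destroy() again, which would be
 * undefined behavior. Assumes teardown runs on a single thread.
 */
int guarded_id_destroy(struct guarded_id *id)
{
    assert(id != NULL);
    if (id->destroyed)
        return EALREADY;
    id->destroyed = true;
    pthread_cond_destroy(&id->cond);
    pthread_mutex_destroy(&id->mut);
    return 0;
}
```

Calling guarded_id_destroy() twice on the same object returns 0 the first
time and EALREADY the second.  The flag is not thread safe, and it only
helps while the memory is still valid; if the object is freed after the
first destroy, the caller would have to clear its pointer instead.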