[openib-general] rdma cm process hang

Wed Aug 2 07:53:25 PDT 2006

This smells like an amso or iwcm problem.

On Tue, 2006-08-01 at 17:34 -0400, Pete Wyckoff wrote:
> Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date
> x86_64 FC4 SMP with Ammasso cards, I can hang the client side during
> RDMA CM connection setup.
> 
> The scenario is:
> 
>     start server side process on some other node
>     start client process
>     have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives,
> 	but before calling rdma_accept
>     hit ctrl-C on client
> 
> The last bits of the console log (from c2 debug) are:
> 
>     c2: c2_create_qp:248
>     c2: c2_query_pkey:110
>     c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT
>     c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT
>     c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1
>     c2: c2_connect:598
>     c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2
> 
> The process is in S state before the ctrl-C; here's a traceback
> (waiting in rdma_get_cm_event):
> 
> ardma-rdmacm  S ffff81003d0f9e68     0  2914   2842                     (NOTLB)
> ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 
>        ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 
>        0000000000000000 0000000000000000 
> Call Trace: <ffffffff8814559c>{:rdma_ucm:ucma_get_event+268}
>        <ffffffff80241e10>{autoremove_wake_function+0} <ffffffff881450ef>{:rdma_ucm:ucma_write+111}
>        <ffffffff80277aed>{vfs_write+189} <ffffffff80278543>{sys_write+83}
>        <ffffffff80209cbe>{system_call+126}
> 
> Then after ctrl-C, one more console log entry:
> 
>     c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1
> 
> and now the process is unkillable (but the node does not oops):
> 
> ardma-rdmacm  D ffff81003d0f9bf8     0  2914   2842                     (L-TLB)
> ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 
>        ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 
>        ffff81003ec3d140 ffff81003dc85800 
> Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
>        <ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
>        <ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}
>        <ffffffff8810f79a>{:ib_uverbs:ib_uverbs_close+410} <ffffffff80278b32>{__fput+178}
>        <ffffffff80275fb8>{filp_close+104} <ffffffff8022e7ba>{put_files_struct+122}
>        <ffffffff8022fc14>{do_exit+596} <ffffffff80238f4f>{__dequeue_signal+495}
>        <ffffffff80230368>{do_group_exit+216} <ffffffff8023a378>{get_signal_to_deliver+1192}
>        <ffffffff802093d1>{do_signal+129} <ffffffff88145685>{:rdma_ucm:ucma_get_event+501}
>        <ffffffff881450ef>{:rdma_ucm:ucma_write+111} <ffffffff80277aed>{vfs_write+189}
>        <ffffffff80209d47>{sysret_signal+28} <ffffffff80209fcb>{ptregscall_common+103}
> 
> Once I figure out the bug in the server-side code I will hopefully
> not have this problem anymore.  But I thought you'd like to see it.
> 
> 		-- Pete
> 