[openib-general] rdma cm process hang
Steve Wise
swise at opengridcomputing.com
Wed Aug 2 07:53:25 PDT 2006
This smells like an amso or iwcm problem.
On Tue, 2006-08-01 at 17:34 -0400, Pete Wyckoff wrote:
> Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date
> x86_64 FC4 SMP with Ammasso cards, I can hang the client side during
> RDMA CM connection setup.
>
> The scenario is:
>
> start server side process on some other node
> start client process
> have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives,
> but before calling rdma_accept
> hit ctrl-C on client
>
> The last bits of the console log (from c2 debug) are:
>
> c2: c2_create_qp:248
> c2: c2_query_pkey:110
> c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT
> c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT
> c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1
> c2: c2_connect:598
> c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2
>
> The process is in S state before the ctrl-c, here's a traceback
> (waiting in rdma_get_cm_event):
>
> ardma-rdmacm S ffff81003d0f9e68 0 2914 2842 (NOTLB)
> ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4
> ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000
> 0000000000000000 0000000000000000
> Call Trace: <ffffffff8814559c>{:rdma_ucm:ucma_get_event+268}
> <ffffffff80241e10>{autoremove_wake_function+0} <ffffffff881450ef>{:rdma_ucm:ucma_write+111}
> <ffffffff80277aed>{vfs_write+189} <ffffffff80278543>{sys_write+83}
> <ffffffff80209cbe>{system_call+126}
>
> Then after ctrl-C, one more console log entry:
>
> c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1
>
> and now the process is unkillable (but the node does not oops):
>
> ardma-rdmacm D ffff81003d0f9bf8 0 2914 2842 (L-TLB)
> ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7
> ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130
> ffff81003ec3d140 ffff81003dc85800
> Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
> <ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
> <ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}
> <ffffffff8810f79a>{:ib_uverbs:ib_uverbs_close+410} <ffffffff80278b32>{__fput+178}
> <ffffffff80275fb8>{filp_close+104} <ffffffff8022e7ba>{put_files_struct+122}
> <ffffffff8022fc14>{do_exit+596} <ffffffff80238f4f>{__dequeue_signal+495}
> <ffffffff80230368>{do_group_exit+216} <ffffffff8023a378>{get_signal_to_deliver+1192}
> <ffffffff802093d1>{do_signal+129} <ffffffff88145685>{:rdma_ucm:ucma_get_event+501}
> <ffffffff881450ef>{:rdma_ucm:ucma_write+111} <ffffffff80277aed>{vfs_write+189}
> <ffffffff80209d47>{sysret_signal+28} <ffffffff80209fcb>{ptregscall_common+103}
>
> Once I figure out the bug in the server side code I will hopefully
> not have this problem anymore. But thought you'd like to see it.
>
> -- Pete