[openib-general] rdma cm process hang
Pete Wyckoff
pw at osc.edu
Tue Aug 1 14:34:16 PDT 2006
Using the iwarp branch at r8688, with linux-2.6.17.7 on an up-to-date
x86_64 FC4 SMP machine with Ammasso cards, I can hang the client side
during RDMA CM connection setup.
The scenario is:
    1. start server side process on some other node
    2. start client process
    3. have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives,
       but before calling rdma_accept
    4. hit ctrl-C on client
The last bits of the console log (from c2 debug) are:
c2: c2_create_qp:248
c2: c2_query_pkey:110
c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT
c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT
c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1
c2: c2_connect:598
c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2
The process is in S state before the ctrl-C. Here is a traceback
(waiting in rdma_get_cm_event):
ardma-rdmacm S ffff81003d0f9e68 0 2914 2842 (NOTLB)
ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4
ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000
0000000000000000 0000000000000000
Call Trace: <ffffffff8814559c>{:rdma_ucm:ucma_get_event+268}
<ffffffff80241e10>{autoremove_wake_function+0} <ffffffff881450ef>{:rdma_ucm:ucma_write+111}
<ffffffff80277aed>{vfs_write+189} <ffffffff80278543>{sys_write+83}
<ffffffff80209cbe>{system_call+126}
Then after ctrl-C, one more console log entry:
c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1
and now the process is unkillable (but the node does not oops):
ardma-rdmacm D ffff81003d0f9bf8 0 2914 2842 (L-TLB)
ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7
ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130
ffff81003ec3d140 ffff81003dc85800
Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
<ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
<ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}
<ffffffff8810f79a>{:ib_uverbs:ib_uverbs_close+410} <ffffffff80278b32>{__fput+178}
<ffffffff80275fb8>{filp_close+104} <ffffffff8022e7ba>{put_files_struct+122}
<ffffffff8022fc14>{do_exit+596} <ffffffff80238f4f>{__dequeue_signal+495}
<ffffffff80230368>{do_group_exit+216} <ffffffff8023a378>{get_signal_to_deliver+1192}
<ffffffff802093d1>{do_signal+129} <ffffffff88145685>{:rdma_ucm:ucma_get_event+501}
<ffffffff881450ef>{:rdma_ucm:ucma_write+111} <ffffffff80277aed>{vfs_write+189}
<ffffffff80209d47>{sysret_signal+28} <ffffffff80209fcb>{ptregscall_common+103}
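For anyone reading the two traces: each entry has the form
<address>{:module:symbol+offset}, with the :module: part omitted for
symbols in the core kernel. Here is a minimal sketch in Python, not part
of the original report, that splits a trace into frames. The names
parse_trace and ENTRY and the "kernel" label are made up for
illustration, and the code assumes the offsets are printed in decimal.

```python
import re

# Matches entries such as <ffffffff88104073>{:iw_c2:c2_free_qp+355}
# and <ffffffff80232646>{on_each_cpu+38} (no module prefix).
ENTRY = re.compile(r"<([0-9a-f]+)>\{(?::(\w+):)?(\w+)\+(\d+)\}")

def parse_trace(text):
    """Return (address, module, symbol, offset) tuples in trace order."""
    frames = []
    for addr, module, symbol, offset in ENTRY.findall(text):
        # findall yields '' for the module group when the prefix is absent;
        # "kernel" is a hypothetical label for core-kernel symbols.
        frames.append((int(addr, 16), module or "kernel", symbol, int(offset)))
    return frames

# First three lines of the D-state trace from this report.
sample = """Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
<ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
<ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}"""

frames = parse_trace(sample)
for addr, module, symbol, offset in frames:
    print(f"{module:>8}  {symbol}+{offset}")
```

Grouping the frames by module this way makes it easier to see that the
D-state trace enters iw_c2 through ib_uverbs_close and ib_destroy_qp and
is inside c2_free_qp when the process stops responding.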
Once I figure out the bug in the server side code, I will hopefully
not have this problem anymore. But I thought you'd like to see it.
-- Pete