[openib-general] rdma cm process hang

Tue Aug 1 14:34:16 PDT 2006

Using the iwarp branch at r8688, with linux-2.6.17.7 on an up-to-date
x86_64 FC4 SMP system with Ammasso cards, I can hang the client side
during RDMA CM connection setup.

The scenario is:

    start server side process on some other node
    start client process
    have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives,
        but before calling rdma_accept
    hit ctrl-C on client

The last bits of the console log (from c2 debug) are:

    c2: c2_create_qp:248
    c2: c2_query_pkey:110
    c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT
    c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT
    c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1
    c2: c2_connect:598
    c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2

The process is in S state before the ctrl-C.  Here's a traceback
(waiting in rdma_get_cm_event):

ardma-rdmacm  S ffff81003d0f9e68     0  2914   2842                     (NOTLB)
ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 
       ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 
       0000000000000000 0000000000000000 
Call Trace: <ffffffff8814559c>{:rdma_ucm:ucma_get_event+268}
       <ffffffff80241e10>{autoremove_wake_function+0} <ffffffff881450ef>{:rdma_ucm:ucma_write+111}
       <ffffffff80277aed>{vfs_write+189} <ffffffff80278543>{sys_write+83}
       <ffffffff80209cbe>{system_call+126}

Then after ctrl-C, one more console log entry:

    c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1

and now the process is unkillable (but the node does not oops):

ardma-rdmacm  D ffff81003d0f9bf8     0  2914   2842                     (L-TLB)
ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 
       ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 
       ffff81003ec3d140 ffff81003dc85800 
Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
       <ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
       <ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}
       <ffffffff8810f79a>{:ib_uverbs:ib_uverbs_close+410} <ffffffff80278b32>{__fput+178}
       <ffffffff80275fb8>{filp_close+104} <ffffffff8022e7ba>{put_files_struct+122}
       <ffffffff8022fc14>{do_exit+596} <ffffffff80238f4f>{__dequeue_signal+495}
       <ffffffff80230368>{do_group_exit+216} <ffffffff8023a378>{get_signal_to_deliver+1192}
       <ffffffff802093d1>{do_signal+129} <ffffffff88145685>{:rdma_ucm:ucma_get_event+501}
       <ffffffff881450ef>{:rdma_ucm:ucma_write+111} <ffffffff80277aed>{vfs_write+189}
       <ffffffff80209d47>{sysret_signal+28} <ffffffff80209fcb>{ptregscall_common+103}
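
The S and D in the second column of the two tracebacks are the usual
task state letters: S is interruptible sleep, D is uninterruptible
sleep, which is why further signals have no effect. The same letter can
be read from userspace without a sysrq dump. What follows is only a
minimal sketch, assuming a Linux /proc filesystem; the function names
proc_stat_state and proc_state_of are made up for illustration and are
not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Return the state letter (R, S, D, Z, T, ...) from one line in the
 * format of /proc/<pid>/stat, or '?' if the line cannot be parsed.
 * The command name is wrapped in parentheses and may itself contain
 * spaces or ')', so look for the last ')' instead of splitting on
 * whitespace. */
char proc_stat_state(const char *line)
{
    const char *p = strrchr(line, ')');

    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/* Read /proc/<pid>/stat and return the state letter, or '?' on error
 * (for example if the process has already gone away). */
char proc_state_of(long pid)
{
    char path[64], line[1024];
    char state = '?';
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/stat", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return '?';
    if (fgets(line, sizeof(line), f) != NULL)
        state = proc_stat_state(line);
    fclose(f);
    return state;
}
```

For a client stuck as in the second traceback, calling proc_state_of
with its pid should report D for as long as the hang lasts.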

Once I figure out the bug in the server side code I will hopefully
not have this problem anymore, but I thought you'd like to see it.

		-- Pete