[openib-general] rdma cm process hang

Steve Wise swise at opengridcomputing.com
Wed Aug 2 08:09:39 PDT 2006


This hang is due to 2 things:

1) the amso card will _never_ time out a connection that is awaiting an
MPA reply.  That is exactly what is happening here.  The fix for this
(timing out stalled MPA connection setups) is a firmware fix and we don't
have the firmware src.

2) the IWCM holds a reference on the QP until connection setup either
succeeds or fails.  So that's where we get the stall.  The amso driver
is waiting for the reference count on the QP to go to zero, and it never
will because the amso firmware will never time out the stalled MPA
connection setup.
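
To make the interaction in (2) concrete, here is a toy userspace model of
the refcount stall.  It is only a sketch: every name in it (toy_qp,
toy_connect, and so on) is hypothetical and none of it is the real c2 or
IWCM code.  It assumes one reference held by the QP's owner and one taken
for the duration of connection setup, with destroy only able to finish once
the count reaches zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, not the real c2/IWCM structures. */
struct toy_qp {
    int refcount;      /* references currently held on the QP */
    bool mpa_pending;  /* connection setup is waiting for an MPA reply */
};

/* Creating the QP leaves the owner holding one reference. */
static void toy_qp_init(struct toy_qp *qp)
{
    qp->refcount = 1;
    qp->mpa_pending = false;
}

/* The connection manager takes a reference for the whole setup. */
static void toy_connect(struct toy_qp *qp)
{
    qp->refcount++;
    qp->mpa_pending = true;
}

/*
 * Called only when setup succeeds or fails.  If the firmware never
 * reports either outcome, this never runs and the reference is kept.
 */
static void toy_connect_done(struct toy_qp *qp)
{
    if (qp->mpa_pending) {
        assert(qp->refcount > 0);
        qp->mpa_pending = false;
        qp->refcount--;
    }
}

/*
 * Destroy drops the owner's reference and then has to wait for zero.
 * Returns true if it could finish now, false if it would block.
 * The count is not modified, so this can be called repeatedly.
 */
static bool toy_destroy_can_complete(const struct toy_qp *qp)
{
    return qp->refcount - 1 == 0;
}
```

Calling toy_qp_init() and then toy_connect() leaves the count at 2, so
toy_destroy_can_complete() returns false until toy_connect_done() runs.
In the hang above nothing ever plays the role of toy_connect_done().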

Lemme look more at the amso driver and see if this can be avoided.
Perhaps the amso driver can blow away the QP and stop the stall.  I
thought that's what it did, but I'll look...

Steve.


On Tue, 2006-08-01 at 17:34 -0400, Pete Wyckoff wrote:
> Using the iwarp branch of r8688, with linux-2.6.17.7 on up to date
> x86_64 FC4 SMP with Ammasso cards, I can hang the client side during
> RDMA CM connection setup.
> 
> The scenario is:
> 
>     start server side process on some other node
>     start client process
>     have server die after RDMA_CM_EVENT_CONNECT_REQUEST arrives,
> 	but before calling rdma_accept
>     hit ctrl-C on client
> 
> The last bits of the console log (from c2 debug) are:
> 
>     c2: c2_create_qp:248
>     c2: c2_query_pkey:110
>     c2: c2_qp_modify:145 qp=ffff81007fe3b980, IB_QPS_RESET --> IB_QPS_INIT
>     c2: c2_qp_modify:243 qp=ffff81007fe3b980, cur_state=IB_QPS_INIT
>     c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=1
>     c2: c2_connect:598
>     c2: c2_get_qp Returning QP=ffff81007fe3b980 for QPN=1, device=ffff81003dc85800, refcount=2
> 
> The process is in S state before the ctrl-c, here's a traceback
> (waiting in rdma_get_cm_event):
> 
> ardma-rdmacm  S ffff81003d0f9e68     0  2914   2842                     (NOTLB)
> ffff81003d0f9e68 0000000000000000 0000000000000000 00000000000007b4 
>        ffff81003ef1d280 ffff81007fd06080 0000000000000000 0000000000000000 
>        0000000000000000 0000000000000000 
> Call Trace: <ffffffff8814559c>{:rdma_ucm:ucma_get_event+268}
>        <ffffffff80241e10>{autoremove_wake_function+0} <ffffffff881450ef>{:rdma_ucm:ucma_write+111}
>        <ffffffff80277aed>{vfs_write+189} <ffffffff80278543>{sys_write+83}
>        <ffffffff80209cbe>{system_call+126}
> 
> Then after ctrl-C, one more console log entry:
> 
>     c2: c2_destroy_qp:290 qp=ffff81007fe3b980,qp->state=1
> 
> and now the process is unkillable (but the node does not oops):
> 
> ardma-rdmacm  D ffff81003d0f9bf8     0  2914   2842                     (L-TLB)
> ffff81003d0f9bf8 ffffc2000001ffff ffff81007eee5c80 0000000000009ee7 
>        ffff81003ef1d280 ffff81003f060aa0 ffffffff80232646 ffff81000100c130 
>        ffff81003ec3d140 ffff81003dc85800 
> Call Trace: <ffffffff80232646>{on_each_cpu+38} <ffffffff80265d37>{__remove_vm_area+55}
>        <ffffffff88104073>{:iw_c2:c2_free_qp+355} <ffffffff80241e10>{autoremove_wake_function+0}
>        <ffffffff880ffd44>{:iw_c2:c2_destroy_qp+52} <ffffffff880f0f51>{:ib_core:ib_destroy_qp+49}
>        <ffffffff8810f79a>{:ib_uverbs:ib_uverbs_close+410} <ffffffff80278b32>{__fput+178}
>        <ffffffff80275fb8>{filp_close+104} <ffffffff8022e7ba>{put_files_struct+122}
>        <ffffffff8022fc14>{do_exit+596} <ffffffff80238f4f>{__dequeue_signal+495}
>        <ffffffff80230368>{do_group_exit+216} <ffffffff8023a378>{get_signal_to_deliver+1192}
>        <ffffffff802093d1>{do_signal+129} <ffffffff88145685>{:rdma_ucm:ucma_get_event+501}
>        <ffffffff881450ef>{:rdma_ucm:ucma_write+111} <ffffffff80277aed>{vfs_write+189}
>        <ffffffff80209d47>{sysret_signal+28} <ffffffff80209fcb>{ptregscall_common+103}
> 
> Once I figure out the bug in the server side code I will hopefully
> not have this problem anymore.  But thought you'd like to see it.
> 
> 		-- Pete




