[openib-general] rdma cm process hang

Pete Wyckoff pw at osc.edu
Wed Aug 2 08:57:21 PDT 2006


swise at opengridcomputing.com wrote on Wed, 02 Aug 2006 10:09 -0500:
> This hang is due to 2 things:
> 
> 1) the amso card will _never_ time out a connection that is awaiting an
> MPA reply.  That is exactly what is happening here.  The fix for this
> (timing out stalled MPA connection setups) is a firmware fix and we
> don't have the firmware src.
> 
> 2) the IWCM holds a reference on the QP until connection setup either
> succeeds or fails.  So that's where we get the stall.  The amso driver
> is waiting for the reference count on the qp to go to zero, and it
> never will because the amso firmware will never time out the stalled
> MPA connection setup.
> 
> Lemme look more at the amso driver and see if this can be avoided.
> Perhaps the amso driver can blow away the qp and stop the stall.  I
> thought that's what it did, but I'll look...

Thanks for looking.  I'd just come to the conclusion that it was
waiting on the qp refcnt, but hadn't gotten much farther when your
mail arrived.
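
For anyone following along, the stall described above can be modelled
in a few lines of user-space Python.  This is only a toy sketch, not
the amso driver code: the class ToyQP and its methods are hypothetical
names, and the timeout argument exists purely so the demo returns
instead of hanging the way the real process does.

```python
import threading


class ToyQP:
    """Toy stand-in for a QP with a reference count (hypothetical, not c2 code)."""

    def __init__(self):
        self.refcount = 1  # reference held by the owner of the QP
        self._cond = threading.Condition()

    def get(self):
        """Take an extra reference, as connection setup would."""
        with self._cond:
            self.refcount += 1

    def put(self):
        """Drop a reference and wake anyone waiting in destroy()."""
        with self._cond:
            self.refcount -= 1
            self._cond.notify_all()

    def destroy(self, timeout=None):
        """Drop the owner's reference, then wait for the count to hit zero.

        Returns True if the count reached zero, False if the wait timed out.
        With timeout=None this blocks forever while a reference is held.
        """
        with self._cond:
            self.refcount -= 1
            return self._cond.wait_for(lambda: self.refcount == 0, timeout=timeout)


if __name__ == "__main__":
    qp = ToyQP()
    qp.get()  # connection setup holds a reference and never completes
    print("destroy finished:", qp.destroy(timeout=0.1))
    print("refcount left behind:", qp.refcount)
```

If nothing ever calls put() for the connection-setup reference,
destroy() without a timeout never returns, which is the shape of the
hang seen after ctrl-C.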

Testing on mthca would be a bit more difficult here, but hopefully
that's no longer necessary.

Here's an easier test case using ucmatose.  On a single machine,
pick an IP address that is theoretically reachable but has nothing
listening on it, viz:

    am30$ ip a s dev iw2
    5: iw2: <NOARP,UP,10000> mtu 1500 qdisc noqueue
        link/ether 00:0d:b2:00:04:8f brd 00:00:00:00:00:00
        inet 10.100.9.30/24 brd 10.100.9.255 scope global iw2
    am30$ ucmatose 10.100.9.31
    cmatose: starting client
    cmatose: connecting

Then hit ctrl-C.  The full console log follows; the last line appears
only after ctrl-C:

c2: c2_alloc_ucontext:135
c2: c2_query_device:68
c2: c2_alloc_pd:163
c2: c2_create_qp:248
c2: c2_query_pkey:110
c2: c2_qp_modify:145 qp=ffff81007f1f3d80, IB_QPS_RESET --> IB_QPS_INIT
c2: c2_qp_modify:243 qp=ffff81007f1f3d80, cur_state=IB_QPS_INIT
c2: c2_reg_user_mr:442
c2: i=1, offset=2048, page_size=4096, length=100, user_base=504800, virt_base=504800, acc=00000098, c2mr=ffff81003f002f00
c2:     [0] 3d24c000
c2: c2_get_qp Returning QP=ffff81007f1f3d80 for QPN=1, device=ffff81003df28800, refcount=1
c2: c2_connect:598
c2: c2_get_qp Returning QP=ffff81007f1f3d80 for QPN=1, device=ffff81003df28800, refcount=2
c2: c2_destroy_qp:290 qp=ffff81007f1f3d80,qp->state=1
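
To pull the interesting bit out of a longer log, a small script along
these lines could report the last refcount printed for each QP before
its c2_destroy_qp line.  It is a sketch that assumes the message
formats shown above; the function name and regular expressions are my
own, and the value reported is only the last one logged, not
necessarily the count at the instant of the destroy.

```python
import re

# Patterns assume the c2 debug message formats shown in the log above.
REFCOUNT_RE = re.compile(r"c2_get_qp Returning QP=([0-9a-f]+) .*refcount=(\d+)")
DESTROY_RE = re.compile(r"c2_destroy_qp:\d+ qp=([0-9a-f]+)")


def last_refcount_at_destroy(log_text):
    """Map each destroyed QP address to the last refcount logged before it.

    QPs destroyed without any earlier refcount message map to None.
    """
    last_seen = {}
    at_destroy = {}
    for line in log_text.splitlines():
        match = REFCOUNT_RE.search(line)
        if match:
            last_seen[match.group(1)] = int(match.group(2))
            continue
        match = DESTROY_RE.search(line)
        if match:
            at_destroy[match.group(1)] = last_seen.get(match.group(1))
    return at_destroy


if __name__ == "__main__":
    sample = (
        "c2: c2_get_qp Returning QP=ffff81007f1f3d80 for QPN=1, "
        "device=ffff81003df28800, refcount=2\n"
        "c2: c2_destroy_qp:290 qp=ffff81007f1f3d80,qp->state=1\n"
    )
    for qp, count in last_refcount_at_destroy(sample).items():
        print(qp, count)
```

A logged value above 1 just before the destroy means something besides
the owner still holds a reference, which matches the log here.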

		-- Pete



