[openib-general] rdma cm process hang

Steve Wise swise at opengridcomputing.com
Thu Aug 3 06:19:30 PDT 2006


On Wed, 2006-08-02 at 11:57 -0400, Pete Wyckoff wrote:
> swise at opengridcomputing.com wrote on Wed, 02 Aug 2006 10:09 -0500:
> > This hang is due to 2 things:
> > 
> > 1) the amso card will _never_ time out a connection that is awaiting an
> > MPA reply.  That is exactly what is happening here.  The fix for this
> > (timeout mpa connection setup stalls) is a firmware fix and we don't
> > have the firmware src.
> > 
> > 2) the IWCM holds a reference on the QP until connection setup either
> > succeeds or fails.  So that's where we get the stall.  The amso driver
> > is waiting for the reference on the qp to go to zero, and it never will
> > because the amso firmware will never timeout the stalled mpa connection
> > setup.
> > 
> > Lemme look more at the amso driver and see if this can be avoided.
> > Perhaps the amso driver can blow away the qp and stop the stall.  I
> > thought that's what it did, but I'll look...
> 
> Thanks for looking.  I'd just come to the conclusion that it was
> waiting on the qp refcnt, but didn't get much farther when your mail
> arrived.
> 

I don't know when, or if, I'll have time to address this limitation in
the Ammasso firmware.  But there is a way to work around it in the
driver (if anyone wants to implement it):

1) add a timer to the c2_qp struct and start it when c2_llp_connect() is
called.

2) if the timer fires, generate a CONNECT_REPLY upcall to the IWCM with
status TIMEDOUT.  Mark in the qp that the connect timed out. 

3) deal with the rare condition that the timer fires at or about the
same time the connection really does get established:  if the adapter
passes up a CCAE_ACTIVE_CONNECT_RESULTS -after- the timer fires but
before the qp is destroyed by the consumer, then you must squelch this
event and probably destroy the HWQP at least from the adapter's
perspective...
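
The three steps above could be modelled roughly as follows.  This is only
a userspace sketch of the state handling, not driver code: every name in
it (model_qp, model_connect, model_timer_fired, model_connect_results,
the enums) is made up for illustration, and it assumes the caller
serializes the two handlers.  A real version in the c2 driver would need
a kernel timer in c2_qp and a lock around the state check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these exist in the c2 driver. */
enum conn_state {
    CONN_PENDING,      /* connect issued, timer armed */
    CONN_ESTABLISHED,  /* adapter reported the connect result first */
    CONN_TIMED_OUT     /* timer fired first */
};

enum upcall { UPCALL_NONE, UPCALL_CONNECT_OK, UPCALL_CONNECT_TIMEDOUT };

struct model_qp {
    enum conn_state state;
    bool timer_armed;
    bool hwqp_torn_down;     /* adapter-side QP destroyed after a late event */
    int upcalls;             /* CONNECT_REPLY upcalls delivered so far */
    enum upcall last_upcall;
};

static void deliver_upcall(struct model_qp *qp, enum upcall u)
{
    qp->upcalls++;
    qp->last_upcall = u;
}

/* Step 1: start the timer when the connect is issued. */
void model_connect(struct model_qp *qp)
{
    assert(qp != NULL);
    qp->state = CONN_PENDING;
    qp->timer_armed = true;
    qp->hwqp_torn_down = false;
    qp->upcalls = 0;
    qp->last_upcall = UPCALL_NONE;
}

/* Step 2: timer fired; report TIMEDOUT and mark the qp. */
void model_timer_fired(struct model_qp *qp)
{
    assert(qp != NULL);
    if (qp->state != CONN_PENDING)
        return;              /* connection already resolved, nothing to do */
    qp->timer_armed = false;
    qp->state = CONN_TIMED_OUT;
    deliver_upcall(qp, UPCALL_CONNECT_TIMEDOUT);
}

/* Step 3: adapter passed up the active connect result. */
void model_connect_results(struct model_qp *qp)
{
    assert(qp != NULL);
    if (qp->state == CONN_TIMED_OUT) {
        /* Late event: squelch it and tear down the adapter's QP. */
        qp->hwqp_torn_down = true;
        return;
    }
    if (qp->state != CONN_PENDING)
        return;              /* duplicate event, ignore */
    qp->timer_armed = false; /* cancel the timer */
    qp->state = CONN_ESTABLISHED;
    deliver_upcall(qp, UPCALL_CONNECT_OK);
}
```

Whichever handler runs first decides the outcome, so the IWCM sees
exactly one reply per connect attempt, and with it the QP reference can
be dropped.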


> Testing on mthca would be a bit more difficult here, but hopefully
> that's not an issue now.

There's no need.  This is an Ammasso-only issue.

Steve.




