[ofw] [PATCH] ib/cm: fix handling failed send completions

Sean Hefty sean.hefty at intel.com
Mon Jan 4 11:57:19 PST 2010


>1) what larger set of application problems does this patch address?

>>> For example, on a short-lived connection, a REP MAD was observed to
>>> complete with status canceled.  This is normal.  However, the user had
>>> already attempted to disconnect by sending a DREQ, which left the CEP
>>> in the DREQ_SENT state by the time the REP MAD completed.  Since the
>>> REP status was not success and the state was DREQ_SENT, the code
>>> assumed that the DREQ had failed and transitioned the CEP into
>>> TIMEWAIT.  The result is that the DREQ is never matched with a DREP
>>> or canceled, yet it still holds a reference on the CEP.
>>>
>>> Until the DREQ times out (the timeout depends on the connection, but
>>> can easily reach a minute), attempts to destroy the CEP are blocked.
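
To make the misattribution above concrete, here is a minimal, simplified
sketch.  It is not the actual ib/cm code or the patch itself: the type and
function names (cep_t, on_send_done_by_state, on_send_done_by_mad) are
hypothetical, and the second handler only shows one plausible way to avoid
the problem, namely checking which MAD completed rather than inferring it
from the CEP state.

```c
/* Hypothetical, simplified model of the scenario; not the real ib/cm code. */
typedef enum { CEP_REP_SENT, CEP_DREQ_SENT, CEP_TIMEWAIT } cep_state_t;
typedef enum { MAD_REP, MAD_DREQ } mad_type_t;
typedef enum { SEND_SUCCESS, SEND_CANCELED, SEND_FAILED } send_status_t;

typedef struct {
    cep_state_t state;
} cep_t;

/* Problem case: the handler infers which MAD failed from the CEP state
 * alone, so a canceled REP seen in DREQ_SENT is treated as a failed DREQ. */
void on_send_done_by_state(cep_t *cep, mad_type_t mad, send_status_t status)
{
    (void)mad;                      /* the MAD type is not consulted */
    if (status == SEND_SUCCESS)
        return;
    if (cep->state == CEP_DREQ_SENT)
        cep->state = CEP_TIMEWAIT;  /* DREQ is in fact still outstanding */
}

/* Assumed alternative: only a failed DREQ moves the CEP to TIMEWAIT.
 * A non-success REP completion in DREQ_SENT leaves the state unchanged. */
void on_send_done_by_mad(cep_t *cep, mad_type_t mad, send_status_t status)
{
    if (status == SEND_SUCCESS)
        return;
    if (mad == MAD_DREQ && cep->state == CEP_DREQ_SENT)
        cep->state = CEP_TIMEWAIT;
}
```

With a CEP in DREQ_SENT, passing (MAD_REP, SEND_CANCELED) to the first
handler moves it to TIMEWAIT while the DREQ is still pending; the second
handler leaves it in DREQ_SENT until the DREQ itself completes.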

>2) what type/degree of testing has been successfully passed with this patch
>applied?

The patch passes ndconn, the test that I used to discover and diagnose the
problem.  I also ran all the other ND tests, several dapl tests, the librdmacm
samples, and Intel MPI.



