[ofw] CM ref counting issues...

Sean Hefty sean.hefty at intel.com
Mon Dec 7 16:27:26 PST 2009


>Running ndconn over winverbs results in the IB CM hanging.  I'm running the
>client and server on a single system, and both sides hang during cleanup.  It's
>possible there is a problem in the winverbs driver, but it looks more like some
>issue in ibal to me.

After more debugging, the problem appears to be a deadlock of some kind.

>kal_cep_destroy FFFFFA80044F3BB0 0x185 ref 0x1 signal 0
>cm_destroy_id 0x102 cid 0x185
>
>^^^ This and below indicate a reference counting issue.
>The extra reference should be an outstanding MAD that never
>releases its reference.  (Increasing the wait timeout doesn't help.)

I modified the code further.  Now when cm_destroy_id times out (after waiting
20 seconds), I mark the id as destroyed but do not free it.

In the destroy callback, I check whether the callback is for an id that has
been marked as destroyed.  In every case, the destroy callback was invoked for
ids that had already timed out and been marked as destroyed.
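
A minimal sketch of the debugging change described above, in C.  All names
here (sketch_cep, sketch_destroy_id, sketch_destroy_cb) are hypothetical
stand-ins and not the actual ibal code, and the 20 second wait is reduced to a
plain reference count check so the logic can be followed in isolation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a CM endpoint; NOT the real ibal structure. */
struct sketch_cep {
    unsigned int cid;      /* connection id, e.g. 0x185 in the trace */
    int ref_cnt;           /* outstanding references on the id */
    bool destroyed;        /* set when the destroy wait timed out */
    int late_callbacks;    /* destroy callbacks seen after the timeout */
};

/*
 * Assumption: the real code blocks for up to 20 seconds here.  This
 * sketch only reports whether the references have already drained.
 */
static bool sketch_wait_for_refs(const struct sketch_cep *cep)
{
    return cep->ref_cnt == 0;
}

/*
 * Returns true if the id may be freed.  On timeout the id is only
 * marked as destroyed and is deliberately not freed, so that a late
 * callback can still inspect it.
 */
static bool sketch_destroy_id(struct sketch_cep *cep)
{
    assert(cep != NULL);
    if (sketch_wait_for_refs(cep))
        return true;
    cep->destroyed = true;
    return false;
}

/*
 * Destroy callback: record whether it arrived for an id that had
 * already been marked as destroyed, then drop the reference it held.
 */
static void sketch_destroy_cb(struct sketch_cep *cep)
{
    assert(cep != NULL);
    if (cep->destroyed)
        cep->late_callbacks++;
    if (cep->ref_cnt > 0)
        cep->ref_cnt--;
}
```

With an id that starts at ref 1, sketch_destroy_id() returns false and sets the
destroyed flag; a later sketch_destroy_cb() call then counts as a late
callback, which is the pattern seen in every failing case.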

Winverbs should be running in the application's thread context during an
application exit, but it also makes use of the system delayed work queue.  I'm
not sure what thread context the IB CM requires, specifically for the thread
that processes timewaits.

- Sean



