[openib-general] CMA deadlock

Mon Apr 3 01:20:56 PDT 2006

Sean, on error paths I am seeing hangs in the CMA, which I think I have tracked
down to the following deadlock scenario:

  A ULP requests address resolution and, on success, route resolution. Route
  resolution succeeds, and inside the callback the ULP calls rdma_connect.
  Now a failure (e.g. out of memory) occurs at the ULP level, so the ULP decides
  to destroy the ID. To this end it returns a failure code from the route
  callback.

  Note that the route resolution callback runs in the per-port MAD workqueue.

  Now, the CMA will call rdma_destroy_id to destroy the ID. Since a CM ID
  exists, rdma_destroy_id will try to destroy that as well.

  This might deadlock: since a CM MAD (REQ) has been created, CM ID destroy will
  now block, waiting for the MAD to be freed, but MADs might not complete since
  we are blocking the MAD workqueue.

A possible solution could be to bounce the SA query callback out to
the rdma WQ. Does this make sense?
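
To make the idea concrete, here is a toy userspace model of the scenario, not
kernel code. Every name in it (struct sim, wq_queue, wq_run_one, destroy_id,
route_handler, sa_callback_bounced) is made up for illustration and none of
them is a CMA, CM or MAD layer API. It assumes each workqueue runs one item at
a time, and it models "blocking" as trying to let the MAD workqueue make
progress: that is impossible while the caller is itself an item on that queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WQ_DEPTH 8

struct work {
    void (*fn)(void *arg);
    void *arg;
};

/* Toy single-threaded workqueue: at most one item runs at a time. */
struct wq {
    struct work items[WQ_DEPTH];
    size_t head, tail;
    bool busy;              /* true while an item is executing */
};

struct sim {
    struct wq mad_wq;       /* stands in for the per-port MAD workqueue */
    struct wq rdma_wq;      /* stands in for the rdma workqueue */
    bool mad_freed;         /* the REQ MAD has completed and been freed */
    bool id_destroyed;      /* destroy ran to completion */
};

static void wq_queue(struct wq *q, void (*fn)(void *), void *arg)
{
    assert(q->tail - q->head < WQ_DEPTH);
    q->items[q->tail % WQ_DEPTH].fn = fn;
    q->items[q->tail % WQ_DEPTH].arg = arg;
    q->tail++;
}

/* Run one pending item. Fails if the queue is empty or already busy. */
static bool wq_run_one(struct wq *q)
{
    struct work w;

    if (q->busy || q->head == q->tail)
        return false;
    w = q->items[q->head % WQ_DEPTH];
    q->head++;
    q->busy = true;
    w.fn(w.arg);
    q->busy = false;
    return true;
}

/* MAD completion: runs on the MAD workqueue and frees the MAD. */
static void mad_done(void *arg)
{
    ((struct sim *)arg)->mad_freed = true;
}

/* Destroy must wait for the outstanding MAD to be freed. */
static void destroy_id(struct sim *s)
{
    while (!s->mad_freed) {
        /* If the MAD workqueue cannot progress, real code would hang here. */
        if (!wq_run_one(&s->mad_wq))
            return;
    }
    s->id_destroyed = true;
}

/* ULP route handler: connect (REQ MAD now pending), then fail and destroy. */
static void route_handler(void *arg)
{
    struct sim *s = arg;

    wq_queue(&s->mad_wq, mad_done, s);
    destroy_id(s);
}

/* Proposed fix: the SA query callback only hands off to the rdma workqueue. */
static void sa_callback_bounced(void *arg)
{
    struct sim *s = arg;

    wq_queue(&s->rdma_wq, route_handler, s);
}

/* Deliver the SA query callback on the MAD workqueue, then drain rdma_wq. */
static bool run(void (*sa_cb)(void *))
{
    struct sim s = {0};

    wq_queue(&s.mad_wq, sa_cb, &s);
    wq_run_one(&s.mad_wq);
    while (wq_run_one(&s.rdma_wq))
        ;
    return s.id_destroyed;
}

/* Handler called directly from the MAD workqueue: destroy never completes. */
bool destroy_completes_direct(void)  { return run(route_handler); }

/* Handler bounced to the rdma workqueue: destroy completes. */
bool destroy_completes_bounced(void) { return run(sa_callback_bounced); }
```

In this model destroy_completes_direct() returns false, because the MAD
completion is stuck behind the handler that is waiting for it, while
destroy_completes_bounced() returns true, because the MAD workqueue is idle by
the time destroy runs. The real fix would of course have to queue a work item
from the SA query callback rather than call the ULP handler directly.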

Further, a comment in ib_cm.h says:

 * Users may not call ib_destroy_cm_id while in the context of this callback;
 * however, returning a non-zero value instructs the communication manager to
 * destroy the @cm_id after the callback completes.

And it seems that, if the user callback returns failure, the CMA actually calls
rdma_destroy_id, which in turn may call ib_destroy_cm_id from inside the CM
callback. I think this might deadlock in a similar way. Again, I think bouncing
the CM event to the rdma WQ would solve this.

Sean, could you look at this please?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies