[ofa-general] rping / librdmacm deadlock question

Steve Wise swise at opengridcomputing.com
Wed Jul 18 12:03:20 PDT 2007


Sean Hefty wrote:
>> I have, I believe, traced this to a deadlock between ib_destroy_qp() 
>> and ucma_close(). It looks like librdmacm has a ((destructor)) 
>> function defined that results in a call to ibv_device_close() and 
>> ultimately in <device>::destroy_qp().   That seems reasonable, and it 
>> all happens as the OS unloads the application.
>> However, it is (I believe) happening before the "rdma_cm" device file 
>> descriptor is 'closed' by the OS as the application terminates.  
>> [rdma_destroy_event_channel() would normally do this, but it doesn't 
>> get called when the application is interrupted by SIGINT.]
> 
> This seems like an iWarp specific issue caused by the following code in 
> iw_cm_connect():
> 
>     /* Get the ib_qp given the QPN */
>     qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn);
>     if (!qp) {
>         spin_unlock_irqrestore(&cm_id_priv->lock, flags);
>         return -EINVAL;
>     }
>     cm_id->device->iwcm->add_ref(qp);
> 
> I think the reference is normally removed in cm_close_handler:
> 
>     if (cm_id_priv->qp) {
>         cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>         cm_id_priv->qp = NULL;
>     }
> 
> 
> The upstream iWarp drivers must already be able to handle this 
> situation, or I'm sure we would have seen the problem before.  I'm just 
> not familiar enough with the iWarp drivers to see what they do to handle 
> it.  I'll continue reading through the code, but maybe Steve can 
> explain how to avoid the problem.
> 
> I wonder if it would be better if the iWarp CM acquired/released the QP 
> reference on a per call basis, rather than holding a reference 
> throughout the entire connection.
> 

The design assumes the iwcm can hold this reference and cache the qp ptr. 
In the iwarp design, the cm_id (connection) and qp are tightly bound 
once the connection is transitioned into rdma mode.  This is different 
from infiniband.
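
As a rough illustration of the reference lifetime described above, here is a
minimal user-space sketch. It is not the kernel code: struct model_qp,
struct model_cm_id and every model_* function are hypothetical stand-ins
for the add_ref()/rem_ref() calls in the quoted iw_cm_connect() and
cm_close_handler excerpts. The last helper assumes that a driver's
destroy_qp cannot finish while someone other than the owner still holds a
reference; that is an assumption made for illustration, not something
established in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a QP: only the reference count is modelled. */
struct model_qp {
    int refcount;            /* 1 for the owner, +1 per additional holder */
};

/* Hypothetical stand-in for the iwcm's private cm_id state. */
struct model_cm_id {
    struct model_qp *qp;     /* cached pointer, like cm_id_priv->qp */
};

static void model_add_ref(struct model_qp *qp)
{
    qp->refcount++;
}

static void model_rem_ref(struct model_qp *qp)
{
    assert(qp->refcount > 0);
    qp->refcount--;
}

/* Models the connect path: take a reference and cache the pointer. */
static int model_connect(struct model_cm_id *id, struct model_qp *qp)
{
    if (!qp)
        return -EINVAL;
    model_add_ref(qp);
    id->qp = qp;
    return 0;
}

/* Models the close handler: drop the reference taken at connect time. */
static void model_close(struct model_cm_id *id)
{
    if (id->qp) {
        model_rem_ref(id->qp);
        id->qp = NULL;
    }
}

/*
 * ASSUMPTION: destroy can only complete when the owner's reference is the
 * last one left. Returns true if destroy would not have to wait.
 */
static bool model_destroy_would_complete(const struct model_qp *qp)
{
    return qp->refcount == 1;
}
```

In this model the cm_id's reference exists from model_connect() until
model_close(), so a destroy attempted in between would have to wait; once
the close path has run, only the owner's reference remains.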

I still don't see the deadlock?


Steve.


