[ofa-general] rping / librdmacm deadlock question

Steve Wise swise at opengridcomputing.com
Wed Jul 18 12:34:17 PDT 2007


Steve Wise wrote:
> Sean Hefty wrote:
>>> I have, I believe, traced this to a deadlock between ib_destroy_qp() 
>>> and ucma_close(). It looks like librdmacm has a ((destructor)) 
>>> function defined that results in a call to ibv_device_close() and 
>>> ultimately in <device>::destroy_qp().   That seems reasonable, and it 
>>> all happens as the OS unloads the application.
>>> However, it is (I believe) happening before the "rdma_cm" device file 
>>> descriptor is 'closed' by the OS as the application terminates.  
>>> [rdma_destroy_event_channel() would normally do this, but it doesn't 
>>> get called when the application is interrupted by SIGINT.]
>>
>> This seems like an iWarp specific issue caused by the following code 
>> in iw_cm_connect():
>>
>>     /* Get the ib_qp given the QPN */
>>     qp = cm_id->device->iwcm->get_qp(cm_id->device, iw_param->qpn);
>>     if (!qp) {
>>         spin_unlock_irqrestore(&cm_id_priv->lock, flags);
>>         return -EINVAL;
>>     }
>>     cm_id->device->iwcm->add_ref(qp);
>>
>> I think the reference is normally removed in cm_close_handler:
>>
>>     if (cm_id_priv->qp) {
>>         cm_id_priv->id.device->iwcm->rem_ref(cm_id_priv->qp);
>>         cm_id_priv->qp = NULL;
>>     }
>>
>>
>> The upstream iWarp drivers must already be able to handle this 
>> situation, or I'm sure we would have seen the problem before.  I'm 
>> just not familiar enough with the iWarp drivers to see what they do to 
>> handle it.  I'll continue reading through the code, but maybe Steve 
>> can explain how to avoid the problem.
>>
>> I wonder if it would be better if the iWarp CM acquired/released the 
>> QP reference on a per call basis, rather than holding a reference 
>> throughout the entire connection.
>>
> 
> The design assumes the iwcm can hold this reference and cache the qp ptr. 
> In the iwarp design, the cm_id (connection) and qp are tightly bound 
> once the connection is transitioned into rdma mode.  This is different 
> than infiniband.
> 
> I still don't see the deadlock?
> 

I've re-read this thread and I think I've posted the answers for Clark...

Steve.

> 
> Steve.
> 



