[ofa-general] Re: potential device removal deadlock

Steve Wise swise at opengridcomputing.com
Mon Jan 26 11:08:35 PST 2009


Roland Dreier wrote:
>  > I'm looking at the rdma_[u]cm modules and how they generate
>  > DEVICE_REMOVAL events to user applications, and I see a potential
>  > deadlock.  ib_unregister_device() calls the ib_client remove()
>  > functions in the reverse order from which the ib_clients were
>  > registered.  And if you look at ib_uverbs_remove_one(), you'll see it
>  > will block until all references from user apps are released.  So if
>  > ib_uverbs remove() gets called _before_ the rdma_cm remove() function,
>  > then the unregister process will deadlock since applications don't get
>  > notification of the device removal.
>  > 
>  > Am I missing something, or is this a bug? 
>
> Yes, looks that way.  Making sure that rdma_cm is loaded after ib_uverbs
> works around it.
>   


How could we fix this in the kernel? Perhaps ib_uverbs should post an 
async error analogous to RDMA_CM_EVENT_DEVICE_REMOVAL?

Maybe IB_EVENT_DEVICE_FATAL?

In the case of EEH support of iw_cxgb3, I guess the driver could post 
this event. That would at least kick all the user apps...
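
To make the ordering problem concrete, here is a small user-space model of
the removal walk. It is only a sketch, not kernel code: device_model,
client_model, unregister_model and the two tables are made-up names, and it
reports "would block" instead of actually blocking. It assumes that apps drop
their references as soon as they see DEVICE_REMOVAL, and that a fatal async
event posted by ib_uverbs would have the same effect, which glosses over the
PCI mapping shootdown problem.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model state: not a real kernel structure. */
struct device_model {
    int user_refs;          /* references held by user apps */
    int uverbs_posts_fatal; /* proposed: uverbs posts a fatal async event */
};

typedef int (*remove_fn)(struct device_model *dev);

struct client_model {
    const char *name;
    remove_fn remove;
};

/* rdma_cm remove(): apps see DEVICE_REMOVAL and (we assume) release refs. */
static int cm_remove(struct device_model *dev)
{
    dev->user_refs = 0;
    return 0;
}

/* ib_uverbs remove(): returns -1 where the real code would block forever. */
static int uverbs_remove(struct device_model *dev)
{
    if (dev->uverbs_posts_fatal)
        dev->user_refs = 0; /* assumed effect of kicking the apps */
    return dev->user_refs > 0 ? -1 : 0;
}

/* Tables are listed in registration order. */
static const struct client_model uverbs_then_cm[] = {
    { "ib_uverbs", uverbs_remove }, { "rdma_cm", cm_remove },
};
static const struct client_model cm_then_uverbs[] = {
    { "rdma_cm", cm_remove }, { "ib_uverbs", uverbs_remove },
};

/* Walk the clients in reverse registration order, as described above. */
static int unregister_model(const struct client_model *clients, int n,
                            int user_refs, int uverbs_posts_fatal)
{
    struct device_model dev = { user_refs, uverbs_posts_fatal };

    assert(n >= 0);
    for (int i = n - 1; i >= 0; i--) {
        if (clients[i].remove(&dev) != 0) {
            printf("%s remove() would block with %d refs\n",
                   clients[i].name, dev.user_refs);
            return -1;
        }
    }
    return 0;
}
```

With three user references, unregister_model(uverbs_then_cm, 2, 3, 0)
returns 0 (the load-order workaround), unregister_model(cm_then_uverbs, 2, 3, 0)
returns -1 (the deadlock case), and setting the last argument to 1 lets the
second ordering complete as well.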

>  > I would think ib_uverbs should actually blow away the kernel parts of
>  > the user's handles allowing the device to be removed.  Then the user
>  > app will discover things went south on the next down call into the
>  > uverbs code -or- by the DEVICE_REMOVAL rdma-cm event.
>
> Yes, but that's not that easy (eg need to shoot down mappings of PCI
> memory into all userspace processes, etc)... we punted on it when adding
> device removal support to uverbs.
>
>   

This makes EEH support pretty painful.

Stevo




