[ofa-general] Re: potential device removal deadlock
Steve Wise
swise at opengridcomputing.com
Mon Jan 26 11:08:35 PST 2009
Roland Dreier wrote:
> > I'm looking at the rdma_[u]cm modules and how they generate
> > DEVICE_REMOVAL events to user applications, and I see a potential
> > deadlock. ib_unregister_device() calls the ib_client remove()
> > functions in the reverse order from which the ib_clients were
> > registered. And if you look at ib_uverbs_remove_one(), you'll see it
> > will block until all references from user apps are released. So if
> > ib_uverbs remove() gets called _before_ the rdma_cm remove() function,
> > then the unregister process will deadlock since applications don't get
> > notification of the device removal.
> >
> > Am I missing something, or is this a bug?
>
> Yes, looks that way. Making sure that rdma_cm is loaded after ib_uverbs
> works around it.
>
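To make the ordering problem concrete, here is a toy user-space model of it. This is not kernel code: every name in it (model_unregister, CLIENT_UVERBS, and so on) is made up for illustration, and it assumes apps drop all their references as soon as they see DEVICE_REMOVAL. Instead of blocking, the modeled uverbs remove() returns -1 at the point where the real one would wait forever.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; all names here are hypothetical, not kernel APIs. */
enum client_kind { CLIENT_UVERBS, CLIENT_RDMA_CM };

struct model {
    int user_refs;      /* references user apps still hold on the device */
    int apps_notified;  /* set once DEVICE_REMOVAL has been delivered */
};

/* Modeled rdma_cm remove(): delivers DEVICE_REMOVAL.
 * Assumption: apps react by releasing every reference. */
static void model_rdma_cm_remove(struct model *m)
{
    m->apps_notified = 1;
    m->user_refs = 0;
}

/* Modeled uverbs remove(): the real one blocks until refs are gone.
 * Returns 0 if it could finish, -1 where it would wait forever. */
static int model_uverbs_remove(const struct model *m)
{
    return m->user_refs == 0 ? 0 : -1;
}

/* Call remove() in reverse registration order, as described above.
 * Returns 0 on success, -1 on the modeled deadlock. */
int model_unregister(const enum client_kind *reg_order, size_t n,
                     int user_refs)
{
    struct model m = { .user_refs = user_refs, .apps_notified = 0 };

    assert(reg_order != NULL || n == 0);
    for (size_t i = n; i-- > 0; ) {
        if (reg_order[i] == CLIENT_RDMA_CM)
            model_rdma_cm_remove(&m);
        else if (model_uverbs_remove(&m) != 0)
            return -1;
    }
    return 0;
}
```

With the registration order { CLIENT_UVERBS, CLIENT_RDMA_CM } the model completes, which matches the workaround of loading rdma_cm after ib_uverbs. With the order swapped and any app holding a reference, it reports the deadlock.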
How could we fix this in the kernel? Perhaps ib_uverbs should post an
async error analogous to RDMA_CM_EVENT_DEVICE_REMOVAL?
Maybe IB_EVENT_DEVICE_FATAL?
In the case of EEH support of iw_cxgb3, I guess the driver could post
this event. That would at least kick all the user apps...
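
Roughly what I have in mind, again as a toy user-space model rather than real uverbs code. The names (fatal_model_uverbs_remove, struct model_app, MODEL_EVENT_DEVICE_FATAL) are hypothetical, and it assumes a well-behaved app closes everything when it sees the fatal event, which nothing actually guarantees.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; names are hypothetical and merely echo the thread. */
enum model_event { MODEL_EVENT_NONE, MODEL_EVENT_DEVICE_FATAL };

struct model_app {
    int refs;                     /* device references the app holds */
    enum model_event last_event;  /* last async event the app saw */
};

/* Assumed app behavior: on a fatal event, release every reference. */
static void model_app_handle_async(struct model_app *app,
                                   enum model_event ev)
{
    app->last_event = ev;
    if (ev == MODEL_EVENT_DEVICE_FATAL)
        app->refs = 0;
}

/* Modeled uverbs remove(): optionally post a fatal event to each app,
 * then check whether the wait for references could ever finish.
 * Returns true if removal completes, false if it would block forever. */
bool fatal_model_uverbs_remove(struct model_app *apps, int n,
                               bool post_fatal)
{
    int outstanding = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (post_fatal)
            model_app_handle_async(&apps[i], MODEL_EVENT_DEVICE_FATAL);
        outstanding += apps[i].refs;
    }
    return outstanding == 0;
}
```

Without the event the modeled remove never finishes while any app holds a reference. With it, removal no longer depends on which client's remove() runs first, but only as long as the apps cooperate.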
> > I would think ib_uverbs should actually blow away the kernel parts of
> > the user's handles allowing the device to be removed. Then the user
> > app will discover things went south on the next down call into the
> > uverbs code -or- by the DEVICE_REMOVAL rdma-cm event.
>
> Yes, but that's not that easy (eg need to shoot down mappings of PCI
> memory into all userspace processes, etc)... we punted on it when adding
> device removal support to uverbs.
>
>
This makes EEH support pretty painful.
Stevo