[ofa-general] potential device removal deadlock

Steve Wise swise at opengridcomputing.com
Mon Jan 26 09:43:14 PST 2009


Hey Roland/Sean,

I'm looking at the rdma_[u]cm modules and how they generate 
DEVICE_REMOVAL events to user applications, and I see a potential 
deadlock.  ib_unregister_device() calls the ib_client remove() functions 
in the reverse of the order in which the ib_clients were registered.  And 
if you look at ib_uverbs_remove_one(), you'll see it blocks until all 
references from user apps are released.  So if the ib_uverbs remove() 
gets called _before_ the rdma_cm remove() function, the unregister 
process will deadlock: the rdma_cm remove() that would generate the 
DEVICE_REMOVAL event never runs, so the applications are never told to 
release their references.
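
To make the ordering concrete, here is a toy user-space model in Python.
It is not kernel code.  The function name, the client-name strings and
the return convention are all made up for illustration, and it assumes
an application drops its uverbs references as soon as it sees
DEVICE_REMOVAL.

```python
def unregister_device(clients, app_holds_reference=True):
    """Toy model of the removal walk (hypothetical, not the kernel API).

    clients: client names in the order they were registered.
    Returns (names whose remove() completed, name that blocked or None).
    """
    notified = False   # has the app been sent DEVICE_REMOVAL yet?
    completed = []

    # remove() callbacks run in reverse registration order.
    for name in reversed(clients):
        if name == "rdma_cm":
            # Assumption: the app reacts to DEVICE_REMOVAL by
            # releasing its references.
            notified = True
        elif name == "ib_uverbs":
            # Models the blocking wait for user references to go away.
            if app_holds_reference and not notified:
                return completed, name   # would wait forever
        completed.append(name)

    return completed, None


# ib_uverbs registered last, so its remove() runs first and blocks.
print(unregister_device(["rdma_cm", "ib_uverbs"]))
# rdma_cm registered last, so the app is notified before uverbs waits.
print(unregister_device(["ib_uverbs", "rdma_cm"]))
```

In this model the walk only completes when rdma_cm's remove() happens to
run ahead of the ib_uverbs one, or when no app holds a reference.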

Am I missing something, or is this a bug? 

I would think ib_uverbs should actually blow away the kernel parts of 
the user's handles allowing the device to be removed.  Then the user app 
will discover things went south on the next down call into the uverbs 
code -or- by the DEVICE_REMOVAL rdma-cm event.
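
A rough sketch of that idea, again as a toy Python model rather than
kernel code.  UserHandle, DeviceGone and uverbs_remove_nonblocking are
hypothetical names; the only point is that remove() invalidates the
handles and returns without waiting, and the app finds out on its next
call.

```python
class DeviceGone(Exception):
    """Hypothetical error an app would see after the device is removed."""


class UserHandle:
    """Stand-in for the kernel side of one user-space handle."""

    def __init__(self):
        self.alive = True

    def down_call(self):
        # Any later call on an invalidated handle fails instead of
        # touching the (now removed) device.
        if not self.alive:
            raise DeviceGone("device was removed")
        return "ok"


def uverbs_remove_nonblocking(handles):
    """Invalidate every handle and return at once, without waiting
    for the apps to release anything."""
    for handle in handles:
        handle.alive = False
    return len(handles)


handle = UserHandle()
print(handle.down_call())            # works while the device is present
uverbs_remove_nonblocking([handle])  # removal does not wait on the app
try:
    handle.down_call()
except DeviceGone as err:
    print("app notices on next call:", err)
```

With this shape the removal no longer depends on the order in which the
clients' remove() functions are called.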

I'm thinking about all this in the context of EEH handling for cxgb3.


Thoughts?

Thanks,

Steve.





