[Openib-windows] races on __destroy_obj function

Yossi Leybovich sleybo at mellanox.co.il
Mon Jul 10 08:44:55 PDT 2006


 
Fab,

Consider the following scenario:

A call to ipoib_port_destroy invokes the port's destroy obj function.
The destroy function sends every object in the port's child list to destroy
(the endpoints use ASYNC destroy).
The port finishes its destroy function and waits until ref_cnt = 0.

Now, while an endpoint is destroying itself, it calls ipoib_port_resume
(from __endpt_cleanup). In ipoib_port_resume, in the case of an MCAST
packet, the code creates an endpoint and tries to join the mcast group.
This new endpoint is added to the port's child list and takes a reference.
Now the port's ref_cnt will never be 0 ---> deadlock
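
The sequence above can be modelled with a small, self-contained C sketch. It is only an illustration of the reference counting, not the IPoIB code: every name in it (sim_port_t, sim_port_resume, sim_endpt_cleanup, sim_destroy_port) is hypothetical, and it assumes the resume call happens before the endpoint drops its own reference, as the trace suggests.

```c
#include <stdbool.h>

/* Hypothetical model of the port; only the reference count matters here. */
typedef struct sim_port {
    int  ref_cnt;        /* one reference per live child endpoint */
    bool destroying;     /* set once destroy starts; nothing below checks it */
    bool mcast_pending;  /* a queued MCAST send waiting for resume */
} sim_port_t;

/* Stand-in for the resume path: a pending MCAST packet creates a new
 * endpoint, which takes a port reference even though destroy has started. */
static void sim_port_resume(sim_port_t *port)
{
    if (port->mcast_pending) {
        port->mcast_pending = false;
        port->ref_cnt++;            /* reference held by the new endpoint */
    }
}

/* Stand-in for an endpoint's async cleanup: it calls resume first and
 * only then releases its own reference on the port. */
static void sim_endpt_cleanup(sim_port_t *port)
{
    sim_port_resume(port);
    port->ref_cnt--;
}

/* Runs the scenario and returns the count the port would be waiting on. */
static int sim_destroy_port(sim_port_t *port, int n_endpts)
{
    port->destroying = true;
    for (int i = 0; i < n_endpts; i++)
        sim_endpt_cleanup(port);
    return port->ref_cnt;           /* non-zero means the wait never ends */
}
```

With two endpoints and one pending MCAST packet, sim_destroy_port returns 1 in this model, so the port would wait forever. With no pending packet it returns 0.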

Here are the prints I generated on one of our machines.


~1:[IPoIB]:__endpt_mgr_reset_all() [
~1:[IPoIB]:__endpt_mgr_reset_all() ]
~1:[IPoIB]:__endpt_mgr_remove_all() [
~1:[IPoIB]:__endpt_mgr_remove_all() ]
~1:[IPoIB]:__endpt_destroying() [
~1:[IPoIB]:__endpt_destroying() ]
~1:[IPoIB]:__endpt_cleanup() [
~1:[IPoIB]:ipoib_port_resume(): ipoib_resume.....
~1:[IPoIB]:__endpt_destroying() [
~1:[IPoIB]:__endpt_destroying() ]
~1:[IPoIB]:__endpt_mgr_ref() [
~1:[IPoIB]:__endpt_mgr_ref(): Look for :	  MAC: 01-00-5E-00-00-16
~1:[IPoIB]:__endpt_mgr_ref(): Failed endpoint lookup.
~1:[IPoIB]:__endpt_mgr_ref() ]
~1:[IPoIB]:ipoib_port_join_mcast() [
~1:[IPoIB]:__endpt_mgr_ref() [
~1:[IPoIB]:__endpt_mgr_ref(): Look for :	  MAC: 01-00-5E-00-00-16
~1:[IPoIB]:__endpt_mgr_ref(): Failed endpoint lookup.
~1:[IPoIB]:__endpt_mgr_ref() ]
~1:[IPoIB]:ipoib_endpt_create() [
~1:[IPoIB]:ipoib_endpt_create() ]
~1:[IPoIB]:__endpt_mgr_insert_locked() [
~1:[IPoIB]:__endpt_mgr_insert_locked(): insert  :	  MAC: 01-00-5E-00-00-16
~1:[IPoIB]:__endpt_mgr_insert() [
~1:[IPoIB]:__endpt_mgr_insert() !ERROR!: take ref  type 12 ref_cnt 8
~1:[IPoIB]:__endpt_mgr_insert() ]
~1:[IPoIB]:__endpt_mgr_insert_locked() ]
~1:[IPoIB]:ipoib_port_ref() !ERROR!: take ref type 11 ref_cnt 9
~1:[IPoIB]:ipoib_port_join_mcast() ]
~1:[IPoIB]:__endpt_destroying() [
~1:[IPoIB]:__endpt_destroying() ]
~1:[IPoIB]:__endpt_destroying() [
~1:[IPoIB]:__endpt_cleanup() ]


You can see that while we are trying to destroy everything,
ipoib_port_resume creates a new endpoint.
I think the proper fix for this is in
cl_obj_insert_rel_parent_locked.

It needs to exit if the parent state != CL_INITIALIZED (the same as we do in
al_obj).
I created a patch that checks the parent's state when adding an object to
the child list; if the parent is not in CL_INITIALIZED, it returns an error.
This affects endpt_insert and endpt_insert_locked, which now return a
status.
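
A minimal sketch of the kind of check I mean is below. It is not the actual patch and does not use the real complib definitions: the toy_ types, the TOY_ constants and toy_insert_rel_parent_locked are hypothetical stand-ins, and the caller is assumed to already hold the parent's lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the object states and status codes. */
typedef enum { TOY_UNINITIALIZED, TOY_INITIALIZED, TOY_DESTROYING } toy_state_t;
typedef enum { TOY_SUCCESS, TOY_INVALID_STATE } toy_status_t;

typedef struct toy_obj {
    toy_state_t     state;
    int             ref_cnt;    /* references held by children */
    int             child_cnt;  /* number of objects in the child list */
    struct toy_obj *parent;     /* NULL until attached to a parent */
} toy_obj_t;

/* Attach child to parent only while the parent is initialized.
 * On failure nothing is modified: no list entry and no reference,
 * so a parent that is being destroyed can still reach ref_cnt == 0. */
static toy_status_t toy_insert_rel_parent_locked(toy_obj_t *parent,
                                                 toy_obj_t *child)
{
    assert(parent != NULL && child != NULL);

    if (parent->state != TOY_INITIALIZED)
        return TOY_INVALID_STATE;

    child->parent = parent;
    parent->child_cnt++;
    parent->ref_cnt++;
    return TOY_SUCCESS;
}
```

In this model a caller that gets TOY_INVALID_STATE back has to free the endpoint it just created instead of using it, which is why the insert functions need to return a status.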

I also removed the port_resume call from the destruction of the endpoint
and moved it to the destruction of the port.
I did this because, after the first change, the port is destroyed
before the endpoint, and port_resume gets a NULL object for the port.
I am testing the patch now to see its effect.

Do you have any comments?

Yossi 









