[ofa-general] Re: [PATCH] mlx4: Do not allow ib userspace open while device is being removed

Roland Dreier rdreier at cisco.com
Mon Jul 13 12:51:06 PDT 2009


 > Userspace apps are supposed to release all ib device resources if
 > they receive a fatal async event (IBV_EVENT_DEVICE_FATAL).  However,
 > the app has no way of knowing when the device has come back up, except
 > to repeatedly attempt ibv_open_device() until it succeeds.
 > 
 > However, currently there is no protection against open succeeding when
 > the device is in the midst of the removal following the fatal event.
 > In this case, the open will succeed, but as a result the device waits
 > in the middle of its removal until the new app releases its ib resources
 >  -- and the new app will not do so, since the open succeeded at a point
 > following the fatal event generation.

Does mthca have this bug too?

 > --- a/include/linux/mlx4/device.h
 > +++ b/include/linux/mlx4/device.h
 > @@ -379,6 +379,7 @@ struct mlx4_dev {
 >  	struct radix_tree_root	qp_table_tree;
 >  	u32			rev_id;
 >  	char			board_id[MLX4_BOARD_ID_LEN];
 > +	u32			ib_active;
 >  };
 >  
 >  struct mlx4_init_port_param {

I don't much like this approach of putting an IB-driver-specific flag
into the generic mlx4 data structure.  Why can't the IB driver manage
this in its own data structure?

 > @@ -698,6 +713,8 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
 >  	struct mlx4_ib_dev *ibdev = ibdev_ptr;
 >  	int p;
 >  
 > +	dev->ib_active = 0;
 > +
 >  	mlx4_ib_mad_cleanup(ibdev);
 >  	ib_unregister_device(&ibdev->ib_dev);

This just seems silly -- what is setting ib_active to 0 protecting
against?  The ib_unregister_device() call is going make sure we have no
clients left anyway, isn't it?

 - R.



More information about the general mailing list