[ofa-general] Re: [PATCH] mlx4: Do not allow ib userspace open while device is being removed
Roland Dreier
rdreier at cisco.com
Mon Jul 13 12:51:06 PDT 2009
> Userspace apps are supposed to release all ib device resources if
> they receive a fatal async event (IBV_EVENT_DEVICE_FATAL). However,
> the app has no way of knowing when the device has come back up, except
> to repeatedly attempt ibv_open_device() until it succeeds.
>
> However, currently there is no protection against open succeeding when
> the device is in the midst of the removal following the fatal event.
> In this case, the open will succeed, but as a result the device waits
> in the middle of its removal until the new app releases its ib resources
> -- and the new app will not do so, since the open succeeded at a point
> following the fatal event generation.
Does mthca have this bug too?
> --- a/include/linux/mlx4/device.h
> +++ b/include/linux/mlx4/device.h
> @@ -379,6 +379,7 @@ struct mlx4_dev {
> struct radix_tree_root qp_table_tree;
> u32 rev_id;
> char board_id[MLX4_BOARD_ID_LEN];
> + u32 ib_active;
> };
>
> struct mlx4_init_port_param {
I don't much like this approach of putting an IB-driver-specific flag
into the generic mlx4 data structure. Why can't the IB driver manage
this in its own data structure?
> @@ -698,6 +713,8 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void *ibdev_ptr)
> struct mlx4_ib_dev *ibdev = ibdev_ptr;
> int p;
>
> + dev->ib_active = 0;
> +
> mlx4_ib_mad_cleanup(ibdev);
> ib_unregister_device(&ibdev->ib_dev);
This just seems silly -- what is setting ib_active to 0 protecting
against? The ib_unregister_device() call is going make sure we have no
clients left anyway, isn't it?
- R.
More information about the general
mailing list