[ewg] Re: [PATCHv4] IB/ipoib: Fix ipoib handling for pkey reordering

Thu Mar 29 08:07:32 PDT 2007

> Quoting Moni Levy <myopenib at gmail.com>:
> Subject: [PATCHv4] IB/ipoib: Fix ipoib handling for pkey reordering
> 
> This issue was found during partitioning & SM fail over testing. The fix was
> tested over the weekend with pkey reshuffling, removal and addition every few
> seconds concurrent with OFED restart.

But probably not together with the patch below, right?
I only came up with the proposal yesterday, so it seems unlikely
the patch could have been tested over the weekend ...

> Please look at the "IB/cache: Add
> ib_cache report for cache in process" patch also.
> 
> Changes from v1:
>         * added flush flag to ipoib_ib_dev_stop(), ipoib_ib_dev_down() alike
>         * fixed a bug in device extraction from the work struct
>         * removed some warnings in case they are caused due to missing PKEY as 
>           this seems like a valid flow now.

Here's an idea:

Instead of adding yet another flag to ipoib_ib_dev_stop and friends, and
worrying about potential races when ipoib_ib_dev_stop is run from both the
ipoib workqueue and another thread, how about making them *never* flush, and
using a queue + flush combination when they need to run outside the ipoib
workqueue?
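
To make the queue + flush idea concrete, here is a rough single-threaded
userspace model, not ipoib code. Every name in it (model_wq, model_dev,
model_dev_stop_sync and so on) is made up for illustration, and the real
thing would of course use the kernel workqueue API instead. The point is
only the shape: the stop path itself never flushes, and a caller outside
the queue queues the stop as a work item and then flushes to wait for it.

```c
#include <stddef.h>

/* Hypothetical single-threaded model of a work queue; NOT the kernel API. */
#define MAX_WORK 8

typedef void (*work_fn)(void *ctx);

struct model_wq {
	work_fn fn[MAX_WORK];
	void   *ctx[MAX_WORK];
	size_t  head, tail;	/* head: next item to run, tail: next free slot */
	int     in_worker;	/* nonzero while a work item is executing */
};

struct model_dev {
	struct model_wq *wq;
	int running;
	int stop_calls;
};

/* Append a work item. Slots are never reused, which is enough for a sketch. */
static int model_queue_work(struct model_wq *wq, work_fn fn, void *ctx)
{
	if (wq->tail == MAX_WORK)
		return -1;
	wq->fn[wq->tail] = fn;
	wq->ctx[wq->tail] = ctx;
	wq->tail++;
	return 0;
}

/*
 * Run everything that is pending. Returns -1 when called from inside a
 * work item: with a real work queue that would be a flush waiting on
 * itself, which is exactly what the stop path must avoid.
 */
static int model_flush(struct model_wq *wq)
{
	if (wq->in_worker)
		return -1;
	while (wq->head < wq->tail) {
		size_t i = wq->head++;

		wq->in_worker = 1;
		wq->fn[i](wq->ctx[i]);
		wq->in_worker = 0;
	}
	return 0;
}

/* The stop path never flushes, so work items may call it directly. */
static void model_dev_stop(struct model_dev *dev)
{
	dev->running = 0;
	dev->stop_calls++;
}

/* Work-item wrapper around the stop path. */
static void model_stop_work(void *ctx)
{
	model_dev_stop(ctx);
}

/* For callers outside the queue: queue the stop, then flush to wait for it. */
static int model_dev_stop_sync(struct model_dev *dev)
{
	if (model_queue_work(dev->wq, model_stop_work, dev))
		return -1;
	return model_flush(dev->wq);
}
```

With this shape the stop path only ever runs in queue context, so it is
serialized with the other work items and needs no flush flag.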

Roland, what do you think?

> @@ -232,9 +232,10 @@ static int ipoib_mcast_join_finish(struc
>  		ret = ipoib_mcast_attach(dev, be16_to_cpu(mcast->mcmember.mlid),
>  					 &mcast->mcmember.mgid);
>  		if (ret < 0) {
> -			ipoib_warn(priv, "couldn't attach QP to multicast group "
> -				   IPOIB_GID_FMT "\n",
> -				   IPOIB_GID_ARG(mcast->mcmember.mgid));
> +			if (ret != -ENXIO) /* No pkey found */
> +				ipoib_warn(priv, "couldn't attach QP to multicast group "
> +					   IPOIB_GID_FMT "\n",
> +					   IPOIB_GID_ARG(mcast->mcmember.mgid));
>  
>  			clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags);
>  			return ret;

I forget why we are checking for this ENXIO error - isn't it
because cache updates were out of sync with port events?
So maybe we can get rid of this now?

BTW, shouldn't there be some code testing the return code for -ESTALE
and retrying later? What am I missing?

-- 
MST