[openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add ib_multicast module to track join requests from the same port
Michael S. Tsirkin
mst at mellanox.co.il
Tue Oct 10 18:10:25 PDT 2006
Quoting r. Sean Hefty <sean.hefty at intel.com>:
> >And I'm worried about the extensive use of atomic operations
> >this patch introduces - both performance and race-condition-wise.
> >Can't we stick to simple locking? Replacing the completion with a single
> >BUSY bit looks especially scary.
>
> I reworked the locking for this 3-4 times before I came up with something that I
> felt was simple, yet could still do the job. The difficulty is that multiple
> threads can try to join the same group at once, each with different join
> parameters, while existing members leave it.
No, I was only speaking about the IPoIB part of the patch.
I didn't look at the atomic usage inside the new module yet.
> Or a user can leave a group before
> their initial join request even completes. The latter is a problem even for a
> single user, such as ipoib, and one that ipoib doesn't handle.
Sorry, I don't follow.
Could you please give a scenario where ipoib fails but the multicast
module works?
> I really don't think the performance of the code is as much an issue versus the
> time required to configure the fabric.
You might be right. But I wonder whether we'll later regret switching
to the slower generic thing when we already had a stable,
streamlined version.
Speaking of which - there seem to be some linear scans of outstanding
requests, which could be a scalability problem if there's a
large number of them, isn't that right?
> >> - mutex_lock(&mcast_mutex);
> >> + /* Clear the busy flag so we try again */
> >> + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> >>
> >> + mutex_lock(&mcast_mutex);
> >> spin_lock_irq(&priv->lock);
> >> - mcast->query = NULL;
> >> -
> >> - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
> >> - if (status == -ETIMEDOUT)
> >> - queue_work(ipoib_workqueue, &priv->mcast_task);
> >> - else
> >> - queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> >> - mcast->backoff * HZ);
> >> - } else
> >> - complete(&mcast->done);
> >> + if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> >> + queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> >> + mcast->backoff * HZ);
> >> spin_unlock_irq(&priv->lock);
> >> mutex_unlock(&mcast_mutex);
> >>
> >> - return;
> >> + return status;
> >> }
> >
> >We used to call complete() as the last thing done on the mcast object; now
> >you are touching the object after IPOIB_MCAST_FLAG_BUSY is cleared, apparently.
>
> The patch removes the need to check both mcast->query and a flag to determine
> state. It only uses the flag. Can you clarify what issue you see with the
> above code?
We used to have a completion to signal that no callbacks are running, and we
set it last thing. Now there seems to be a window after BUSY is cleared where
you are still accessing mcast->backoff. No?
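To spell out the window I mean (a sketch only - the calls are taken from
your hunk above; the racing leave path that frees the object once it sees
BUSY clear is my assumption about the other side of the race):

status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);

/*
 * Window: anyone waiting for BUSY to clear (e.g. a leave path -
 * my assumption) can run here, tear the group down and free mcast.
 */

mutex_lock(&mcast_mutex);
spin_lock_irq(&priv->lock);
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
	queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
			   mcast->backoff * HZ);	/* <- mcast may be freed */
spin_unlock_irq(&priv->lock);
mutex_unlock(&mcast_mutex);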
> >What prevents us from exiting while callbacks are in progress?
> >Basically the same applies wherever we used to call wait_for_mcast_join.
>
> ib_free_multicast() blocks while any callbacks are running.
I don't follow. ib_free_multicast seems to be called only when we
leave all mcast groups. So it seems callbacks can now run
after we do stop_thread, which I think will lead to crashes.
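To make it concrete, the ordering I'm worried about is roughly this
(a sketch; the ipoib function names are as in the current tree, and I'm
assuming the only leaves - and hence the only ib_free_multicast calls -
happen in the dev_flush path):

ipoib_mcast_stop_thread(dev, 1);	/* mcast task is stopped */

/*
 * Outstanding joins were not freed yet, so the multicast module
 * can still invoke our join callback here, touching state that
 * the stopped thread owned -> crash.
 */

ipoib_mcast_dev_flush(dev);		/* only here do we leave the
					 * groups and free the joins */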
> >I also started to wonder why we need a new API for this at all.
> >Can't the sa module be fixed to refcount the mcast joins properly for us,
> >with minor or no API changes?
>
> The difference is that this allows the free to match up with exactly one join
> request. This will be needed for userspace support. Additionally, the
> callback remains active beyond the initial join, so that the multicast module
> can notify the user when an error occurs on the multicast group that requires
> re-joining.
But I still don't understand - if everyone must use the refcounted API,
why do we need a separate module and an exported API for something
that is just an implementation detail?
I also have the following question on the API:
> +struct ib_multicast {
> + struct ib_sa_mcmember_rec rec;
> + ib_sa_comp_mask comp_mask;
> + int (*callback)(int status,
> + struct ib_multicast *multicast);
> + void *context;
> +};
Multiple distinct mcmember_rec queries could get us the same mcast group.
Say MTU > 256 and MTU > 512 look different but will actually get the
same group in practice.
Say 2 clients issue these 2 queries.
What will be in the ib_sa_mcmember_rec in this case?
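To make the question concrete (a sketch - ib_join_multicast() and its exact
signature are my guess at the module's entry point, and mgid/cb1/cb2/ctx1/ctx2
are placeholders; selector value 0 is "greater than" per the MCMemberRecord
spec):

struct ib_multicast *m1, *m2;
struct ib_sa_mcmember_rec rec = { .mgid = mgid };	/* same group MGID */
ib_sa_comp_mask mask = IB_SA_MCMEMBER_REC_MGID |
		       IB_SA_MCMEMBER_REC_MTU_SELECTOR |
		       IB_SA_MCMEMBER_REC_MTU;

/* client 1 asks for MTU > 256 */
rec.mtu_selector = 0;			/* greater than */
rec.mtu = IB_MTU_256;
m1 = ib_join_multicast(device, port_num, &rec, mask,
		       GFP_KERNEL, cb1, ctx1);

/* client 2 asks for MTU > 512 - a distinct query, but in practice
 * the SA resolves it to the very same group */
rec.mtu_selector = 0;
rec.mtu = IB_MTU_512;
m2 = ib_join_multicast(device, port_num, &rec, mask,
		       GFP_KERNEL, cb2, ctx2);

/* which parameters end up in m1->rec and m2->rec? */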
--
MST