[openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add ib_multicast module to track join requests from the same port
Michael S. Tsirkin
mst at mellanox.co.il
Tue Oct 10 18:10:25 PDT 2006
Quoting r. Sean Hefty <sean.hefty at intel.com>:
> >And I'm worried about the extensive use of atomic operations
> >this patch introduces - both performance and race-condition-wise.
> >Can't we stick to simple locking? Replacing the completion with a single
> >BUSY bit looks especially scary.
>
> I reworked the locking for this 3-4 times before I came up with something that I
> felt was simple, yet could still do the job. The difficulty is that multiple
> threads can try to join the same group at once, each with different join
> parameters, while existing members leave it.
No, I was only speaking about the IPoIB part of the patch.
I didn't look at the atomic usage inside the new module yet.
> Or a user can leave a group before
> their initial join request even completes. The latter is a problem even for a
> single user, such as ipoib, and one that ipoib doesn't handle.
Sorry, I don't follow.
Could you please give a scenario where ipoib fails but the multicast
module works?
> I really don't think the performance of the code is as much an issue versus the
> time required to configure the fabric.
You might be right. But I wonder whether we'll later regret switching
to the slower generic thing when we already had a stable,
streamlined version.
Speaking of which - there seem to be some linear scans of outstanding
requests, which could be a scalability problem if there's a
large number of them, isn't that right?
> >> - mutex_lock(&mcast_mutex);
> >> + /* Clear the busy flag so we try again */
> >> + status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> >>
> >> + mutex_lock(&mcast_mutex);
> >> spin_lock_irq(&priv->lock);
> >> - mcast->query = NULL;
> >> -
> >> - if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
> >> - if (status == -ETIMEDOUT)
> >> - queue_work(ipoib_workqueue, &priv->mcast_task);
> >> - else
> >> - queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> >> - mcast->backoff * HZ);
> >> - } else
> >> - complete(&mcast->done);
> >> + if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> >> + queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> >> + mcast->backoff * HZ);
> >> spin_unlock_irq(&priv->lock);
> >> mutex_unlock(&mcast_mutex);
> >>
> >> - return;
> >> + return status;
> >> }
> >
> >We used to call complete() as the last thing done on the mcast object; now
> >you are touching the object after IPOIB_MCAST_FLAG_BUSY is cleared, apparently.
>
> The patch removes the need to check both mcast->query and a flag to determine
> state. It only uses the flag. Can you clarify what issue you see with the
> above code?
We used to have a completion to signal that no callbacks are running, and we
set it last thing. Now there seems to be a window after BUSY is cleared where
you are still accessing mcast->backoff. No?
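To spell out the window I mean (a sketch only - the calls are taken from
your hunk above; the racing leave path that frees the object once it sees
BUSY clear is my assumption about the other side of the race):

status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);

/*
 * Window: anyone waiting for BUSY to clear (e.g. a leave path -
 * my assumption) can run here, tear the group down and free mcast.
 */

mutex_lock(&mcast_mutex);
spin_lock_irq(&priv->lock);
if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
	queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
			   mcast->backoff * HZ);	/* <- mcast may be freed */
spin_unlock_irq(&priv->lock);
mutex_unlock(&mcast_mutex);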
> >What prevents us from exiting while callbacks are in progress?
> >Basically the same applies wherever we used to call wait_for_mcast_join.
>
> ib_free_multicast() blocks while any callbacks are running.
I don't follow. ib_free_multicast seems to be called only when we
leave all mcast groups. So it seems callbacks can now run
after we do stop_thread, which I think will lead to crashes.
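To make it concrete, the ordering I'm worried about is roughly this
(a sketch; the ipoib function names are as in the current tree, and I'm
assuming the only leaves - and hence the only ib_free_multicast calls -
happen in the dev_flush path):

ipoib_mcast_stop_thread(dev, 1);	/* mcast task is stopped */

/*
 * Outstanding joins were not freed yet, so the multicast module
 * can still invoke our join callback here, touching state that
 * the stopped thread owned -> crash.
 */

ipoib_mcast_dev_flush(dev);		/* only here do we leave the
					 * groups and free the joins */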
> >I also started to wonder why we need a new API for this at all.
> >Can't the sa module be fixed to refcount the mcast joins properly for us,
> >with minor or no API changes?
>
> The difference is that this allows the free to match up with exactly one join
> request. This will be needed for userspace support. Additionally, the
> callback remains active beyond the initial join, so that the multicast module
> can notify the user when an error occurs on the multicast group that requires
> re-joining.
But I still don't understand - if everyone must use the refcounted API,
why do we need a separate module and an exported API for something
that is just an implementation detail?
I also have the following question on the API:
> +struct ib_multicast {
> + struct ib_sa_mcmember_rec rec;
> + ib_sa_comp_mask comp_mask;
> + int (*callback)(int status,
> + struct ib_multicast *multicast);
> + void *context;
> +};
Multiple distinct mcmember_rec queries could get us the same mcast group.
Say MTU > 256 and MTU > 512 look different but will actually get the
same group in practice.
Say 2 clients issue these 2 queries.
What will be in the ib_sa_mcmember_rec in this case?
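To make the question concrete (a sketch - ib_join_multicast() and its exact
signature are my guess at the module's entry point, and mgid/cb1/cb2/ctx1/ctx2
are placeholders; selector value 0 is "greater than" per the MCMemberRecord
spec):

struct ib_multicast *m1, *m2;
struct ib_sa_mcmember_rec rec = { .mgid = mgid };	/* same group MGID */
ib_sa_comp_mask mask = IB_SA_MCMEMBER_REC_MGID |
		       IB_SA_MCMEMBER_REC_MTU_SELECTOR |
		       IB_SA_MCMEMBER_REC_MTU;

/* client 1 asks for MTU > 256 */
rec.mtu_selector = 0;			/* greater than */
rec.mtu = IB_MTU_256;
m1 = ib_join_multicast(device, port_num, &rec, mask,
		       GFP_KERNEL, cb1, ctx1);

/* client 2 asks for MTU > 512 - a distinct query, but in practice
 * the SA resolves it to the very same group */
rec.mtu_selector = 0;
rec.mtu = IB_MTU_512;
m2 = ib_join_multicast(device, port_num, &rec, mask,
		       GFP_KERNEL, cb2, ctx2);

/* which parameters end up in m1->rec and m2->rec? */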
--
MST