[openib-general] [RFC] [PATCH 2/7] ib_multicast 2.6.20: add ib_multicast module to track join requests from the same port

Tue Oct 10 17:26:01 PDT 2006

>It would make some sense to keep this API separate if the mcast module would
>only be targeted at userspace users - kernel consumers likely
>do not share mcast groups, so they could avoid the overhead.
>But since this patch seems to move all kernels users to the new API to -
>why does this need to be a separate module at all?

I view the ib_sa API as simply sending the MAD to the SA and routing back the
response.  That functionality is needed, and is distinct enough, that I would
keep it separate from the multicast module.

Kernel consumers can only avoid the overhead if we never allow a userspace
application to share the same multicast group.  I don't think that we need to do
that, as long as the app has the right permissions.

The rdma_cm also uses the cached copy of the ipoib multicast group when forming
new groups.

>And I'm worried about the extensive use of atomic operations
>this patch introduces - both performance and race-condition-wise.
>Can't we stick to simple locking? Replacing completion with a single
>BUSY bit looks especially scary.

I reworked the locking for this 3-4 times before I came up with something that I
felt was simple, yet could still do the job.  The difficulty is that multiple
threads can try to join the same group at once, each with different join
parameters, while existing members leave it.  Or a user can leave a group before
their initial join request even completes.  The latter is a problem even for a
single user, such as ipoib, which it doesn't handle.

I really don't think the performance of the code is as much an issue versus the
time required to configure the fabric.

>> -	mutex_lock(&mcast_mutex);
>> +	/* Clear the busy flag so we try again */
>> +	status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>
>> +	mutex_lock(&mcast_mutex);
>>  	spin_lock_irq(&priv->lock);
>> -	mcast->query = NULL;
>> -
>> -	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
>> -		if (status == -ETIMEDOUT)
>> -			queue_work(ipoib_workqueue, &priv->mcast_task);
>> -		else
>> -			queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
>> -					   mcast->backoff * HZ);
>> -	} else
>> -		complete(&mcast->done);
>> +	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
>> +		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
>> +				   mcast->backoff * HZ);
>>  	spin_unlock_irq(&priv->lock);
>>  	mutex_unlock(&mcast_mutex);
>>
>> -	return;
>> +	return status;
>>  }
>
>We used to do complete last thing on mcast object, now you are
>touching the object after IPOIB_MCAST_FLAG_BUSY is cleared, apparently.

The patch removes the need to check both mcast->query and a flag to determine
state.  It only uses the flag.  Can you clarify what issue you see with the
above code?

>What prevents us from exiting while callbacks are in progress?
>Basically same applies wherever we used to call wait_for_mcast_join.

ib_free_multicast() blocks while any callbacks are running.

>> +struct ib_multicast *ib_join_multicast(struct ib_device *device, u8
>port_num,
>> +				       struct ib_sa_mcmember_rec *rec,
>> +				       ib_sa_comp_mask comp_mask, gfp_t
gfp_mask,
>> +				       int (*callback)(int status,
>> +						       struct ib_multicast
>> +							      *multicast),
>> +				       void *context);
>> +
>
>Is this re-introducing module unload races we had with sa all over again?

The call returns a structure that must be freed.  If the structure is freed by
returning a non-zero call to the callback, then we have the same problem that
ib_cm and rdma_cm have.  Not allowing a return value from the callback is an
easy fix for that though.

>I also started to wander why do we need a new API for this at all?
>Can't the sa module be fixed to refcount the mcast joins properly for us,
>with minor or no API changes?

The difference is that this allows the free to match up with exactly one join
request.  This will be needed for userspace support.  Additionally, the callback
remains active beyond the initial join, so that the multicast module can notify
the user when an errors occur on the multicast group that requires re-joining.

- Sean