[openib-general] RE: Failed multicast join with new multicast module

Thu Jun 8 04:03:09 PDT 2006

On Wed, 2006-06-07 at 22:48, Sean Hefty wrote:
> >I might be missing your point but UD is unreliable so the sends can be
> >dropped. The delay/retry is to make sure the join does occur,
> 
> This is different than a dropped request or reply.  In this case, the receiver
> gets a reply, but it will be a failure from the SA to join the group.

By receiver, I think you are referring to the SA requester. Yes, the SA
would reject the request with a status ERR_REQ_INSUFFICIENT_COMPONENTS.

> For example, a NonMember tries to re-join before a FullMember which would have
> created the group does.  The result is that requests that receive a reply also
> need to be retried, with the timeout dependent on some remote node in the fabric
> creating the group.

and it is unknown when such a multicast registration (to create the
group) would occur. So the proper timeout is unknown. That's why IPoIB
has a couple of different strategies for handling this depending on the
JoinState.
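
To make the retry problem concrete, here is a minimal sketch of one possible
strategy: retry a rejected join with a capped, doubling delay, since the time
at which the group gets created is unknown. This is not IPoIB's actual code.
The try_join callable, the fake SA, and the delay values are all assumptions
made up for illustration; only the status name comes from this thread.

```python
# Status name taken from the thread; everything else here is hypothetical.
ERR_REQ_INSUFFICIENT_COMPONENTS = "ERR_REQ_INSUFFICIENT_COMPONENTS"


def backoff_delays(initial=1.0, cap=16.0):
    """Yield retry delays that double each time, up to a cap (arbitrary values)."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * 2, cap)


def join_with_retry(try_join, max_attempts=8, sleep=lambda seconds: None):
    """Retry a join until the SA accepts it or the attempts run out.

    try_join is a hypothetical callable returning (ok, status).
    Returns (joined, delays_waited).
    """
    waited = []
    delays = backoff_delays()
    for attempt in range(max_attempts):
        ok, status = try_join()
        if ok:
            return True, waited
        if status != ERR_REQ_INSUFFICIENT_COMPONENTS:
            return False, waited  # some other failure: retrying will not help
        if attempt == max_attempts - 1:
            break  # no point waiting after the final attempt
        delay = next(delays)
        sleep(delay)
        waited.append(delay)
    return False, waited


def make_fake_sa(rejections_before_group_exists):
    """Simulate an SA that rejects joins until some other node creates the group."""
    remaining = [rejections_before_group_exists]

    def try_join():
        if remaining[0] > 0:
            remaining[0] -= 1
            return False, ERR_REQ_INSUFFICIENT_COMPONENTS
        return True, "OK"

    return try_join
```

The cap bounds the wait between attempts without assuming anything about when
the group will appear; a real client would also have to decide what to do once
max_attempts is exhausted.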

> >> So, the only safe thing to do is for all multicast clients to detach from all
> >> multicast groups, destroy all address handles,
> >
> >Why all groups ?
> 
> Because the SM has lost track that any groups in the fabric existed, so those
> groups must be recreated, all potentially with different mlids.

Yes, in the case of client reregister.

> >> possibly wait for a new group to be created, and then start all over again.
> >
> >Start what all over again ?
> 
> I meant attach the QP to the new group and allocate a new address handle.

Couldn't it modify the old one as an alternative strategy ?
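
For reference, the tear-down-and-recreate sequence described in the quoted
text can be sketched as a toy model. The Membership class, its fields, and the
rejoin callable are hypothetical and do not correspond to any verbs or SA
client API; modifying the existing address handle in place is not modeled.

```python
class Membership:
    """Toy record of one client's view of a multicast group (hypothetical)."""

    def __init__(self, mgid, mlid):
        self.mgid = mgid
        self.mlid = mlid
        self.attached = True
        self.address_handle = ("ah", mlid)  # stand-in for a real address handle


def handle_client_reregister(memberships, rejoin):
    """Drop all group state, then rejoin and rebuild it.

    rejoin(mgid) is a hypothetical callable returning the group's new mlid.
    Returns a log of the steps taken, in order.
    """
    log = []
    # Phase 1: detach from every group and destroy every address handle,
    # because none of the old mlids can be trusted any more.
    for m in memberships:
        m.attached = False
        m.address_handle = None
        log.append(("detach", m.mgid))
    # Phase 2: rejoin each group, attach again, and allocate a new handle
    # for whatever mlid the group has now.
    for m in memberships:
        m.mlid = rejoin(m.mgid)
        m.attached = True
        m.address_handle = ("ah", m.mlid)
        log.append(("attach", m.mgid, m.mlid))
    return log
```

Doing all detaches before any re-attach avoids a window in which a stale
handle points at an mlid that has since been reassigned to a different group.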

> This is a general comment, and not directed at anyone specific,

Don't worry. I'm not taking it personally. Just want to give you my
$0.02 worth on what I think you are saying below:

> but is this
> really the architecture and implementation that we want to aim for?  I really
> think that we need to look at solutions that don't break existing communication,
> unless the links providing that communication actually go down, even if this
> means extending the architecture.

If this comment is directed at the client reregister mechanism, you should
note that when this was brought up there was resistance to it based on
the recommendation (probably not a strong enough word for this) that SMs
be redundant in the subnet. There was a fair bit of anecdotal evidence
that this was not how they were being used at the time, but it may have
been a chicken-and-egg problem.

-- Hal

> - Sean