[openib-general] RE: Failed multicast join with new multicast module

Wed Jun 7 19:48:42 PDT 2006

>I might be missing your point but UD is unreliable so the sends can be
>dropped. The delay/retry is to make sure the join does occur,

This is different from a dropped request or reply.  In this case, the requester
does get a reply, but the reply is a failure from the SA to join the group.  For
example, a NonMember tries to re-join before the FullMember that would have
created the group has done so.  The result is that requests that receive a reply
also need to be retried, with the timeout dependent on some remote node in the
fabric creating the group.
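
To make the distinction concrete, here is a minimal Python sketch of a join loop
that retries on an explicit failure reply as well as on a lost message.  This is
a toy model, not the OpenIB multicast API: JoinResult, join_with_retry and
FakeSA are hypothetical names, and the rule that only a FullMember join creates
the group is a simplification of the case described above.

```python
import time
from enum import Enum


class JoinResult(Enum):
    OK = "ok"              # SA accepted the join
    TIMEOUT = "timeout"    # request or reply was lost (UD is unreliable)
    REJECTED = "rejected"  # SA replied, but refused (e.g. group not created yet)


def join_with_retry(send_join, max_attempts=5, delay=1.0, sleep=time.sleep):
    """Call send_join() until it returns OK or the attempts run out.

    Returns the number of attempts used, or None if the join never succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        result = send_join()
        if result is JoinResult.OK:
            return attempt
        # Both a lost message and an explicit SA failure are retried; the
        # second case only clears once some other node creates the group.
        if attempt < max_attempts:
            sleep(delay)
    return None


class FakeSA:
    """Toy stand-in for the SA: a group exists only after a FullMember joins."""

    def __init__(self):
        self.groups = set()

    def join(self, mgid, full_member):
        if full_member:
            self.groups.add(mgid)  # a FullMember join creates the group
            return JoinResult.OK
        return JoinResult.OK if mgid in self.groups else JoinResult.REJECTED
```

In this model a NonMember join keeps coming back REJECTED until some other
caller has joined as a FullMember, so how long the retry loop runs is decided by
a remote node rather than by the link.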

>> So, the only safe thing to do is for all multicast clients to detach from all
>> multicast groups, destroy all address handles,
>
>Why all groups ?

Because the SM has lost track of every group that existed in the fabric, so
those groups must be recreated, all potentially with different mlids.

>> possibly wait for a new group to be created, and then start all over again.
>
>Start what all over again ?

I meant attach the QP to the new group and allocate a new address handle.
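
As a rough illustration of that sequence (detach, destroy the address handle,
re-join, re-attach, allocate a new address handle), here is a small Python
model.  Everything in it is hypothetical: ToySM, Membership, join_group and
rebuild are made-up names, the address handle is just a tuple, and the mlid
allocation scheme (a counter that restarts with the SM) is an assumption chosen
only to show that mlids can change.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Membership:
    mgid: str
    mlid: int             # multicast LID handed out by the SM
    attached: bool        # whether the QP is attached to this group
    ah: Optional[tuple]   # stand-in for an address handle built from the mlid


class ToySM:
    """Toy SM that assigns mlids in join order and forgets them on restart."""

    def __init__(self, first_mlid=0xC000):
        self._first = first_mlid
        self.restart()

    def restart(self):
        self.groups = {}                  # all group state is lost
        self._next = count(self._first)   # mlid allocation starts over

    def join(self, mgid):
        if mgid not in self.groups:
            self.groups[mgid] = next(self._next)
        return self.groups[mgid]


def join_group(sm, mgid):
    mlid = sm.join(mgid)
    return Membership(mgid, mlid, attached=True, ah=("ah", mlid))


def rebuild(sm, memberships):
    """Tear down every membership, then re-create each one from scratch."""
    for m in memberships:
        m.attached = False    # detach the QP from the old group
        m.ah = None           # destroy the stale address handle
    for m in memberships:
        m.mlid = sm.join(m.mgid)   # may differ from the old mlid
        m.attached = True          # attach the QP to the new group
        m.ah = ("ah", m.mlid)      # allocate a new address handle
    return memberships
```

If two groups are re-joined in the opposite order after sm.restart(), they swap
mlids in this model, which is why the old address handles cannot be reused.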

This is a general comment, and not directed at anyone specific, but is this
really the architecture and implementation that we want to aim for?  I really
think that we need to look at solutions that don't break existing communication,
unless the links providing that communication actually go down, even if this
means extending the architecture.

- Sean