[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Wed Apr 23 17:05:14 PDT 2008

On Wed, 2008-04-23 at 13:38 -0700, Ira Weiny wrote:
> Hey all,
> 
> We have just started to experience a situation which I don't think is strictly
> a bug but I think could be fixed within the OFED software.
> 
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic.  The details are:
> 
>    1) Issues on a node cause a soft lockup of the node.
>    2) OpenSM does a normal light sweep.
>    3) MADs to the node time out since the node is in a "bad state".
>    4) OpenSM marks the node down and drops it from internal tables, including
>       mcast groups.
>    5) Node recovers from the soft lockup condition.
>    6) A subsequent sweep causes OpenSM to see the node and add it back to the
>       fabric.
>    7) Node is fully functional on the verbs layer but IPoIB never knew anything
>       was wrong so it does _not_ rejoin the mcast groups.  (This is different
>       from the condition where the link actually goes down.)
> 
> As far as we can see there is nothing wrong with the node.  It just went
> catatonic for a while.  Obviously this is not a good condition, however, I was
> thinking of a couple of things which could be done to "fix" the above
> situation.  I am writing here to see which solution might be best, and accepted
> by the community.  Alternatively this may have already been addressed.
> However, I don't see a bug in the bug list, nor do I find anything in the
> archive.
> 
> Solutions I can think of are:
> 
>    A) Modify OpenSM to move the node to a "questionable" state for a period of X
>       sweeps.  If after X sweeps the node still does not respond, drop it.  If
>       the node does respond, return it to its original state.
>    B) When OpenSM queries the node as if it is new on the fabric and the SMA
>       "thinks" it is not new, have the SMA detect this and notify the IPoIB
>       layer (or ULPs in general) that something has gone wrong.  The IPoIB
>       layer could then check/rejoin the group.
>    C) Put some code in IPoIB which might detect "lost cycles" and check/rejoin
>       the mcast group.
> 
> I have not worked out details for any solution.  I believe that A and B are
> "outside the spec".  However, I can see merit in A and B.
> 
> Solution A would help if MADs are lost due to reasons other than node issues.
> (Perhaps a bad link.  Although I don't know of anyone having problems like
> that.)
> 
> Solution B puts the solution closer to the original problem but I am unsure how
> the SMA would know what is going on.
> 
> Solution C is really close to the problem; however, I don't know how it would
> be done.  I do think that this would be within the specification as it really
> is the ULP's job to maintain its membership in the group.  But how would it do
> this without help from the lower layers?  (Of course it could poll for
> membership but I think that is a bad idea.)
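
For illustration, solution A's "questionable" state could be sketched roughly
as below. This is not OpenSM code: the class, the method name, and the default
of 3 sweeps for "X" are all hypothetical, chosen only to show the bookkeeping.

```python
from dataclasses import dataclass, field


@dataclass
class NodeTracker:
    """Hypothetical per-node bookkeeping for solution A (not OpenSM code)."""
    grace_sweeps: int = 3  # "X" in the proposal; the value is an assumption
    missed: dict = field(default_factory=dict)  # GUID -> consecutive missed sweeps

    def record_sweep(self, guid, responded):
        """Return 'active', 'questionable', or 'dropped' after one sweep."""
        if responded:
            # Node answered: forget any earlier timeouts and keep its state.
            self.missed.pop(guid, None)
            return "active"
        count = self.missed.get(guid, 0) + 1
        self.missed[guid] = count
        if count >= self.grace_sweeps:
            # Still silent after X sweeps: now it is really dropped.
            del self.missed[guid]
            return "dropped"
        return "questionable"
```

With this shape, a node that misses one or two sweeps and then answers again
never leaves the SM's tables, so its mcast membership is untouched.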

> Thoughts?

Having OpenSM request client reregistration (used in other places by
OpenSM) of such nodes will resolve this issue. As little or as much
policy can be built into OpenSM in determining "such" nodes to scope
down the application of this mechanism for this case.
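
One possible policy for picking "such" nodes, sketched under assumptions: the
SM keeps a set of GUIDs it has seen before (it would need this, since step 4
drops the node from its tables), and flags any known node that was absent in
the previous sweep but is present in the current one. The function and
parameter names are hypothetical; in a real SM the flagged ports would be the
ones to have the ClientReregister bit set in their PortInfo.

```python
def nodes_needing_reregistration(previous_sweep, current_sweep, known_nodes):
    """Return sorted GUIDs that were known, missed the previous sweep, and are back.

    All three arguments are sets of node GUIDs (hypothetical bookkeeping).
    Brand-new nodes are excluded: they join their groups as part of normal
    bring-up and need no reregistration request.
    """
    return sorted(
        guid
        for guid in current_sweep
        if guid in known_nodes and guid not in previous_sweep
    )
```

A node that locked up for one sweep and came back is returned here; a node
that has never been seen is not.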

-- Hal

> Ira Weiny
> Lawrence Livermore National Lab
> weiny2 at llnl.gov