[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Ira Weiny weiny2 at llnl.gov
Wed Apr 23 13:38:16 PDT 2008


Hey all,

We have just started to experience a situation which I don't think is strictly
a bug but I think could be fixed within the OFED software.

The symptom is that nodes drop out of the IPoIB mcast group after a node
temporarily goes catatonic.  The details are:

   1) Issues on a node cause a soft lockup of the node.
   2) OpenSM does a normal light sweep.
   3) MADs to the node time out since the node is in a "bad state".
   4) OpenSM marks the node down and drops it from internal tables, including
      mcast groups.
   5) Node recovers from the soft lockup condition.
   6) A subsequent sweep causes OpenSM to see the node and add it back to the
      fabric.
   7) Node is fully functional on the verbs layer but IPoIB never knew anything
      was wrong so it does _not_ rejoin the mcast groups.  (This is different
      from the condition where the link actually goes down.)
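
To make the sequence concrete, here is a toy model of steps 1-7.  It is only
a sketch: ToySubnetManager, join() and sweep() are made-up names for
illustration and do not correspond to any real OpenSM or IPoIB code.

```python
# Toy model of steps 1-7; all names here are hypothetical, not OpenSM code.

class ToySubnetManager:
    def __init__(self):
        self.fabric = set()         # nodes the SM currently considers up
        self.mcast_members = set()  # nodes joined to the IPoIB mcast group

    def join(self, node):
        """Node-initiated join, as IPoIB would do when its link comes up."""
        self.fabric.add(node)
        self.mcast_members.add(node)

    def sweep(self, responding):
        """One sweep; `responding` is the set of nodes answering MADs."""
        for node in list(self.fabric):
            if node not in responding:
                # Step 4: drop the node from all internal tables.
                self.fabric.discard(node)
                self.mcast_members.discard(node)
        # Step 6: nodes that answer again are added back to the fabric,
        # but only the node itself can rejoin the mcast group.
        self.fabric |= responding


sm = ToySubnetManager()
sm.join("node1")
sm.join("node2")
sm.sweep(responding={"node2"})           # node1 is in a soft lockup
sm.sweep(responding={"node1", "node2"})  # node1 has recovered
print(sorted(sm.fabric))         # ['node1', 'node2']
print(sorted(sm.mcast_members))  # ['node2']
```

node1 is back on the fabric but is no longer a group member, and nothing on
node1 ever told IPoIB to rejoin.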

As far as we can see there is nothing wrong with the node.  It just went
catatonic for a while.  Obviously this is not a good condition; however, I was
thinking of a couple of things which could be done to "fix" the above
situation.  I am writing here to see which solution might be best and would be
accepted by the community.  Alternatively, this may have already been addressed.
However, I don't see a bug in the bug list, nor do I find anything in the
archive.

Solutions I can think of are:

   A) Modify OpenSM to move the node to a "questionable" state for a period of X
      sweeps.  If after X sweeps the node still does not respond, drop it.  If
      the node does respond, return it to its original state.
   B) When OpenSM queries the node as if it is new on the fabric and the SMA
      "thinks" it is not new, have the SMA detect this and notify the IPoIB
      layer (or ULPs in general) that something has gone wrong.  The IPoIB
      layer could then check/rejoin the group.
   C) Put some code in IPoIB which might detect "lost cycles" and check/rejoin
      the mcast group.

I have not worked out details for any solution.  I believe that A and B are
"outside the spec".  However, I can see merit in A and B.

Solution A would help if MADs are lost due to reasons other than node issues.
(Perhaps a bad link, although I don't know of anyone having problems like
that.)
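
A minimal sketch of the bookkeeping solution A would need, assuming a per-node
counter of consecutive missed sweeps.  GraceSweepTracker and max_missed (the
"X" above) are hypothetical names, and dropping a node only once it has missed
more than X sweeps in a row is just one reading of the proposal.

```python
# Sketch of solution A; max_missed ("X") and all names are hypothetical.

class GraceSweepTracker:
    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = {}  # node -> consecutive sweeps without a response

    def add(self, node):
        self.missed[node] = 0

    def state(self, node):
        if node not in self.missed:
            return "dropped"
        return "questionable" if self.missed[node] > 0 else "active"

    def sweep(self, responding):
        """Process one sweep and return the nodes dropped by it."""
        dropped = []
        for node in list(self.missed):
            if node in responding:
                self.missed[node] = 0  # back to its original state
            else:
                self.missed[node] += 1
                if self.missed[node] > self.max_missed:
                    del self.missed[node]  # only now leave the tables
                    dropped.append(node)
        return dropped


tracker = GraceSweepTracker(max_missed=2)
tracker.add("node1")
tracker.sweep(set())                    # first missed sweep
state_during_lockup = tracker.state("node1")
tracker.sweep({"node1"})                # node recovers in time
state_after_recovery = tracker.state("node1")
dropped = []
for _ in range(3):                      # three misses in a row exceed X=2
    dropped += tracker.sweep(set())
```

While a node is "questionable" its mcast membership would be left alone, so a
node that recovers within the grace period never needs to rejoin.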

Solution B puts the fix closer to the original problem, but I am unsure how
the SMA would know what is going on.

Solution C is really close to the problem; however, I don't know how it would be
done.  I do think that this would be within the specification, as it really is
the ULP's job to maintain its membership in the group.  But how would it do
this without help from the lower layers?  (Of course it could poll for
membership, but I think that is a bad idea.)
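
One way "lost cycles" might be detected without polling the SM is a local
periodic tick that compares the time since the previous tick against the
expected interval.  The sketch below is user-space Python for illustration
only: StallDetector, interval, slack and on_stall are hypothetical names, and
whether such a gap is observable inside the kernel during a soft lockup is an
assumption I have not verified.

```python
# Sketch of "lost cycles" detection for solution C; all names are hypothetical.
import time


class StallDetector:
    def __init__(self, interval, slack, on_stall, clock=time.monotonic):
        self.interval = interval  # expected seconds between ticks
        self.slack = slack        # extra delay tolerated before we react
        self.on_stall = on_stall  # callback, e.g. "check/rejoin mcast group"
        self.clock = clock
        self.last = clock()

    def tick(self):
        """Call from a periodic timer; returns True if a stall was seen."""
        now = self.clock()
        gap = now - self.last
        self.last = now
        if gap > self.interval + self.slack:
            self.on_stall(gap)
            return True
        return False


# Demo with a fake clock: ticks at t=1 and t=2 are on time, then the
# node "goes catatonic" and the next tick only arrives at t=40.
fake_times = iter([0.0, 1.0, 2.0, 40.0])
stalls = []
detector = StallDetector(interval=1.0, slack=5.0,
                         on_stall=stalls.append,
                         clock=lambda: next(fake_times))
results = [detector.tick() for _ in range(3)]
```

The callback fires only after an unusually long gap, so no extra traffic is
sent to the SM during normal operation.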


Thoughts?
Ira Weiny
Lawrence Livermore National Lab
weiny2 at llnl.gov



