[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Ira Weiny weiny2 at llnl.gov
Thu Apr 24 14:31:25 PDT 2008


On Thu, 24 Apr 2008 16:52:07 +0300
Or Gerlitz <ogerlitz at voltaire.com> wrote:

> Ira Weiny wrote:
> > The symptom is that nodes drop out of the IPoIB mcast group after a node
> > temporarily goes catatonic.  The details are:
> >
> >    1) Issues on a node cause a soft lockup of the node.
> >    2) OpenSM does a normal light sweep.
> >    3) MADs to the node time out since the node is in a "bad state"
> >    4) OpenSM marks the node down and drops it from internal tables, including
> >       mcast groups.
> >    5) Node recovers from soft lock up condition.
> >    6) A subsequent sweep causes OpenSM to see the node and add it back to the
> >       fabric.
> As Hal noted, client reregister is the way to go.
> 
> In a similar discussion in the past, the conclusion was that the SM
> should (maybe even according to the spec, but according to common sense
> is fine as well, I think) set the re-register bit, in which case
> IPoIB rejoins and we are done. At the time, I understood that OpenSM
> would do so
> (http://lists.openfabrics.org/pipermail/general/2007-September/041237.html).
> Am I wrong, or was the case brought up on that thread (a switch/port going
> down and a whole sub-fabric being removed from the SM's point of view while
> the links remain up from the nodes' point of view) different? The
> basic point is a case where a node's link is UP and the SM lost this node
> for some time and now sees it again. We used to call it "the
> active/active" transition, and an SM may need special logic for it.
> 

I have set up the following as a test situation:

        switch B
       /       \ (link X)
   switch A   switch C
    /           /   \
 Node1      node2  node3
  (SM)

When I take link X down and re-enable it, nodes 2 and 3 do _not_ rejoin the
mcast group.

Debug output from OpenSM indicates it is setting the rereg bit, but I don't see
the rejoin in the debug output from node 2's IPoIB mcast layer.  Perhaps
there is a bug to be squashed here?
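
To make the expected behavior concrete, here is a toy Python model of the
sequence. It is only a sketch: ToySM, ToyNode, run_scenario and all of their
methods are invented for illustration and do not correspond to OpenSM or IPoIB
code. It assumes the SM drops an unreachable node from its mcast tables and
requests a reregister when it sees the node again.

```python
# Toy model only: hypothetical names, not OpenSM or IPoIB code.
class ToySM:
    def __init__(self):
        self.fabric = set()         # node names the SM currently knows about
        self.mcast_members = set()  # node names in the IPoIB mcast group

    def join(self, node):
        self.mcast_members.add(node.name)

    def sweep(self, reachable_nodes, set_rereg=True):
        reachable = {n.name for n in reachable_nodes}
        # Nodes that stopped answering are dropped from all tables.
        lost = self.fabric - reachable
        self.fabric -= lost
        self.mcast_members -= lost
        # Nodes seen again are re-added; optionally ask them to reregister.
        for node in reachable_nodes:
            if node.name not in self.fabric:
                self.fabric.add(node.name)
                if set_rereg:
                    node.on_client_reregister(self)


class ToyNode:
    def __init__(self, name, honors_rereg=True):
        self.name = name
        self.honors_rereg = honors_rereg
        self.thinks_joined = False  # the node's own view of its membership

    def initial_join(self, sm):
        sm.join(self)
        self.thinks_joined = True

    def on_client_reregister(self, sm):
        # A node that acts on the rereg request rejoins the group.
        if self.honors_rereg:
            sm.join(self)


def run_scenario(honors_rereg):
    sm = ToySM()
    node2 = ToyNode("node2", honors_rereg)
    sm.sweep([node2], set_rereg=False)  # initial discovery
    node2.initial_join(sm)
    sm.sweep([])       # link X down: node2 unreachable, dropped by the SM
    sm.sweep([node2])  # link X up again: SM re-adds node2 and requests rereg
    return "node2" in sm.mcast_members, node2.thinks_joined
```

In this model run_scenario(True) returns (True, True), while
run_scenario(False) returns (False, True): the node still believes it is
joined but the SM no longer lists it. The second case resembles the symptom
above, though the model says nothing about where the real rejoin is lost.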

Just in case anyone is curious, this is with OFED 1.2.5 on a RHEL 5.1 based
kernel, and OpenSM 3.2.1-8341058-dirty.

I am in the process of tracking this down,
Ira



