[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.
Or Gerlitz
ogerlitz at voltaire.com
Thu Apr 24 06:52:07 PDT 2008
Ira Weiny wrote:
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic. The details are:
>
> 1) Issues on a node cause a soft lockup of the node.
> 2) OpenSM does a normal light sweep.
> 3) MADs to the node time out since the node is in a "bad state"
> 4) OpenSM marks the node down and drops it from internal tables, including
> mcast groups.
> 5) Node recovers from soft lock up condition.
> 6) A subsequent sweep causes OpenSM to see the node and add it back to the
> fabric.
As Hal noted, client reregister is the way to go.
In a similar discussion in the past, the conclusion was that the SM
should set the re-register bit (maybe even according to the spec, but
common sense is reason enough, I think), in which case IPoIB rejoins
and we are done. At the time, I understood that OpenSM would do so
(http://lists.openfabrics.org/pipermail/general/2007-September/041237.html).
Am I wrong, or was the case raised in that thread (a switch/port going
down and a whole sub-fabric being removed from the SM's point of view,
while the links remain up from the nodes' point of view) different? The
basic point is the case where a node's link is UP, the SM lost the node
for some time, and now sees it again. We used to call it the
"active/active" transition, and an SM may need special logic for it.
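
To make the "active/active" case concrete, here is a minimal toy
simulation of the sequence described above. It is only a sketch, not
OpenSM or IPoIB code: the names Node, ToySM, run_scenario and
set_reregister_on_reappear are hypothetical and invented for
illustration, and the model assumes the node never rejoins on its own
because its link never went down.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    guid: str
    responsive: bool = True       # whether the node answers MADs from the SM
    thinks_joined: bool = False   # node-side view of its mcast membership

    def on_client_reregister(self, sm):
        # Hypothetical hook: on a reregister request the node rejoins its groups.
        sm.join(self.guid)
        self.thinks_joined = True


@dataclass
class ToySM:
    set_reregister_on_reappear: bool
    known: set = field(default_factory=set)
    mcast_members: set = field(default_factory=set)

    def join(self, guid):
        self.mcast_members.add(guid)

    def sweep(self, nodes):
        for node in nodes:
            if not node.responsive:
                # MADs time out: drop the node and its group membership.
                self.known.discard(node.guid)
                self.mcast_members.discard(node.guid)
            elif node.guid not in self.known:
                # Node (re)appears. Its link stayed up, so it will not
                # rejoin unless the SM asks it to reregister.
                self.known.add(node.guid)
                if self.set_reregister_on_reappear:
                    node.on_client_reregister(self)


def run_scenario(set_reregister):
    """Return True if the node is in the SM's mcast group at the end."""
    sm = ToySM(set_reregister_on_reappear=set_reregister)
    node = Node("node-a")
    sm.sweep([node])            # initial discovery
    sm.join(node.guid)          # node joins the IPoIB group
    node.thinks_joined = True
    node.responsive = False     # soft lockup: MADs time out
    sm.sweep([node])            # SM drops the node
    node.responsive = True      # node recovers; link was up throughout
    sm.sweep([node])            # SM sees the node again
    return node.guid in sm.mcast_members
```

In this toy model, run_scenario(False) leaves the node believing it is
still joined while the SM's group no longer contains it, and
run_scenario(True) restores the membership on the sweep that
rediscovers the node.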
Or.