[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.
ogerlitz at voltaire.com
Thu Apr 24 06:52:07 PDT 2008
Ira Weiny wrote:
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic. The details are:
> 1) Issues on a node cause a soft lockup of the node.
> 2) OpenSM does a normal light sweep.
> 3) MADs to the node time out since the node is in a "bad state"
> 4) OpenSM marks the node down and drops it from internal tables, including
> mcast groups.
> 5) Node recovers from soft lock up condition.
> 6) A subsequent sweep causes OpenSM to see the node and add it back to the
As Hal noted, client reregister is the way to go.
In a similar discussion in the past, the conclusion was that the SM
should set the re-register bit in this case (maybe even according to the
spec, but common sense is reason enough, I think); IPoIB then rejoins
and we are done. At the time, I understood that OpenSM would do so.

Am I wrong, or was the case brought up on that thread (a switch/port
going down and a whole sub-fabric being removed from the SM's point of
view, while the links remain up from the nodes' point of view)
different? The basic point is the case where a node's link is UP, the SM
lost the node for some time, and now sees it again. We used to call it
the "active/active" transition, and an SM may need special logic for
it.
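
To make the case concrete, here is a rough toy model of the logic I have
in mind. It is only a sketch and not OpenSM code: the ToySM class, its
fields, the sweep() and client_rejoin() helpers, and the string link
states are all made up for illustration. It assumes that a port the SM
does not know about but which already reports ACTIVE must have been
brought up earlier and then lost, since a freshly connected port would
not be ACTIVE before the SM has handled it.

```python
from dataclasses import dataclass, field


@dataclass
class ToySM:
    """Hypothetical subnet manager model; not OpenSM's real data structures."""
    known: set = field(default_factory=set)          # ports in the SM tables
    mcast_members: set = field(default_factory=set)  # ports in the mcast group

    def sweep(self, responses):
        """Process one sweep.

        `responses` maps a port GUID to the link state the SM reads back,
        or to None when the MADs to that port timed out.
        Returns the ports for which the SM would set the re-register bit.
        """
        reregister = set()
        for guid, link_state in responses.items():
            if link_state is None:
                # Node did not answer: drop it from all SM tables.
                self.known.discard(guid)
                self.mcast_members.discard(guid)
            elif guid not in self.known:
                self.known.add(guid)
                if link_state == "ACTIVE":
                    # Link never went down, yet the SM had lost the node:
                    # the "active/active" case, so ask clients to reregister.
                    reregister.add(guid)
        return reregister


def client_rejoin(sm, guid):
    """Stand-in for IPoIB rejoining its mcast group on a reregister request."""
    sm.mcast_members.add(guid)
```

Without the ACTIVE check in sweep(), the node ends up back in the SM
tables but not in the mcast group, which matches the symptom Ira
describes.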