[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Or Gerlitz ogerlitz at voltaire.com
Thu Apr 24 06:52:07 PDT 2008


Ira Weiny wrote:
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic.  The details are:
>
>    1) Issues on a node cause a soft lockup of the node.
>    2) OpenSM does a normal light sweep.
>    3) MADs to the node time out since the node is in a "bad state"
>    4) OpenSM marks the node down and drops it from internal tables, including
>       mcast groups.
>    5) Node recovers from the soft lockup condition.
>    6) A subsequent sweep causes OpenSM to see the node and add it back to the
>       fabric.
As Hal noted, client reregister is the way to go.

In a similar discussion in the past, the conclusion was that the SM 
should set the client re-register bit (maybe the spec even requires it, 
but common sense is reason enough, I think), in which case IPoIB rejoins 
and we are done. At the time, I understood that OpenSM would do so 
(http://lists.openfabrics.org/pipermail/general/2007-September/041237.html). 
Am I wrong, or was the case raised in that thread different? There, a 
switch/port went down and a whole sub-fabric was removed from the SM's 
point of view, while the links remained up from the nodes' point of view. 
The basic point is the case where a node's link is UP, the SM loses the 
node for some time, and then sees it again. We used to call this "the 
active/active" transition, and an SM may need special logic for it.
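
To make the sequence concrete, here is a toy model of the active/active 
case in Python. It is only a sketch under stated assumptions: ToySM, Node, 
sweep(), on_port_info_set() and run() are hypothetical names invented for 
illustration, and nothing here is taken from OpenSM or the IPoIB driver. 
It assumes the node rejoins only when the SM requests client reregistration.

```python
# Toy model only: none of these names come from OpenSM or the IPoIB driver.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    responsive: bool = True   # False while the node is soft-locked
    joined: bool = False      # the node's own belief about its membership

    def join(self, sm):
        sm.mcast_members.add(self.name)
        self.joined = True

    def on_port_info_set(self, sm, client_reregister):
        # Assumed client behaviour: rejoin only when the SM asks for it.
        if client_reregister:
            self.join(sm)


@dataclass
class ToySM:
    send_reregister: bool
    known: set = field(default_factory=set)
    mcast_members: set = field(default_factory=set)

    def sweep(self, nodes):
        for node in nodes:
            if not node.responsive:
                # MADs time out: drop the node and its group memberships.
                self.known.discard(node.name)
                self.mcast_members.discard(node.name)
            elif node.name not in self.known:
                # Node (re)appears. Its link never went down, so it will
                # not rejoin unless the SM requests reregistration.
                self.known.add(node.name)
                node.on_port_info_set(self, self.send_reregister)


def run(send_reregister):
    """Replay steps 1-6 and report whether the SM still lists the node."""
    sm = ToySM(send_reregister=send_reregister)
    node = Node("n1")
    sm.sweep([node])          # initial discovery
    node.join(sm)             # IPoIB joins the mcast group
    node.responsive = False   # step 1: soft lockup
    sm.sweep([node])          # steps 2-4: MADs time out, node dropped
    node.responsive = True    # step 5: node recovers
    sm.sweep([node])          # step 6: node added back to the fabric
    return node.name in sm.mcast_members
```

In this model run(False) leaves the node out of the SM's group table even 
though the node still believes it is joined, while run(True) restores the 
membership on the sweep that rediscovers the node.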

Or.



