[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.
Sasha Khapyorsky
sashak at voltaire.com
Sun Apr 27 10:11:40 PDT 2008
Hi Ira,
On 13:38 Wed 23 Apr , Ira Weiny wrote:
>
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic. The details are:
>
> 1) Issues on a node cause a soft lockup of the node.
> 2) OpenSM does a normal light sweep.
> 3) MADs to the node time out since the node is in a "bad state"
Normally OpenSM does not query nodes during a light sweep. I think OpenSM
should not detect such a soft lockup unless the IB link state changed and
a heavy sweep was triggered. Is this the case?
> 4) OpenSM marks the node down and drops it from internal tables, including
> mcast groups.
> 5) Node recovers from soft lock up condition.
> 6) A subsequent sweep causes OpenSM to see the node and add it back to the
> fabric.
> 7) Node is fully functional on the verbs layer but IPoIB never knew anything
> was wrong so it does _not_ rejoin the mcast groups. (This is different
> from the condition where the link actually goes down.)
If my reading above is correct, this should be handled the same way as port
down/up. And as was already noted in this thread, OpenSM should ask for
reregistration (by setting the client reregistration bit).
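
To make the sequence in steps 4-7 concrete, here is a toy Python model of
it. This is only a sketch and not OpenSM or IPoIB code: the classes
ToySubnetManager and ToyIPoIBPort and the function simulate are all
hypothetical names, and the whole SM/client exchange is reduced to one
boolean standing in for the client reregistration bit.

```python
from dataclasses import dataclass, field


@dataclass
class ToySubnetManager:
    """Hypothetical stand-in for the SM's tables (not OpenSM's real data)."""
    known_ports: set = field(default_factory=set)
    mcast_members: set = field(default_factory=set)

    def join(self, port: str) -> None:
        # A join is only accepted for a port the SM currently knows about.
        if port in self.known_ports:
            self.mcast_members.add(port)

    def drop_port(self, port: str) -> None:
        # Step 4: the port is marked down and removed from all tables,
        # including the mcast group.
        self.known_ports.discard(port)
        self.mcast_members.discard(port)

    def rediscover_port(self, port: str, request_reregistration: bool) -> bool:
        # Step 6: a later sweep sees the port again. The return value
        # models whether the client reregistration bit is set for it.
        self.known_ports.add(port)
        return request_reregistration


@dataclass
class ToyIPoIBPort:
    """Hypothetical client that never noticed it was dropped (step 7)."""
    name: str

    def on_rediscovery(self, sm: ToySubnetManager, client_reregister: bool) -> None:
        # The client rejoins only when the SM explicitly asks for it.
        if client_reregister:
            sm.join(self.name)


def simulate(request_reregistration: bool) -> bool:
    """Run steps 4-7 and report whether the port ends up in the group."""
    sm = ToySubnetManager()
    port = ToyIPoIBPort("node1")
    sm.known_ports.add(port.name)
    sm.join(port.name)           # initial membership
    sm.drop_port(port.name)      # step 4
    bit = sm.rediscover_port(port.name, request_reregistration)  # step 6
    port.on_rediscovery(sm, bit)  # step 7
    return port.name in sm.mcast_members
```

In this model simulate(False) leaves the port out of the group, which is
the reported symptom, while simulate(True) restores the membership.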
I see your patch. It seems this part is buggy in OpenSM now; I will look
at it more closely.
Sasha