[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Sasha Khapyorsky sashak at voltaire.com
Sun Apr 27 10:11:40 PDT 2008


Hi Ira,

On 13:38 Wed 23 Apr     , Ira Weiny wrote:
> 
> The symptom is that nodes drop out of the IPoIB mcast group after a node
> temporarily goes catatonic.  The details are:
> 
>    1) Issues on a node cause a soft lockup of the node.
>    2) OpenSM does a normal light sweep.
>    3) MADs to the node time out since the node is in a "bad state"

Normally OpenSM does not query nodes during a light sweep. I think OpenSM
should not detect such a soft lockup unless the IB link state changed and
a heavy sweep was triggered. Is this the case?
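
To make the distinction concrete, here is a toy model in Python. It is
not OpenSM code: Port, link_state_changed, responds_to_mads and the three
functions are made-up names, and the real sweep logic has more triggers
than a link state change.

```python
from dataclasses import dataclass


@dataclass
class Port:
    name: str
    link_state_changed: bool = False  # hypothetical: link went down/up since last sweep
    responds_to_mads: bool = True     # False while the host is soft-locked


def light_sweep(ports):
    """Look only for change indications; end nodes are not queried here.

    Returns True when a heavy sweep should follow.
    """
    return any(p.link_state_changed for p in ports)


def heavy_sweep(ports):
    """Query every port; return the names that timed out and would be dropped."""
    return [p.name for p in ports if not p.responds_to_mads]


def sweep(ports):
    """One sweep cycle: light sweep first, heavy sweep only if something changed."""
    if light_sweep(ports):
        return heavy_sweep(ports)
    return []
```

In this model a soft-locked node whose link stays up goes unnoticed by
sweep() until some link state change on the fabric triggers a heavy sweep.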

>    4) OpenSM marks the node down and drops it from internal tables, including
>       mcast groups.
>    5) Node recovers from soft lock up condition.
>    6) A subsequent sweep causes OpenSM to see the node and add it back to the
>       fabric.
>    7) Node is fully functional on the verbs layer but IPoIB never knew anything
>       was wrong so it does _not_ rejoin the mcast groups.  (This is different
>       from the condition where the link actually goes down.)

If my assumption above is correct, this should be the same as port down/up
handling. And as was already noted in this thread, OpenSM should ask
for reregistration (by setting the client reregistration bit).
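
A rough sketch of why the bit matters, again as a toy Python model and not
the OpenSM or IPoIB implementation (ToySM, host_port_event and lockup_cycle
are hypothetical names, and the host's reaction is reduced to a single
rejoin):

```python
class ToySM:
    """Minimal stand-in for the SM's view of one multicast group."""

    def __init__(self):
        self.mcast_members = set()

    def join(self, host):
        self.mcast_members.add(host)

    def drop(self, host):
        # Node timed out during a heavy sweep: forget its membership.
        self.mcast_members.discard(host)

    def rediscover(self, host, set_rereg_bit):
        # The SM re-adds the node to the fabric. The return value models
        # whether the host sees the client reregistration bit set.
        return set_rereg_bit


def host_port_event(sm, host, rereg_bit_seen):
    # The host only rejoins if it is told to reregister; it never saw
    # its own link go down, so nothing else prompts a rejoin.
    if rereg_bit_seen:
        sm.join(host)


def lockup_cycle(set_rereg_bit):
    """Run join -> drop -> rediscover; return True if the host ends up a member."""
    sm = ToySM()
    sm.join("node-a")
    sm.drop("node-a")
    seen = sm.rediscover("node-a", set_rereg_bit)
    host_port_event(sm, "node-a", seen)
    return "node-a" in sm.mcast_members
```

Here lockup_cycle(False) leaves the node out of the group, which matches
the reported symptom, while lockup_cycle(True) restores its membership.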

I saw your patch. It seems this part is buggy in OpenSM now; I will look
at it more closely.

Sasha


