[ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.

Ira Weiny weiny2 at llnl.gov
Thu Apr 24 14:35:55 PDT 2008


On Thu, 24 Apr 2008 12:07:03 -0700
Hal Rosenstock <hrosenstock at xsigo.com> wrote:

> On Thu, 2008-04-24 at 09:57 -0700, Ira Weiny wrote:
> 
> > > One side comment on the non OpenSM aspect of this: 
> > > 
> > > Why is the node temporarily unavailable ? There is a "contract" that the
> > > node makes with the SM that it clearly isn't honoring. Is any
> > > investigation going on relative to this aspect of the issue ?
> > > 
> > 
> > Yes, we are working on finding the root cause.  I agree that the "contract" is
> > not being honored.  This is one of the reasons I was hesitant to implement any
> > fix to be submitted. 
> 
> I think the two issues can be tackled in parallel.
> 
> > I don't think this is truly a bug in the stack.
> 
> Any ideas on what it is ? If not, would you be willing to try something
> assuming the end node issue is easily reproducible ?

The root cause has something to do with a user's job causing this "soft lockup"
in the kernel.  We believe the job sometimes runs the node (diskless/no swap)
out of memory.  Under the OOM condition I don't think the node can be trusted.
Unfortunately, this is another case where we can't seem to reproduce the issue
without the user's job.  :-(

As I said in a previous email, I was excited that Or mentioned a possible way
to simulate this condition on the IB side.  I have set that up and see some
issues there.  I will see what I can find.

> 
> > However, I could see this causing issues for people[*] and it might be nice to
> > have a "fix".
> 
> Sure; both are issues which should be understood better and fixed IMO.

Agreed.  I have spoken with our other developer and he is still trying to get
a reproducer.

Ira



