[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master

Hal Rosenstock halr at voltaire.com
Tue May 22 17:01:10 PDT 2007


On Tue, 2007-05-22 at 20:01, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >The one I see that might be related is the following:
> >
> >commit 39798695b4bcc7b145f8910ca56195808d3a7637
> >Author: Roland Dreier <rolandd at cisco.com>
> >Date:   Mon Nov 13 09:38:07 2006 -0800
> >
> >    IB/mad: Fix race between cancel and receive completion
> >    
> >    When ib_cancel_mad() is called, it puts the canceled send on a list
> >    and schedules a "flushed" callback from process context.  However,
> >    this leaves a window where a receive completion could be processed
> >    before the send is fully flushed.
> >    
> >    This is fine, except that ib_find_send_mad() will find the MAD and
> >    return it to the receive processing, which results in the sender
> >    getting both a successful receive and a "flushed" send completion for
> >    the same request.  Understandably, this confuses the sender, which is
> >    expecting only one of these two callbacks, and leads to grief such as
> >    a use-after-free in IPoIB.
> >    
> >    Fix this by changing ib_find_send_mad() to return a send struct only
> >    if the status is still successful (and not "flushed").  The search of
> >    the send_list already had this check, so this patch just adds the same
> >    check to the search of the wait_list.
> >    
> >    Signed-off-by: Roland Dreier <rolandd at cisco.com>
> >
> >My search was not exhaustive.
> >  
> >
>   It looks like this may be the fix for the MAD send errors.

Perhaps.
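
To make the race in that commit message concrete, here is a toy Python
model of it. This is not the kernel code: the Agent and SendMad classes
and the two helper functions are made up for illustration, and only the
idea (do not hand a canceled send back to receive processing) comes from
the commit text quoted above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Status values are stand-ins for the completion statuses the commit mentions.
SUCCESS = "success"
FLUSHED = "flushed"


@dataclass
class SendMad:
    tid: int                  # transaction ID used to match a response
    status: str = SUCCESS


@dataclass
class Agent:
    send_list: List[SendMad] = field(default_factory=list)
    wait_list: List[SendMad] = field(default_factory=list)


def cancel_mad(agent: Agent, tid: int) -> None:
    """Mark a pending send as flushed; the flush callback runs later."""
    for wr in agent.send_list + agent.wait_list:
        if wr.tid == tid:
            wr.status = FLUSHED


def find_send_mad(agent: Agent, tid: int) -> Optional[SendMad]:
    """Match a received response to a send, skipping canceled sends.

    The status check applies to both lists, which is the behaviour the
    commit describes after the fix.
    """
    for wr in agent.wait_list + agent.send_list:
        if wr.tid == tid:
            return wr if wr.status == SUCCESS else None
    return None
```

With the status check in place, a response that arrives after the cancel
no longer matches the send, so the sender gets only the "flushed"
completion and not a receive as well.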

>  Do you 
> think this is the cause of opensm not grabbing the mastership from the 
> other ?

Unlikely, but I don't know for sure.

> >Are they incrementing ? Which node is this ? I think some of them would
> >increment on node reboot.
> >  
> >
>   Looks like some counters (Symbol errors, link downed) have reached 
> their ceiling.

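You should replace the cable and see whether the symbol errors stop
incrementing. You may need to clear the counters first with perfquery -R,
since a counter that is already at its ceiling cannot show new errors.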

I think Link downed will increment when the node reboots.
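
A rough way to tell which counters are pegged is to compare each value
against the maximum its field can hold. The sketch below assumes the
field widths I recall from the PortCounters attribute (16 bits for symbol
errors and VL15 drops, 8 bits for link downed); please check them against
the spec. The counter names and the sample values are placeholders, not
taken from your output.

```python
# Assumed field widths in bits; verify against the PortCounters definition.
COUNTER_BITS = {
    "SymbolErrorCounter": 16,
    "LinkDownedCounter": 8,
    "VL15Dropped": 16,
}


def saturated(counters):
    """Return the names of counters that sit at their maximum value."""
    return sorted(
        name
        for name, value in counters.items()
        if name in COUNTER_BITS and value >= (1 << COUNTER_BITS[name]) - 1
    )


# Hypothetical sample values, for illustration only.
sample = {"SymbolErrorCounter": 65535, "LinkDownedCounter": 255, "VL15Dropped": 0}
print(saturated(sample))
```

A counter that shows up in that list tells you nothing about the current
error rate until it has been reset and read again after some interval.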

> This output was captured on node vortex3l-83, the one that runs opensm.
> Do you want the perfquery output before and after some time interval ?

I'm interested in the VL15 drops, to make sure that is not happening.

-- Hal

>  VBabu



