[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master

Hal Rosenstock halr at voltaire.com
Tue May 22 03:53:02 PDT 2007


On Tue, 2007-05-22 at 02:31, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
> 
> >
> >Can you at least use OFED 1.2 management (OpenSM and management
> >libraries) with the rest being OFED 1.1 ?
> >  
> >
>  Are these backward compatible ?

Yes, the user_mad kernel module has been at ABI version 5 for quite some
time now.

> >There are a number of bugs which have been fixed which might affect
> >this. The one I can think of off the top of my head is a fix to atomics
> >in OpenSM's complib. I think that was found and fixed post OFED 1.1.
> >I'll confirm this tomorrow.

The atomic fix was in OpenSM 2.0.5 but there are numerous other fixes
(see OpenSM release notes for OFED 1.2).

> >There may also be some important kernel differences (in user_mad.c or
> >mad.c) which might be relevant.
> >  
> >
>   It would be great if you could find these particular patches, so that we
> could apply them onto OFED 1.1 instead of migrating to OFED 1.2.

The one I see that might be related is the following:

commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier <rolandd at cisco.com>
Date:   Mon Nov 13 09:38:07 2006 -0800

    IB/mad: Fix race between cancel and receive completion
    
    When ib_cancel_mad() is called, it puts the canceled send on a list
    and schedules a "flushed" callback from process context.  However,
    this leaves a window where a receive completion could be processed
    before the send is fully flushed.
    
    This is fine, except that ib_find_send_mad() will find the MAD and
    return it to the receive processing, which results in the sender
    getting both a successful receive and a "flushed" send completion for
    the same request.  Understandably, this confuses the sender, which is
    expecting only one of these two callbacks, and leads to grief such as
    a use-after-free in IPoIB.
    
    Fix this by changing ib_find_send_mad() to return a send struct only
    if the status is still successful (and not "flushed").  The search of
    the send_list already had this check, so this patch just adds the same
    check to the search of the wait_list.
    
    Signed-off-by: Roland Dreier <rolandd at cisco.com>

My search was not exhaustive.
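
To make the logic of that fix concrete, here is a toy model of the lookup
the commit message describes. It is only a sketch, not the kernel code:
SendRecord, find_send and the string status values are hypothetical names
made up for illustration, and matching is reduced to the transaction ID
alone, whereas the real ib_find_send_mad() compares more than that. The
point it shows is that a canceled ("flushed") send found on the wait_list
is no longer handed back to the receive path.

```python
from dataclasses import dataclass

SUCCESS = "success"   # stand-in for a send that is still active
FLUSHED = "flushed"   # stand-in for a send that has been canceled


@dataclass
class SendRecord:
    """Hypothetical stand-in for a pending send work request."""
    tid: int
    status: str = SUCCESS


def find_send(wait_list, send_list, tid):
    """Return the matching send only if it has not been canceled."""
    # The check on the wait_list is what the patch adds.
    for wr in wait_list:
        if wr.tid == tid:
            return wr if wr.status == SUCCESS else None
    # The send_list search already had the same check.
    for wr in send_list:
        if wr.tid == tid:
            return wr if wr.status == SUCCESS else None
    return None


if __name__ == "__main__":
    canceled = SendRecord(tid=7, status=FLUSHED)
    active = SendRecord(tid=8)
    print(find_send([canceled, active], [], 7))
    print(find_send([canceled, active], [], 8))
```

With the check in place, a response that arrives for a canceled request
finds no matching send, so the sender sees only the "flushed" completion.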

>   By the way, when is the production-quality OFED 1.2 supposed to be
> released?

It was supposed to be released already, but we are closing in on rc4 (May
30), with the release to follow shortly thereafter (1-2 weeks).

> >I was referring to using perfquery, not ibnetdiscover.
> >  
> >
>  I don't have that output right now. But I found that all other error 
> counters were zero except port_xmit_discards.

It would be useful to collect these again after the problem occurs, to be
sure.

> >>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> >>skipping port
> >>    
> >>
> >
> >Was this node rebooting while you did this or is there some other issue
> >?
> >  
> >
>   Yes, it is quite possible that node was being rebooted.
> 
> >
> >So run these (before and after):
> >perfquery 12 18
> >perfquery 12 11
> >perfquery 12 10
> >perfquery 12 19
> >
> >and
> >
> >perfquery 12 9
> >  
> >
>   Unfortunately the systems got rebooted and the issue is lost. I was able
> to collect the perfquery output. It looks like it is now seeing some errors.

Are they incrementing? Which node is this? I think some of them would
increment on a node reboot.
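
One way to answer the "incrementing" question is to diff two perfquery
samples per port. The following is a rough sketch, assuming output in the
format quoted in this thread. The function names are hypothetical, and the
counter widths in COUNTER_BITS are my reading of the PortCounters
attribute, so treat them as an assumption. A counter already at its
maximum value (e.g. SymbolErrors at 65535) cannot show further increments,
so those are flagged separately. The AFTER sample is made up purely for
illustration.

```python
import re

# Assumed counter widths in bits (an assumption, check against the spec).
COUNTER_BITS = {
    "SymbolErrors": 16, "LinkRecovers": 8, "LinkDowned": 8, "RcvErrors": 16,
    "RcvRemotePhysErrors": 16, "RcvSwRelayErrors": 16, "XmtDiscards": 16,
    "XmtConstraintErrors": 8, "RcvConstraintErrors": 8,
    "LinkIntegrityErrors": 4, "ExcBufOverrunErrors": 4, "VL15Dropped": 16,
    "XmtBytes": 32, "RcvBytes": 32, "XmtPkts": 32, "RcvPkts": 32,
}

# Accepts optional "> " quote prefixes, then "Name:....<decimal value>".
LINE_RE = re.compile(r"^\s*(?:>\s*)*(\w+):\.+(\d+)\s*$")


def parse_perfquery(text):
    """Return {counter name: value} for the known counters of one port."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(1) in COUNTER_BITS:
            counters[match.group(1)] = int(match.group(2))
    return counters


def saturated(counters):
    """Names of counters sitting at the maximum value for their width."""
    return sorted(name for name, value in counters.items()
                  if value == (1 << COUNTER_BITS[name]) - 1)


def increments(before, after):
    """Counters that grew between two samples, with the amount of growth."""
    return {name: after[name] - before[name]
            for name in before
            if name in after and after[name] > before[name]}


# Subset of the port 9 output from this thread.
BEFORE = """\
# Port counters: Lid 12 port 9
PortSelect:......................9
CounterSelect:...................0x0100
SymbolErrors:....................65535
LinkRecovers:....................2
LinkDowned:......................255
XmtDiscards:.....................4918
RcvBytes:........................4294967295
"""

# Hypothetical later sample, invented for illustration only.
AFTER = BEFORE.replace("XmtDiscards:.....................4918",
                       "XmtDiscards:.....................4920")

if __name__ == "__main__":
    before, after = parse_perfquery(BEFORE), parse_perfquery(AFTER)
    print("saturated:", saturated(before))
    print("incremented:", increments(before, after))
```

Counters reported as saturated would need to be reset before a later
sample can say anything about them.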

-- Hal

> [root@vortex3l-83 ~]# perfquery 12 9
> # Port counters: Lid 12 port 9
> PortSelect:......................9
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................2
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................41484
> XmtDiscards:.....................4918
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................1
> XmtBytes:........................2050081143
> RcvBytes:........................4294967295
> XmtPkts:.........................14539343
> RcvPkts:.........................37028545
> [root@vortex3l-83 ~]# perfquery 12 10
> # Port counters: Lid 12 port 10
> PortSelect:......................10
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................27
> LinkDowned:......................255
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................19936
> XmtDiscards:.....................5192
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................1739931538
> RcvPkts:.........................1794380558
> [root@vortex3l-83 ~]# perfquery 12 11
> # Port counters: Lid 12 port 11
> PortSelect:......................11
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................0
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................8963
> XmtDiscards:.....................5636
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................2375935494
> RcvPkts:.........................2714377528
> [root@vortex3l-83 ~]# perfquery 12 18
> # Port counters: Lid 12 port 18
> PortSelect:......................18
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................24
> LinkDowned:......................220
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................23628
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................604709394
> RcvPkts:.........................448409077
> [root@vortex3l-83 ~]# perfquery 12 19
> # Port counters: Lid 12 port 19
> PortSelect:......................19
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................21
> LinkDowned:......................247
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................37754
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................3958092428
> RcvPkts:.........................3679343076
> [root@vortex3l-83 ~]#
> 
>   -VBabu



