[ofa-general] Both opensm's are in SMINFO_STANDBY and none of them claims master
Hal Rosenstock
halr at voltaire.com
Tue May 22 03:53:02 PDT 2007
On Tue, 2007-05-22 at 02:31, Venkatesh Babu wrote:
> Hal Rosenstock wrote:
>
> >
> >Can you at least use OFED 1.2 management (OpenSM and management
> >libraries) with the rest being OFED 1.1 ?
> >
> >
> Are these backward compatible ?
Yes, the user_mad kernel module has been at ABI version 5 for quite some
time now.
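
For anyone who wants to confirm this on an OFED 1.1 node before mixing in the 1.2 userspace, here is a minimal Python sketch. It assumes the ABI version is exposed at /sys/class/infiniband_mad/abi_version (the path is an assumption about the sysfs layout on your kernel; adjust it if yours differs). The helper names are made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the user_mad ABI version; adjust if your kernel differs.
ABI_PATH = Path("/sys/class/infiniband_mad/abi_version")
EXPECTED_ABI = 5  # the version quoted in this message


def read_umad_abi(path=ABI_PATH):
    """Return the user_mad ABI version as an int, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def abi_matches(path=ABI_PATH, expected=EXPECTED_ABI):
    """True only when the file is readable and holds the expected version."""
    return read_umad_abi(path) == expected


if __name__ == "__main__":
    version = read_umad_abi()
    if version is None:
        print("user_mad ABI version not readable (is ib_umad loaded?)")
    else:
        print(f"user_mad ABI version: {version}")
```

If both the OFED 1.1 kernel and the OFED 1.2 libraries agree on the same number, the management userspace should be able to talk to the older kernel module.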
> >There are a number of bugs which have been fixed which might affect
> >this. The one I can think of off the top of my head is a fix to atomics
> >in OpenSM's complib. I think that was found and fixed post OFED 1.1.
> >I'll confirm this tomorrow.
The atomic fix was in OpenSM 2.0.5 but there are numerous other fixes
(see OpenSM release notes for OFED 1.2).
> >There may also be some important kernel differences (in user_mad.c or
> >mad.c) which might be relevant.
> >
> >
> It would be great if you can find these particular patches, we could
> apply these onto OFED 1.1
> instead of migrating to OFED 1.2.
The one I see that might be related is the following:
commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier <rolandd at cisco.com>
Date:   Mon Nov 13 09:38:07 2006 -0800

    IB/mad: Fix race between cancel and receive completion

    When ib_cancel_mad() is called, it puts the canceled send on a list
    and schedules a "flushed" callback from process context. However,
    this leaves a window where a receive completion could be processed
    before the send is fully flushed.

    This is fine, except that ib_find_send_mad() will find the MAD and
    return it to the receive processing, which results in the sender
    getting both a successful receive and a "flushed" send completion for
    the same request. Understandably, this confuses the sender, which is
    expecting only one of these two callbacks, and leads to grief such as
    a use-after-free in IPoIB.

    Fix this by changing ib_find_send_mad() to return a send struct only
    if the status is still successful (and not "flushed"). The search of
    the send_list already had this check, so this patch just adds the same
    check to the search of the wait_list.

    Signed-off-by: Roland Dreier <rolandd at cisco.com>

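
To make the race easier to picture, here is a toy Python model of the lookup the commit message describes. It is not the kernel code: SendRequest, find_send and the string statuses are hypothetical stand-ins, and the only thing it tries to show is that a canceled ("flushed") entry on the wait list must not be handed back to receive processing.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical status values standing in for the kernel's completion status.
SUCCESS = "success"
FLUSHED = "flushed"


@dataclass
class SendRequest:
    tid: int                # transaction ID used to match a response to its request
    status: str = SUCCESS   # set to FLUSHED once the send has been canceled


def find_send(tid: int,
              send_list: List[SendRequest],
              wait_list: List[SendRequest]) -> Optional[SendRequest]:
    """Return the matching send only if it has not been canceled."""
    for wr in wait_list:
        if wr.tid == tid:
            # The check the patch adds: a flushed entry is treated as not found.
            return wr if wr.status == SUCCESS else None
    for wr in send_list:
        # Per the commit message, this search already had the status check.
        if wr.tid == tid and wr.status == SUCCESS:
            return wr
    return None
```

With the status check in place, a response arriving for a canceled request finds no matching send, so the sender sees only the "flushed" completion and not a successful receive as well.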
My search was not exhaustive.
> By the way, when is the production-quality OFED 1.2 supposed to be
> released ?
It was supposed to be released already but we are closing in on rc4 (May
30) with the release to follow shortly thereafter (1-2 weeks).
> >I was referring to using perfquery, not ibnetdiscover.
> >
> >
> I don't have that output right now. But I found that all other error
> counters were zero except port_xmit_discards.
It would be useful to get these to be sure after the problem occurs.
> >>ibwarn: [5895] handle_port: NodeInfo on DR path [0][1][9] port 9 failed,
> >>skipping port
> >>
> >>
> >
> >Was this node rebooting while you did this or is there some other issue
> >?
> >
> >
> Yes, it is quite possible that node was being rebooted.
>
> >
> >So run these (before and after):
> >perfquery 12 18
> >perfquery 12 11
> >perfquery 12 10
> >perfquery 12 19
> >
> >and
> >
> >perfquery 12 9
> >
> >
> Unfortunately, the systems got rebooted and the issue is lost. I was able
> to collect the perfquery output. It looks like it is now seeing some errors.
Are they incrementing ? Which node is this ? I think some of them would
increment on node reboot.
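
One way to answer the "incrementing" question is to save two perfquery runs and diff them. The Python sketch below parses the "Name:....value" lines (with or without the "> " quote prefix) and reports which counters changed. It also flags counters sitting at 255, 65535 or 4294967295. Treating those values as saturated 8-, 16- and 32-bit counters is a heuristic assumption on my part, so a counter that merely happens to equal one of them would be flagged too. The function names are made up for illustration.

```python
import re

# Heuristic assumption: a counter at one of these maxima is pinned and cannot increment.
SATURATION_VALUES = {255, 65535, 4294967295}

# Fields that identify the query rather than count events.
SKIP = {"PortSelect", "CounterSelect"}

# Matches e.g. "SymbolErrors:....................65535" or "CounterSelect:....0x0100".
LINE_RE = re.compile(r"(\w+):\.+(0[xX][0-9a-fA-F]+|\d+)\s*$")


def parse_perfquery(text):
    """Parse perfquery output into a {counter_name: int} dict."""
    counters = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            name, raw = m.groups()
            counters[name] = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
    return counters


def diff_counters(before, after):
    """Return {name: delta} for every counter whose value changed."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and name not in SKIP and after[name] != before[name]
    }


def saturated(counters):
    """Return the sorted names of counters that look pinned at their maximum."""
    return sorted(
        name for name, value in counters.items()
        if name not in SKIP and value in SATURATION_VALUES
    )
```

A counter that is already pinned at its maximum will show a delta of zero no matter what the link is doing, so those need to be cleared before a before/after comparison tells you anything.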
-- Hal
> [root at vortex3l-83 ~]# perfquery 12 9
> # Port counters: Lid 12 port 9
> PortSelect:......................9
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................2
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................41484
> XmtDiscards:.....................4918
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................1
> XmtBytes:........................2050081143
> RcvBytes:........................4294967295
> XmtPkts:.........................14539343
> RcvPkts:.........................37028545
> [root at vortex3l-83 ~]# perfquery 12 10
> # Port counters: Lid 12 port 10
> PortSelect:......................10
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................27
> LinkDowned:......................255
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................19936
> XmtDiscards:.....................5192
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................1739931538
> RcvPkts:.........................1794380558
> [root at vortex3l-83 ~]# perfquery 12 11
> # Port counters: Lid 12 port 11
> PortSelect:......................11
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................0
> LinkDowned:......................255
> RcvErrors:.......................1
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................8963
> XmtDiscards:.....................5636
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................2375935494
> RcvPkts:.........................2714377528
> [root at vortex3l-83 ~]# perfquery 12 18
> # Port counters: Lid 12 port 18
> PortSelect:......................18
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................24
> LinkDowned:......................220
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................23628
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................604709394
> RcvPkts:.........................448409077
> [root at vortex3l-83 ~]# perfquery 12 19
> # Port counters: Lid 12 port 19
> PortSelect:......................19
> CounterSelect:...................0x0100
> SymbolErrors:....................65535
> LinkRecovers:....................21
> LinkDowned:......................247
> RcvErrors:.......................0
> RcvRemotePhysErrors:.............0
> RcvSwRelayErrors:................65535
> XmtDiscards:.....................37754
> XmtConstraintErrors:.............0
> RcvConstraintErrors:.............0
> LinkIntegrityErrors:.............0
> ExcBufOverrunErrors:.............0
> VL15Dropped:.....................0
> XmtBytes:........................4294967295
> RcvBytes:........................4294967295
> XmtPkts:.........................3958092428
> RcvPkts:.........................3679343076
> [root at vortex3l-83 ~]#
>
> -VBabu