[ofa-general] smpquery regression in 1.3-rc1

Hal Rosenstock hrosenstock at xsigo.com
Thu Dec 20 08:49:26 PST 2007


On Thu, 2007-12-20 at 17:43 +0200, Yevgeny Kliteynik wrote:
> Hal Rosenstock wrote:
> > On Thu, 2007-12-20 at 13:42 +0200, Yevgeny Kliteynik wrote:
> >> Hal Rosenstock wrote:
> >>> On Wed, 2007-12-19 at 11:58 -0800, akepner at sgi.com wrote:
> >>>> We're seeing a regression in smpquery from alpha2 to rc1. 
> >>>>
> >>>> For example, with alpha2 I get:
> >>>> grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
> >>>> # Node info: Lid 3
> >>>> BaseVers:........................1
> >>>> ClassVers:.......................1
> >>>> NodeType:........................Channel Adapter
> >>>> NumPorts:........................2
> >>>> SystemGuid:......................0x00066a009800737c
> >>>> Guid:............................0x00066a009800737c
> >>>> PortGuid:........................0x00066a01a000737c
> >>>> PartCap:.........................64
> >>>> DevId:...........................0x6278
> >>>> Revision:........................0x000000a0
> >>>> LocalPort:.......................2
> >>>> VendorId:........................0x00066a
> >>>> grommit:~ # 
> >>>>
> >>>>
> >>>> And with rc1, I get:
> >>>> grommit:~ # smpquery -G nodeinfo 0x66a01a000737c
> >>>> ibwarn: [5650] ib_path_query: sa call path_query failed
> >>>> smpquery: iberror: failed: can't resolve destination port 0x66a01a000737c
> >>>> grommit:~ #  
> >>>>
> >>>> But using a LID works fine:
> >>>> grommit:~ # smpquery nodeinfo 3
> >>>> # Node info: Lid 3
> >>>> BaseVers:........................1
> >>>> ClassVers:.......................1
> >>>> NodeType:........................Channel Adapter
> >>>> NumPorts:........................2
> >>>> SystemGuid:......................0x00066a009800737c
> >>>> Guid:............................0x00066a009800737c
> >>>> PortGuid:........................0x00066a01a000737c
> >>>> PartCap:.........................64
> >>>> DevId:...........................0x6278
> >>>> Revision:........................0x000000a0
> >>>> LocalPort:.......................2
> >>>> VendorId:........................0x00066a
> >>>> grommit:~ # 
> >>>>
> >>>> Strangest of all, running it under strace also works:
> >>>> grommit:~ # strace smpquery -G nodeinfo 0x66a01a000737c > /tmp/smpquery.out 
> >>>> .....
> >>>> grommit:~ # cat /tmp/smpquery.out
> >>>> # Node info: Lid 3
> >>>> BaseVers:........................1
> >>>> ClassVers:.......................1
> >>>> NodeType:........................Channel Adapter
> >>>> NumPorts:........................2
> >>>> SystemGuid:......................0x00066a009800737c
> >>>> Guid:............................0x00066a009800737c
> >>>> PortGuid:........................0x00066a01a000737c
> >>>> PartCap:.........................64
> >>>> DevId:...........................0x6278
> >>>> Revision:........................0x000000a0
> >>>> LocalPort:.......................2
> >>>> VendorId:........................0x00066a
> >>>> grommit:~ #
> >>>>
> >>>> Some weird race condition...
> >>>>
> >>>> Anyone else seeing the same?
> >>> -G requires an SA path record lookup, so this could be that lookup
> >>> timing out in some cases (assuming the port is active and the SM is
> >>> operational).
> >> I'm seeing the same problem.
> >> Sometimes the query works, and sometimes it doesn't.
> >> I also see that when the query fails, OpenSM doesn't get a PathRecord query at all.
> >>
> >> Hal, can you elaborate on "that timing out in some cases" issue?
> > 
> > I just meant that the SM not responding (for an unknown reason right
> > now) would yield this effect.
> > 
> >> Adding Jack for the libibmad issue:
> >>
> >> I see that the ib_path_query() in libibmad/sa.c sometimes fails
> >> when calling safe_sa_call().
> > 
> > This could just be more detail on the same thing in terms of the
> > (smpquery) client which is layered on top of libibmad: the SA path query
> > timeout.
> > I would suggest running OpenSM in verbose mode (both instances are with
> > OpenSM) and seeing if it responds to the PathRecord query used by this
> > form of smpquery and continue troubleshooting from there based on the
> > result.
> 
> This is actually what I was saying here.
> I have *debugged* smpquery, and saw that the failing function is
> ib_path_query() in libibmad/sa.c
> As I've mentioned, I did run it with OpenSM in verbose mode, and saw
> that when smpquery fails, the OpenSM log does not have any PathRecord request.
> When smpquery passes, I see the PathRecord request and response in the
> OpenSM log.

OK; that wasn't clear before but is now: the failure appears to be a
client issue, not an SM issue :-) FWIW, I don't know what has changed that
would affect this, so it could be a latent bug rather than a
regression.
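
Since the failure is intermittent and goes away under strace, it may help
to put a number on how often it happens before and after any change. Below
is a minimal sketch, not something taken from this thread: count_failures
is a hypothetical helper, and it assumes smpquery exits with a non-zero
status when it prints the iberror message above, which has not been
confirmed here.

```python
import subprocess


def count_failures(cmd, runs=20):
    """Run cmd `runs` times and return how many runs exited non-zero.

    Assumption: the tool signals failure through its exit status.
    Output is discarded; only the return code is inspected.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures
```

For the case reported here, the call would be
count_failures(["smpquery", "-G", "nodeinfo", "0x66a01a000737c"]), run
once against alpha2 and once against rc1 so the two counts can be compared.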

-- Hal

> -- Yevgeny
> 
> > -- Hal
> > 
> >> -- Yevgeny
> >>
> >>> -- Hal
> > 
> 


