[Users] Event handling/notification in opensm

Ira Weiny weiny2 at llnl.gov
Thu Aug 16 09:22:31 PDT 2012


On Wed, 15 Aug 2012 13:34:59 -0600
Lloyd Brown <lloyd_brown at byu.edu> wrote:

> Hi, Ira.
> 
> Actually, it's much more likely that I misheard, than that you misspoke.
>  My understanding of the specs is fairly limited; I've slogged my way
> through a few small sections, and that's about it.
> 
> What I'm really trying to do is to capture and report on instances where
> there is likely an upcoming hardware failure.  For example, I've been
> told in the past that SymbolError counter increasing more than just a
> little (for some definition of "a little"), is probably indicative of a
> failing cable.
> 
> Right now I have something I hacked together that wraps around
> ibqueryerrors, and I run it on a cron.  Mostly I'm just trying to see if
> there's a better, more asynchronous way to get notified of these types of
> events.

This is how we have been running for a few years...
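
For anyone wrapping ibqueryerrors in the same way, here is a minimal sketch of the comparison step only: given two counter snapshots taken on successive cron runs, report the ports whose symbol error count grew by more than some threshold.  The snapshot layout (a dict keyed by NodeGUID and port), the counter key "SymbolErrorCounter", the function name, and the default threshold are all assumptions made for illustration.  Parsing the actual ibqueryerrors output is deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot layout: (NodeGUID string, port number) -> {counter name: value}
PortKey = Tuple[str, int]
Snapshot = Dict[PortKey, Dict[str, int]]


def symbol_error_increases(previous: Snapshot, current: Snapshot,
                           threshold: int = 10) -> List[Tuple[PortKey, int]]:
    """Return (port, increase) pairs where the symbol error count grew by more than threshold."""
    flagged = []
    for key, counters in current.items():
        now = counters.get("SymbolErrorCounter", 0)
        before = previous.get(key, {}).get("SymbolErrorCounter", 0)
        delta = now - before
        if delta < 0:
            # The counter was cleared between samples (for example by the perfmgr),
            # so count only what has accumulated since the clear.
            delta = now
        if delta > threshold:
            flagged.append((key, delta))
    return sorted(flagged)


if __name__ == "__main__":
    prev = {("0x0002c9020040a1b0", 1): {"SymbolErrorCounter": 5}}
    curr = {("0x0002c9020040a1b0", 1): {"SymbolErrorCounter": 50}}
    for (guid, port), delta in symbol_error_increases(prev, curr):
        print(f"{guid} port {port}: SymbolErrorCounter +{delta}")
```

What counts as "a little" is site specific, so the threshold is a parameter rather than a fixed value.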

The only other thing which is open sourced right now is:

https://github.com/weiny2/libopensmskummeeplugin

This is my previous generation of plugin for the perfmgr.  It was designed to log all the counter data to a MySQL DB which could then be read by other external programs (specifically SKUMMEE [*]).  It is very old.  We never used it for 2 reasons.

	1) Our admins here determined it was too difficult to set up.
	2) It really only told you errors on a NodeGUID/port.  You then have to look up what node/port that is.  Again too difficult to use.
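
The lookup step in (2) can be sketched as below, assuming you already maintain your own table of NodeGUID to node name.  The table contents, the GUID values, and the function names are hypothetical; nothing here comes from the plugin itself.

```python
from typing import Dict


def build_name_table(raw: Dict[str, str]) -> Dict[int, str]:
    """Key the table by integer GUID so that case and zero padding do not matter."""
    return {int(guid, 16): name for guid, name in raw.items()}


def describe_port(node_guid: str, port: int, names: Dict[int, str]) -> str:
    """Turn a NodeGUID/port pair into a label an admin can act on."""
    name = names.get(int(node_guid, 16))
    if name is None:
        return f"{node_guid} port {port} (unknown node)"
    return f"{name} port {port} ({node_guid})"


if __name__ == "__main__":
    # Hypothetical site table; in practice this would be loaded from a file.
    table = build_name_table({"0x0002c9020040a1b0": "switch-rack3"})
    print(describe_port("0x0002C9020040A1B0", 7, table))
```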

This is why we embarked on a new plugin which exports more information from OpenSM.  We are not ready to open source that yet, but we would like to someday.

Let me know if you find the above useful.

Ira

[*] http://sourceforge.net/projects/skummee/

> 
> Lloyd Brown
> Systems Administrator
> Fulton Supercomputing Lab
> Brigham Young University
> http://marylou.byu.edu
> 
> On 08/15/2012 01:27 PM, Ira Weiny wrote:
> > On Wed, 15 Aug 2012 12:41:44 -0600
> > Lloyd Brown <lloyd_brown at byu.edu> wrote:
> > 
> >> Since nobody seems to be starting any conversations on our newly-minted
> >> OFA users list, I guess I'll try.
> >>
> >> Is there any documentation somewhere that describes how to integrate
> >> trap-style events in opensm, into some external system?  For example, at
> >> the OFA User Day last week, Ira mentioned the new performance manager
> >> code in opensm, that would clear error counters when they reached 75% of
> >> the maximum, and would then send a trap about the event.
> > 
> > Sorry if I misspoke, the perfmgr does not send a trap about the event.  According to my interpretation of the spec the PM does not support InformInfo.  What it will do is log "out of band" clears which it detects as well as all non-zero error counters to the opensm.log.
> > 
> >>
> >> So far, I can see some trap related events in the opensm.log, but I have
> >> no idea how to do anything with them.  For example, I might want to
> >> execute a script, or send an SNMP trap to something else, etc.  Is there
> >> any way to integrate this, short of periodically parsing the logfile?
> >> Any equivalent to snmptrapd, to execute specific actions when specific
> >> traps are received?
> > 
> > Are you speaking of traps as defined in the spec?  The proper way to do this is to send an InformInfo "subscribe" to the SM(SA) or other class manager.  See 13.4.11 of the spec.
> > 
> > Unfortunately, right now I don't know of any software which allows for generic subscribing to the SA for traps/notices.  Nor do I know of any manager other than the SM which supports it.[*]
> > 
> > The Traps you see in OpenSM are generated by the hardware/software for various things which really help the SM effectively manage the fabric.  For example port state change traps by switches.  Other things which are less critical but still very important like node description changes have been added as time has progressed.
> > 
> > To play with this a bit you could check out ibsendtrap, which is a test utility in infiniband-diags.  (use: ./configure --enable-test-utils)[$]  But this only sends a few traps which it was coded to send and is not considered "ready for prime time" to be included in the default build.
> > 
> > Finally, this part of the spec is pretty confusing to me, so I encourage others to help me out if I have said something wrong.
> > 
> > Sorry,
> > Ira
> > 
> > [*] and frankly I am not sure of the level of support by OpenSM either.
> > [$] git://beany.openfabrics.org/~iraweiny/infiniband-diags.git
> > 
> >>
> >> Thanks,
> >> -- 
> >> Lloyd Brown
> >> Systems Administrator
> >> Fulton Supercomputing Lab
> >> Brigham Young University
> >> http://marylou.byu.edu
> >> _______________________________________________
> >> Users mailing list
> >> Users at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/users
> > 
> > 


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2 at llnl.gov


