[ofa-general] [OpenSM][RFC] OpenSM Proposed Perf Manager
Hal Rosenstock
halr at voltaire.com
Mon May 14 03:58:34 PDT 2007
On Sun, 2007-05-13 at 15:55, Sasha Khapyorsky wrote:
> Hi Ira,
>
> Thanks for the great work!
Indeed :-)
> On 18:49 Tue 08 May , Ira Weiny wrote:
> > I would like to submit to the list a performance manager which I have been
> > working on for OpenSM.
> >
> > It is implemented using the first architecture model proposed by Hal (as a
> > thread integrated into OpenSM). As such it works fine on our small test
> > cluster, but there is some concern about its scalability.
> >
> > I have extended this architecture with an idea of my own: a pluggable module
> > for the "event database". With this interface one could write one's own data
> > reduction, logging, and tracking methods. Here at LLNL I propose to use this
> > to add counter and subnet events directly to our management database, which
> > is used to show system status to our operators. Other installations might
> > prefer other methods of logging, SNMP for example. This patch includes a
> > "reference" implementation of this "event database" which stores the
> > information internally until the user requests a "dump".
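A pluggable event database of this kind is usually an operations table of function pointers that the PerfMgr calls without knowing which backend sits behind it. The sketch below assumes that shape. Every name in it (event_db_ops_t, ref_ops, perfmgr_report, and the rest) is hypothetical and does not come from osm_event_db.h. The reference backend mimics the one described above: it keeps totals in memory until a dump is requested.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical ops table for a pluggable "event database". These names are
 * illustrative only; they are not taken from osm_event_db.h. */
typedef struct event_db_ops {
	const char *name;
	int (*init)(void *ctx);
	void (*record_counters)(void *ctx, uint64_t node_guid, uint8_t port,
				uint64_t rcv_data, uint64_t xmit_data);
	void (*dump)(void *ctx, FILE *out);
} event_db_ops_t;

/* Reference backend: keeps totals in memory until a "dump" is requested. */
#define REF_MAX_ENTRIES 64

typedef struct ref_entry {
	uint64_t node_guid;
	uint8_t port;
	uint64_t rcv_data;
	uint64_t xmit_data;
} ref_entry_t;

typedef struct ref_db {
	ref_entry_t entries[REF_MAX_ENTRIES];
	size_t count;
} ref_db_t;

static int ref_init(void *ctx)
{
	memset(ctx, 0, sizeof(ref_db_t));
	return 0;
}

/* Add a reading to the matching (node, port) entry, creating it if needed. */
static void ref_record(void *ctx, uint64_t guid, uint8_t port,
		       uint64_t rcv, uint64_t xmit)
{
	ref_db_t *db = ctx;

	for (size_t i = 0; i < db->count; i++) {
		ref_entry_t *e = &db->entries[i];
		if (e->node_guid == guid && e->port == port) {
			e->rcv_data += rcv;
			e->xmit_data += xmit;
			return;
		}
	}
	if (db->count < REF_MAX_ENTRIES) /* silently drop new ports when full */
		db->entries[db->count++] = (ref_entry_t){ guid, port, rcv, xmit };
}

static void ref_dump(void *ctx, FILE *out)
{
	const ref_db_t *db = ctx;

	for (size_t i = 0; i < db->count; i++)
		fprintf(out, "0x%016" PRIx64 " port %u rcv %" PRIu64 " xmit %" PRIu64 "\n",
			db->entries[i].node_guid, (unsigned)db->entries[i].port,
			db->entries[i].rcv_data, db->entries[i].xmit_data);
}

static const event_db_ops_t ref_ops = {
	"reference", ref_init, ref_record, ref_dump
};

/* The PerfMgr only ever calls through the table, never a concrete backend. */
static void perfmgr_report(const event_db_ops_t *ops, void *ctx, uint64_t guid,
			   uint8_t port, uint64_t rcv, uint64_t xmit)
{
	assert(ops != NULL && ops->record_counters != NULL);
	ops->record_counters(ctx, guid, port, rcv, xmit);
}
```

A loader could fetch such a table from a shared object with the POSIX dlopen() and dlsym() calls. That step is omitted here, so the sketch stays self-contained.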
>
> I like this event db idea, but I'm not sure it should be an integral part
> of the low-level perfmgr stuff. As currently implemented, PerfMgr simply
> doesn't work without such a plugin loaded: it unconditionally pulls all
> port counters, but has nothing to do with them without the plugin.
>
> Instead I would propose a built-in PerfMgr which is able to pull and
> store performance-related data and then call a "generic" event manager
> which can process that data. This would also allow a simpler, generic API
> for such an event db plugin, so other parts of OpenSM will be able to
> report events using the same method(s). What do you think?
Sounds better to me. Ira?
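The decoupled design Sasha describes can be sketched as two pieces. The first is a PerfMgr that keeps its own readings, so it works with no plugin at all. The second is a generic event manager that fans events out to whatever handlers are registered, and a trap receiver could report through it too. All names below (ev_mgr_t, ev_mgr_register, perfmgr_store_and_report, and the rest) are hypothetical and only illustrate the idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event types; a trap receiver could report through the same API. */
typedef enum { EV_PORT_COUNTERS, EV_TRAP, EV_TYPE_COUNT } ev_type_t;

typedef struct ev {
	ev_type_t type;
	uint64_t node_guid;
	uint8_t port;
	uint64_t value; /* e.g. a counter delta or a trap number */
} ev_t;

typedef void (*ev_handler_t)(const ev_t *ev, void *arg);

#define EV_MAX_HANDLERS 8

typedef struct ev_mgr {
	ev_handler_t handlers[EV_MAX_HANDLERS];
	void *args[EV_MAX_HANDLERS];
	size_t n;
} ev_mgr_t;

static int ev_mgr_register(ev_mgr_t *m, ev_handler_t h, void *arg)
{
	if (m->n == EV_MAX_HANDLERS)
		return -1;
	m->handlers[m->n] = h;
	m->args[m->n] = arg;
	m->n++;
	return 0;
}

/* Deliver one event to every registered handler; zero handlers is fine. */
static void ev_mgr_report(const ev_mgr_t *m, const ev_t *ev)
{
	assert(m != NULL && ev != NULL);
	for (size_t i = 0; i < m->n; i++)
		m->handlers[i](ev, m->args[i]);
}

/* Built-in PerfMgr storage: the last reading lives here, not in a plugin. */
typedef struct pm_port {
	uint64_t symbol_err_cnt;
} pm_port_t;

static void perfmgr_store_and_report(ev_mgr_t *m, pm_port_t *p, uint64_t guid,
				     uint8_t port, uint64_t reading)
{
	/* A smaller reading means the counter was cleared in between. */
	uint64_t delta = reading >= p->symbol_err_cnt ?
			 reading - p->symbol_err_cnt : reading;

	p->symbol_err_cnt = reading;
	if (delta != 0) {
		ev_t ev = { EV_PORT_COUNTERS, guid, port, delta };
		ev_mgr_report(m, &ev);
	}
}

/* Example consumer: counts events per type. */
static void count_handler(const ev_t *ev, void *arg)
{
	size_t *counts = arg;
	counts[ev->type]++;
}
```

The PerfMgr keeps working and storing data even when no handler is registered, which addresses the "doesn't work without the plugin" problem. Handlers then only have to deal with one small event type.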
> Some patch related comments are inlined below.
>
> Sasha
>
> >
> > Let the flames begin,
> > Ira Weiny
> > weiny2 at llnl.gov
> >
> >
> >
> > From 4ce288b6a5a371872cf160f6d4e29e768a065cb9 Mon Sep 17 00:00:00 2001
> > From: Ira K. Weiny <weiny2 at llnl.gov>
> > Date: Tue, 24 Apr 2007 23:44:15 -0700
> > Subject: [PATCH] OpenSM Proposed Perf Manager
> >
> > Features include:
> > * Create "PerfMgr" thread and sweep all ports on the subnet every
> > sweep_time seconds
> > * port counter clear on overflow
> > * pluggable architecture for the "event" database
> > * Machine- and human-readable output in the default event database
> > dump
> > * Control using the "perfmgr" command in the console
> >
> > Known Issues
> > * Not tested at scale.
> > * Event database should record trap events and other "interesting" subnet
> > events.
> > * port counter log warnings should be configurable, not hard coded.
> > * partitions are not handled yet.
> > * Code might not be as pristine as I would like
> >
> > Enable using --enable-perf-mgr
> >
> > Signed-off-by: Ira K. Weiny <weiny2 at llnl.gov>
> > ---
> > osm/Makefile.am | 3 +-
> > osm/config/osmvsel.m4 | 26 ++
> > osm/configure.in | 5 +-
> > osm/eventdb/Makefile.am | 37 ++
> > osm/eventdb/autogen.sh | 15 +
> > osm/eventdb/configure.in | 70 ++++
> > osm/eventdb/libibeventdb.map | 5 +
> > osm/eventdb/libibeventdb.spec.in | 38 ++
> > osm/eventdb/libibeventdb.ver | 9 +
> > osm/eventdb/src/ibeventdb.c | 622 +++++++++++++++++++++++++++++++++
> > osm/include/Makefile.am | 2 +
> > osm/include/iba/ib_types.h | 74 ++++
> > osm/include/opensm/osm_base.h | 23 ++
> > osm/include/opensm/osm_event_db.h | 151 ++++++++
> > osm/include/opensm/osm_madw.h | 40 +++
> > osm/include/opensm/osm_msgdef.h | 1 +
> > osm/include/opensm/osm_opensm.h | 4 +
> > osm/include/opensm/osm_perfmgr.h | 223 ++++++++++++
> > osm/include/opensm/osm_subnet.h | 18 +
> > osm/opensm.spec.in | 11 +-
> > osm/opensm/Makefile.am | 5 +-
> > osm/opensm/configure.in | 3 +
> > osm/opensm/main.c | 19 +
> > osm/opensm/osm_console.c | 78 +++++
> > osm/opensm/osm_event_db.c | 172 +++++++++
> > osm/opensm/osm_opensm.c | 24 ++
> > osm/opensm/osm_perfmgr.c | 686 +++++++++++++++++++++++++++++++++++++
> > osm/opensm/osm_subnet.c | 51 +++
> > osm/opensm/osm_trap_rcv.c | 15 +
> > 29 files changed, 2425 insertions(+), 5 deletions(-)
> >
[snip...]
> > diff --git a/osm/eventdb/src/ibeventdb.c b/osm/eventdb/src/ibeventdb.c
> > new file mode 100644
> > index 0000000..e98f85c
> > --- /dev/null
> > +++ b/osm/eventdb/src/ibeventdb.c
> > @@ -0,0 +1,622 @@
> > +/*
> > + * Copyright (c) 2007 The Regents of the University of California.
> > + *
> > + * This software is available to you under a choice of one of two
> > + * licenses. You may choose to be licensed under the terms of the GNU
> > + * General Public License (GPL) Version 2, available from the file
> > + * COPYING in the main directory of this source tree, or the
> > + * OpenIB.org BSD license below:
> > + *
> > + * Redistribution and use in source and binary forms, with or
> > + * without modification, are permitted provided that the following
> > + * conditions are met:
> > + *
> > + * - Redistributions of source code must retain the above
> > + * copyright notice, this list of conditions and the following
> > + * disclaimer.
> > + *
> > + * - Redistributions in binary form must reproduce the above
> > + * copyright notice, this list of conditions and the following
> > + * disclaimer in the documentation and/or other materials
> > + * provided with the distribution.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + */
> > +
> > +#if HAVE_CONFIG_H
> > +# include <config.h>
> > +#endif /* HAVE_CONFIG_H */
> > +
> > +#include <errno.h>
> > +#include <string.h>
> > +#include <stdlib.h>
> > +#include <time.h>
> > +#include <dlfcn.h>
> > +#include <stdint.h>
> > +#include <opensm/osm_event_db.h>
> > +#include <complib/cl_qmap.h>
> > +#include <complib/cl_passivelock.h>
> > +
> > +/**
> > + * Port counter object.
> > + * Store all the port counters for a single port.
> > + */
> > +typedef struct _osm_event_pc {
> > + struct {
> > + uint64_t symbol_err_cnt;
> > + uint64_t link_err_recover;
> > + uint64_t link_downed;
> > + uint64_t rcv_err;
> > + uint64_t rcv_rem_phys_err;
> > + uint64_t rcv_switch_relay_err;
> > + uint64_t xmit_discards;
> > + uint64_t xmit_constraint_err;
> > + uint64_t rcv_constraint_err;
> > + uint64_t link_int_err;
> > + uint64_t buffer_overrun_err;
> > + uint64_t vl15_dropped;
> > + uint64_t xmit_data;
> > + uint64_t rcv_data;
> > + uint64_t xmit_pkts;
> > + uint64_t rcv_pkts;
> > + time_t last_reset;
> > + } totals;
> > + osm_pc_reading_t previous;
> > +} osm_event_pc_t;
> > +
> > +/**
> > + * Group the port counters of all ports of one node, keyed by node GUID.
> > + */
> > +typedef struct _osm_pc_node {
> > + cl_map_item_t map_item; /* must be first */
> > + uint64_t node_guid;
> > + osm_event_pc_t *ports;
> > + uint8_t num_ports;
> > +} osm_pc_node_t;
>
> Is it really necessary to keep the osm_pc_node_t nodes in a separate db
> (qmap)? Why not reuse the maps that already exist in osm_subn_t (we could
> add a 'void *pm_data' or similar field to the osm_physp_t structure)?
My one concern would be how the PerfMgr evolves. This is better now, but
is it still better once the PerfMgr is separated from the SM functionality?
I know there are other things to untangle to get there.
-- Hal
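Sasha's alternative, an opaque per-port pointer in place of a separate qmap keyed by node GUID, might look roughly like the sketch below. physp_t is a cut-down, hypothetical stand-in for osm_physp_t, and pm_port_data_t, pm_get_port_data, and pm_release_port_data are also hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-port PerfMgr data (a cut-down osm_event_pc_t). */
typedef struct pm_port_data {
	uint64_t xmit_data_total;
	uint64_t rcv_data_total;
} pm_port_data_t;

/* Hypothetical stand-in for osm_physp_t with the suggested opaque field. */
typedef struct physp {
	uint64_t port_guid;
	uint8_t port_num;
	void *pm_data; /* owned by the PerfMgr; NULL until the first sweep */
} physp_t;

/* No map lookup: the data hangs directly off the SM's port object. */
static pm_port_data_t *pm_get_port_data(physp_t *p)
{
	assert(p != NULL);
	if (p->pm_data == NULL)
		p->pm_data = calloc(1, sizeof(pm_port_data_t));
	return p->pm_data; /* NULL only if calloc failed */
}

/* Must be called whenever the SM drops the port, or the data leaks. */
static void pm_release_port_data(physp_t *p)
{
	assert(p != NULL);
	free(p->pm_data);
	p->pm_data = NULL;
}
```

The lookup becomes free, but the PerfMgr data now lives and dies with the SM's port objects. Hal's concern is that this coupling would have to be undone if the PerfMgr is later split out of the SM.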
[snip...]