[openib-general] [RFC] Performance Manager
Hal Rosenstock
halr at voltaire.com
Fri Jan 26 14:52:21 PST 2007
Hi,
Below is a proposal for an OpenFabrics/OpenIB performance manager
(PerfManager). It includes a phased implementation plan. Comments
welcome. Thanks in advance.
-- Hal
Performance Manager
This document will describe an architecture and a phased plan
for an OpenFabrics OpenIB performance manager.
Currently, there is no open source performance manager, only the
perfquery diagnostic tool, which some have scripted into a
"poor man's" performance manager.
The primary responsibilities of the performance manager are to:
1. Monitor subnet topology
2. Based on the subnet topology, monitor performance and error counters,
and possibly counters related to congestion.
3. Perform data reduction (various calculations: rates, histograms, etc.)
on the counters obtained
4. Log performance data and indicate "interesting" related events
Performance Manager Components
1. Determine subnet topology
The performance manager can determine the subnet topology by subscribing
for GID in-service and out-of-service events. Upon receipt of a GID in
service event, it uses the GID to query the SA for the corresponding LID
via SubnAdmGet(NodeRecord) with the PortGUID specified. The LID and
NumPorts returned are then added to the monitoring list. Note that the
monitoring list can be extended to be distributed, with the manager
"balancing" the assignments of new GIDs across the set of known monitors.
For GID out of service events, the GID is removed from the monitoring list.
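As a rough illustration of this flow, a minimal sketch in C follows. The
event callback, the SA query helper and the monitoring list helpers are
hypothetical placeholders (not existing OpenSM or libibumad APIs), shown
only to make the sequence concrete.

/* Sketch only: sa_get_node_record_by_port_guid() and the monitor_list
 * helpers are hypothetical placeholders for whatever SA query and list
 * primitives the real PerfManager would use. */

#include <stdint.h>

struct monitored_node {
        uint64_t node_guid;
        uint16_t lid;
        uint8_t  num_ports;
};

/* Hypothetical result of SubnAdmGet(NodeRecord) with PortGUID specified */
struct node_record {
        uint64_t node_guid;
        uint16_t lid;
        uint8_t  num_ports;
};

extern int sa_get_node_record_by_port_guid(uint64_t port_guid,
                                           struct node_record *rec);
extern void monitor_list_add(const struct monitored_node *node);
extern void monitor_list_remove_by_guid(uint64_t port_guid);

/* Called when a GID in service (trap 64) or out of service (trap 65)
 * report arrives via the SA event forwarding mechanism. */
void handle_gid_event(int in_service, uint64_t port_guid)
{
        if (!in_service) {
                monitor_list_remove_by_guid(port_guid);
                return;
        }

        struct node_record rec;
        if (sa_get_node_record_by_port_guid(port_guid, &rec))
                return;         /* SA query failed; retry policy is TBD */

        struct monitored_node node = {
                .node_guid = rec.node_guid,
                .lid       = rec.lid,
                .num_ports = rec.num_ports,
        };
        monitor_list_add(&node);
}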
2. Monitoring
Counters to be monitored include performance counters (data octets and
packets, both receive and transmit) and error counters. These are all in
the mandatory PortCounters attribute. Future support will include the
optional 64 bit counters in PortExtendedCounters (as these are currently
only known to be supported on one IB device). One congestion counter
(PortXmitWait) will also be monitored initially (on switch ports).
Polling rather than sampling will be used as the monitoring technique. The
polling rate is configurable from 1 to 65535 seconds (default TBD).
Note that with 32 bit counters, byte counts can max out in about 16 seconds
on 4x SDR links and about 8 seconds on 4x DDR links: the data counters count
in units of 4 octets, so a 32 bit counter saturates after roughly 16 GB of
data, which a 4x SDR link (~1 GB/s of data) can generate in about 16 seconds.
The polling rate needs to deal with this if accurate byte and packet rates
are desired. Since IB counters are sticky (they saturate rather than wrap),
the counters need to be reset when they get "close" to maxing out. This will
result in some inaccuracy. When counters are reset, the time of the reset
will be tracked in the monitor and will be queryable. Note that when the
64 bit counters are supported more generally, the polling rate can be
reduced.
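A minimal sketch of the counter bookkeeping this implies is shown below,
assuming a hypothetical per-port state structure. The 75% reset threshold
and the 64 bit software accumulation are illustrative choices, not settled
design.

/* Sketch only: accumulate a 32 bit PortCounters value into a 64 bit
 * software total and decide when to reset the hardware counter before
 * it saturates (IB counters stick at their maximum value).  The actual
 * PortCounters read and reset are left to the (hypothetical) MAD layer. */

#include <stdint.h>
#include <time.h>

#define RESET_THRESHOLD 0xC0000000U     /* ~75% of the 32 bit maximum */

struct counter_state {
        uint64_t total;         /* accumulated 64 bit value */
        uint32_t last_read;     /* last raw 32 bit reading */
        time_t   last_reset;    /* queryable time of the last reset */
};

/* Returns nonzero when the caller should issue a PortCounters reset
 * (PerfMgt Set) for this counter.  Any traffic between this reading and
 * the reset taking effect is lost, hence the noted inaccuracy. */
int update_counter(struct counter_state *cs, uint32_t raw)
{
        /* The counter is sticky, not wrapping, so the delta is monotonic
         * between resets. */
        cs->total += raw - cs->last_read;
        cs->last_read = raw;

        if (raw >= RESET_THRESHOLD) {
                cs->last_read = 0;
                cs->last_reset = time(NULL);
                return 1;
        }
        return 0;
}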
The performance manager will support parallel queries. The level of
parallelism is configurable with a default of 64 queries outstanding
at one time.
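One possible way to enforce this limit is sketched below. The asynchronous
query helper issue_port_counters_query() and its completion callback are
hypothetical, and the counting semaphore is just one way to bound the
number of queries in flight.

/* Sketch only: bound the number of outstanding PerfMgt queries with a
 * counting semaphore.  issue_port_counters_query() is a hypothetical
 * asynchronous MAD helper that invokes the callback on completion. */

#include <semaphore.h>
#include <stdint.h>

#define DEFAULT_MAX_OUTSTANDING 64      /* configurable knob */

struct monitored_node {                 /* as in the topology sketch above */
        uint64_t node_guid;
        uint16_t lid;
        uint8_t  num_ports;
};

static sem_t outstanding;

extern int issue_port_counters_query(uint16_t lid, uint8_t port,
                                     void (*done)(uint16_t lid, uint8_t port));

static void query_done(uint16_t lid, uint8_t port)
{
        /* ... accumulate the returned PortCounters for (lid, port) ... */
        sem_post(&outstanding);         /* free a slot */
}

void poller_init(unsigned max_outstanding)
{
        sem_init(&outstanding, 0, max_outstanding ? max_outstanding
                                                  : DEFAULT_MAX_OUTSTANDING);
}

void poll_all_ports(const struct monitored_node *nodes, int num_nodes)
{
        for (int i = 0; i < num_nodes; i++)
                for (uint8_t p = 1; p <= nodes[i].num_ports; p++) {
                        sem_wait(&outstanding); /* block at the limit */
                        if (issue_port_counters_query(nodes[i].lid, p,
                                                      query_done))
                                sem_post(&outstanding); /* send failed */
                }
}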
Configuration and dynamic adjustment of any performance manager "knobs"
will be supported.
There will also be a console interface to obtain performance data. It will
be able to reset counters and report on specific nodes or node types of
interest (CAs only, switches only, all, etc.). The specifics are TBD.
3. Data Reduction
For errors, a rate rather than the raw value will be calculated. An error
event is only indicated when the rate exceeds a threshold.
For packet and byte counters, small changes will be aggregated and only
significant changes will be reported.
Aggregated histograms for each counter will be provided (per node; whether
also across all nodes is TBD). Actual counters will also be written to
files, with the NodeGUID used to identify the node. File formats are TBD;
one format to be supported might be CSV.
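A minimal sketch of this reduction is shown below. The per-second rate,
the threshold units, and the CSV column layout are illustrative assumptions
only, since the actual thresholds and file formats are TBD.

/* Sketch only: turn successive error counter readings into a rate, flag
 * an "interesting" event when a configurable threshold is exceeded, and
 * emit one possible CSV line.  Assumes the value passed in is the 64 bit
 * accumulated total, so no reset or wrap handling is needed here. */

#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct error_reduction {
        uint64_t last_value;
        time_t   last_time;
        double   threshold;     /* errors per second, configurable */
};

double reduce_error_counter(struct error_reduction *er, const char *name,
                            uint64_t node_guid, uint64_t value, time_t now)
{
        double interval = difftime(now, er->last_time);
        double rate = interval > 0 ? (value - er->last_value) / interval : 0.0;

        if (rate > er->threshold)
                fprintf(stderr, "event: node 0x%016llx %s rate %.2f/s "
                        "exceeds threshold %.2f/s\n",
                        (unsigned long long)node_guid, name, rate,
                        er->threshold);

        er->last_value = value;
        er->last_time = now;
        return rate;
}

/* One possible CSV line: timestamp, NodeGUID, counter name, value, rate */
void write_csv_sample(FILE *f, time_t now, uint64_t node_guid,
                      const char *name, uint64_t value, double rate)
{
        fprintf(f, "%ld,0x%016llx,%s,%llu,%.2f\n", (long)now,
                (unsigned long long)node_guid, name,
                (unsigned long long)value, rate);
}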
4. Logging
"Interesting" events determined by the performance manager will be
logged as well as the performance data itself. There are some interesting
scalability issues here especially for the distributed model.
Events will be based on rates which are configured as thresholds.
There will be configurable thresholds for the error counters with
reasonable defaults. Correlation of PerfManager and SM events is
interesting but not a mandatory requirement.
Performance Manager Scalability
Clearly as the polling rate goes up, the number of nodes which can be
monitored from a single performance management node decreases. There is
some evidence that a single dedicated management node may not be able to
monitor the largest clusters at a rapid rate.
There are numerous PerfManager models which can be supported:
1. Integrated as thread(s) with OpenSM (run only when SM is master)
2. Standby SM
3. Standalone PerfManager (not running with master or standby SM)
4. Distributed PerfManager (most scalable approach)
The simplest model is to run the PerfManager with the master SM. This is
the simplest model but has the least scalability. Note that in this model
the topology can be obtained without the GID in and out of service events,
but these events are needed for any of the other models to be supported.
The next model is to run the PerfManager with a standby SM. Standbys are
not doing much currently (just polling the master), so there is much idle
CPU. The downside of this approach is that if the standby takes over as
master, the PerfManager would need to be moved (or it becomes model 1).
A totally separate standalone PerfManager would allow for a deployment
model which eliminates the downside of model 2 (standby SM). It could
still be built in a manner similar to model 2, with unneeded functions
(SM and SA) not included.
The most scalable model is a distributed PerfManager. One approach to
distribution is a hierarchical model where there is a PerfManager at the
top level with a number of PerfMonitors which are each responsible for
some portion of the subnet.
The separation of PerfManager from OpenSM brings up the following additional
issues:
1. What communication is needed between OpenSM and the PerfManager?
2. Integration of interesting events with the OpenSM log
(Does the performance manager assume OpenSM? Does it need to work with
vendor SMs?)
Hierarchical distribution brings up some additional issues:
1. How is the hierarchy determined?
2. How do the PerfManager and PerfMonitors find each other?
3. How is the subnet divided amongst the PerfMonitors?
4. How do the PerfManager and the PerfMonitors communicate?
In terms of inter-manager communication, there seem to be several
choices:
1. Use vendor-specific MADs (which can be RMPP'd) and build on top of
this
2. Use IPoIB, which is much more powerful as sockets can then be utilized.
The only downside of IPoIB is that it requires multicast to be functioning.
It seems reasonable to require IPoIB across the management nodes. This
can either be a separate IPoIB subnet or one shared with other endnodes
on the subnet. (If this communication is built on top of sockets, it
can be any IP subnet amongst the manager nodes.)
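For illustration only, below is a minimal sketch of a PerfMonitor pushing
a single counter report to the PerfManager over a TCP socket (which could
run over IPoIB or any other IP subnet reachable by the management nodes).
The port number and the toy wire format are arbitrary assumptions and say
nothing about the eventual protocol.

/* Sketch only: send one counter report over TCP.  PERFMGR_PORT and
 * struct counter_report are arbitrary choices for this sketch; a real
 * implementation would define a proper wire format (byte order, framing,
 * versioning) and reuse connections. */

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#define PERFMGR_PORT 18515              /* arbitrary choice */

struct counter_report {
        uint64_t node_guid;
        uint8_t  port;
        uint64_t xmit_data;
        uint64_t rcv_data;
};

int send_report(const char *perfmgr_ip, const struct counter_report *rep)
{
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
                return -1;

        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(PERFMGR_PORT),
        };
        if (inet_pton(AF_INET, perfmgr_ip, &addr.sin_addr) != 1 ||
            connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                close(fd);
                return -1;
        }

        ssize_t n = write(fd, rep, sizeof(*rep));
        close(fd);
        return n == (ssize_t)sizeof(*rep) ? 0 : -1;
}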
The first implementation phase will address models 1-3. Model 3 is optional
as it is similar to models 1 and 2 and may not be needed.
Model 4 will be addressed in a subsequent implementation phase (and a future
version of this document). Model 4 can be built on the basis of models 1 and
2 where some SM, not necessarily master, is the PerfManager and the rest are
PerfMonitors.
Performance Manager Redundancy
TBD (future version of this document)
Congestion Management
TBD (future version of this document)
QoS Management
TBD (future version of this document)