[ofa-general] IB performance stats (revisited)

Wed Jun 27 08:30:00 PDT 2007

On Wed, 2007-06-27 at 10:10, Mark Seger wrote:
> btw - I've cc'd Ed on this so be sure to include him in your replies.
> 
> Hal Rosenstock wrote:
> > On Wed, 2007-06-27 at 09:17, Mark Seger wrote:
> >   
> >> I had posted something about this some time last year but now actually 
> >> have some data to present.
> >> My problem statement with IB is there is no efficient way to get 
> >> time-oriented performance numbers for all types of IB traffic.   As far 
> >> as I know nothing is available for all types of traffic, such as MPI. 
> >>     
> >
> > Not sure what you mean here. Are you looking for MPI counters ?
> >   
> sorry for not being clearer.  I'm looking for total aggregate I/O.
> >> This is further complicated because IB counters do not wrap and as a 
> >> result when the counters are integers, they end up latching in <30 
> >> seconds when under load.
> >>     
> >
> > This is mostly a problem for the data counters. This is what the
> > > extended counters are for.
> >   
> but it's the data counters I'm interested in.

Yes, there are data counters in both PortCounters and
PortCountersExtended. The latter is an optional attribute.
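
For a rough sense of how quickly the 32-bit data counters peg, here is a
back-of-the-envelope sketch. It relies on my understanding that the
PortCounters data counters are 32 bits wide and count in units of 4 octets,
while the PortCountersExtended ones are 64 bits. The 1 GB/s sustained rate
and the function name are assumptions picked for illustration, not measured
values.

```python
def seconds_to_saturate(counter_bits, octets_per_count, octets_per_second):
    """Time for a non-wrapping counter to reach its maximum value."""
    max_octets = (2 ** counter_bits - 1) * octets_per_count
    return max_octets / octets_per_second

# Assumption: about 1 GB/s of sustained traffic on the port.
ASSUMED_RATE = 1_000_000_000

t32 = seconds_to_saturate(32, 4, ASSUMED_RATE)  # PortCounters width
t64 = seconds_to_saturate(64, 4, ASSUMED_RATE)  # PortCountersExtended width

print(f"32-bit counter pegs after {t32:.1f} s")
print(f"64-bit counter pegs after {t64 / (365 * 24 * 3600):.0f} years")
```

At that assumed rate the 32-bit counter pegs in roughly 17 seconds, which is
in line with the "<30 seconds" figure quoted above.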

> >> The only way I am aware of to do what I want to 
> >> do is by running perfquery AND then clearing the counters after each 
> >> request which by definition prevents anyone else from accessing the 
> >> counters including multiple instances of my program.
> >>     
> >
> > Yes, it is _bad_ if there are essentially multiple performance managers
> > resetting the counters.
> >   
> I realize it's bad but since the counters don't wrap I have no alternative.
> > There's now an experimental performance manager which has been discussed
> > on the list. The performance data collected can be accessed.
> >   
> alas, since I use this tool on commercial systems, I can't run it 
> against experimental code.  perhaps when the experimental becomes real I 
> can.

It should be in the OFED 1.3 timeframe. Also, there are vendor
Performance Managers too.

>   I'll try to find the notes in the archives.

I can send you this if you can't find it.

> >> To give people a better idea of what I'm talking about, below is an 
> >> extract from a utility I've written called 'collectl' which has been in 
> >> use on HP systems for about 4 years and which we've now Open Sourced at 
> >> http://sourceforge.net/projects/collectl [shameless plug].  In the 
> >> following sample I've requested cpu, network and IB stats (there are 
> >> actually a whole lot of other things you can examine and you can learn 
> >> more at http://collectl.sourceforge.net/index.html).
> >>     
> >
> > So you are looking for packets/bytes in/out only.
> >   
> That's a good start.  Since I'm using perfquery I'm also reporting 
> aggregate error counts as well as you can see in my program output 
> below.  The theory is these should rarely be set and if they are, their 
> total should be sufficient to highlight a problem without taking up a lot 
> of screen real estate.
> >> Anyhow, what 
> >> you're seeing below is a sample taken every second.  At first there is 
> >> no IB traffic.  Then I start a 'netperf' and you can see the IB stats 
> >> jump.  A few seconds later I do a 'ping -f -s50000' to the ib interface 
> >> and you can now see an increase in the network traffic.
> >>
> >> #         <--------CPU--------><-----------Network----------><----------InfiniBand---------->
> >> #Time     cpu sys inter  ctxsw netKBi pkt-in  netKBo pkt-out   KBin  pktIn  KBOut pktOut Errs
> >> 08:48:19    0   0  1046    137      0      4       0       2      0      0      0      0    0
> >> 08:48:20    2   2 18659    170      0     10       0       5    925  10767  80478  41636    0
> >> 08:48:21   14  14 92368   1882      0      9       1      10   3403  39599 463892 235588    0
> >> 08:48:22   14  14 92167   2243      0      8       0       4   3186  37081 471246 238743    0
> >> 08:48:23   12  12 92131   2382      0      3       0       2   4456  37323 470766 238488    0
> >> 08:48:24   13  13 91708   2691      7    106      12     104   7300  38542 466580 236450    0
> >> 08:48:25   14  14 91675   2763     11    175      20     175   7434  38417 463952 235146    0
> >> 08:48:26   13  13 91712   2716     11    174      20     175   7486  38464 465195 235767    0
> >> 08:48:27   14  14 91755   2742     11    171      19     171   7502  38656 465079 235720    0
> >> 08:48:28   13  13 90131   2126     12    178      20     179   8257  44080 424930 217067    0
> >> 08:48:29   13  13 89974   2389     13    191      22     191   7801  37094 457082 231523    0
> >>
> >> here's another display option where you can see just the ipoib traffic 
> >> along with other network stats
> >>
> >> # NETWORK STATISTICS (/sec)
> >> #         Num    Name  InPck  InErr OutPck OutErr   Mult   ICmp   OCmp    IKB    OKB
> >> 09:04:51    0     lo:      0      0      0      0      0      0      0      0      0
> >> 09:04:51    1   eth0:     23      0      4      0      0      0      0      1      0
> >> 09:04:51    2   eth1:      0      0      0      0      0      0      0      0      0
> >> 09:04:51    3    ib0:    900      0    900      0      0      0      0   1775   1779
> >> 09:04:51    4   sit0:      0      0      0      0      0      0      0      0      0
> >> 09:04:52    0     lo:      0      0      0      0      0      0      0      0      0
> >> 09:04:52    1   eth0:    127      0    126      0      0      0      0      8     15
> >> 09:04:52    2   eth1:      0      0      0      0      0      0      0      0      0
> >> 09:04:52    3    ib0:   2275      0   2275      0      0      0      0   4488   4497
> >> 09:04:52    4   sit0:      0      0      0      0      0      0      0      0      0
> >>
> >> While this is a relatively light-weight operation (collectl uses <0.1% 
> >> of the cpu), I still do have to call perfquery every second and that 
> >> does generate a little overhead.  Furthermore, since I'm continuously 
> >> resetting the counters multiple instances of my tool or any other tool 
> >> that relies on these counters won't work correctly!
> >>
> >> One solution that had been implemented in the Voltaire stack worked 
> >> quite well and that was a loadable module that read/cleared the HCA 
> >> counters, but exported them as wrapping counters in /proc.  That way 
> >> utilities could access the counters in /proc without stepping on each 
> >> other's toes.  
> >>     
> >
> > Once in /proc, how are they all collected up ? Via IPoIB or out of band
> > ethernet ?
> >   
> Not sure I understand the question.  They're written to /proc via a 
> module.  They're collected up via my tool simply reading them back and 
> parsing the return string which looks like
> 
> ib0-1: 1 0 1 0x0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 
> This is essentially the same data reported by get_pcounter reformatted 
> to a single line for easier/faster parsing by collectl
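
A single-line record like the one quoted above can be split with very little
code. This is only a sketch: the thread does not say what each field means,
so the values are returned as an unnamed list, and the function name is made
up for illustration.

```python
def parse_counter_line(line):
    """Split 'name: v1 v2 ...' into (name, [int, ...])."""
    name, _, rest = line.partition(":")
    # int(tok, 0) accepts plain decimal ("1") and hex ("0x0000") tokens.
    # It rejects decimals with leading zeros such as "010".
    values = [int(tok, 0) for tok in rest.split()]
    return name.strip(), values

sample = "ib0-1: 1 0 1 0x0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0"
port, values = parse_counter_line(sample)
print(port, len(values), values[:4])
```

Mapping list positions to counter names would need the module's own
documentation.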

I was thinking your tool collects this info from all nodes in the
network somehow.
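
As I understand the read/clear-and-export scheme described above, one owner
drains the hardware counter and folds each sample into a wrapping software
total, and every reader keeps its own "last seen" value. A minimal sketch of
that idea follows. The class names, the 32-bit hardware width and the 64-bit
exported width are all assumptions, and the hardware counter is simulated
rather than read from an HCA.

```python
HW_MAX = 0xFFFFFFFF        # assumed 32-bit saturating hardware counter
MASK64 = (1 << 64) - 1     # assumed width of the exported wrapping total

class SaturatingHwCounter:
    """Stand-in for an HCA counter: sticks at HW_MAX until cleared."""
    def __init__(self):
        self.value = 0

    def count(self, n):
        self.value = min(self.value + n, HW_MAX)

    def read_and_clear(self):
        value, self.value = self.value, 0
        return value

class ExportedCounter:
    """The single owner that is allowed to clear the hardware counter."""
    def __init__(self, hw):
        self.hw = hw
        self.total = 0

    def poll(self):
        # Fold the drained sample into a total that wraps instead of sticking.
        self.total = (self.total + self.hw.read_and_clear()) & MASK64
        return self.total

hw = SaturatingHwCounter()
exported = ExportedCounter(hw)
last_a = last_b = exported.total    # two independent readers

hw.count(1000)
exported.poll()
delta_a1 = (exported.total - last_a) & MASK64   # reader A samples now
last_a = exported.total

hw.count(500)
exported.poll()
delta_a2 = (exported.total - last_a) & MASK64   # reader A samples again
delta_b = (exported.total - last_b) & MASK64    # reader B samples once

print(delta_a1, delta_a2, delta_b)
```

Readers never clear anything, so they cannot disturb each other. Any other
performance manager that reads the hardware counter directly would still see
it being cleared underneath it.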

> >> While still not the best solution, as long as the counters 
> >> don't wrap in the HCA, read/clear is the only way to do what it is I'm 
> >> trying to do, unless of course someone has a better solution.
> >>     
> >
> > > Doesn't this have the same problem as doing it the PMA way ? Doesn't this
> > impact other performance managers ?
> >   
> Good point, but I guess I'm between a rock and a hard place.  imho: as 
> long as the counters don't wrap this problem will never be solved.

It's the IBTA standard (rather than IETF style counters). I don't think
it's going to change.
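
To illustrate the difference between the two counter styles, here is a small
sketch, assuming 32-bit counters and function names of my own choosing. With
a wrapping counter a reader can recover the delta with modular subtraction,
provided the counter wrapped at most once between samples. With a counter
that sticks at its maximum, a pegged reading only gives a lower bound.

```python
def wrapping_delta(prev, cur, bits=32):
    """Delta for a counter that rolls over to 0 (at most one wrap assumed)."""
    return (cur - prev) % (1 << bits)

def saturating_delta(prev, cur, bits=32):
    """Delta for a counter that sticks at its maximum; None if pegged."""
    top = (1 << bits) - 1
    if cur >= top:
        return None   # counter is stuck: the true delta is unknown
    return cur - prev

# Wrapping counter: 0xFFFFFFF0 -> 0x10 is a delta of 32 across the rollover.
print(wrapping_delta(0xFFFFFFF0, 0x10))

# Sticking counter: fine below the maximum, unknown once it pegs.
print(saturating_delta(100, 250))
print(saturating_delta(100, 0xFFFFFFFF))
```

No reset is needed in the wrapping case, which is why several readers can
share one counter.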

> I'm trying to address a specific monitoring scenario, one which collects 
> data locally for analysis after a system problem occurs.  I discovered 
> long ago that central management solutions may work fine when trying to 
> assess the health of many systems, but when something goes wrong with 
> the network the only data that can tell you what's going wrong can't get 
> back to the management station over the now broken network.  My 
> philosophy is if you want to continuously collect reliable performance 
> metrics you need to use minimal system resources to do so and that means 
> no network communications.  I guess that means people need to decide: if 
> they want to use collectl to gather local IB stats, they have to forgo 
> doing it globally.

Guess that's a tradeoff that customers may need to make. In your
environment, sounds like one turns the performance manager off.

As the PerfMgr is an unarchitected IBA component, there are no events
defined which might help with coordinating this. So either this would
need to be vendor specific, or the two tools will interfere with each
other.

> So what is the chance of ever seeing wrapping IB counters?  Probably 
> none, right?  8-(
> 
> >> I also 
> >> realize with 64 bit counters this becomes a non-issue but I'm trying to 
> >> solve the more general case.
> >>     
> >
> > More devices are supporting these and it should be easier to do so with
> > IBA 1.2.1
> >   
> Is there an easy way to tell how wide the counters are via software?  Do 
> any utilities currently report this?

Yes, via the PMA it can be done with some extra queries.

-- Hal

> > -- Hal
> >
> >   
> >> comments?  flames?  8-)
> >>
> >> -mark
> >>
> >>     
>