[ofa-general] IB performance stats (revisited)
Mark Seger
Mark.Seger at hp.com
Wed Jun 27 07:10:00 PDT 2007
btw - I've cc'd Ed on this so be sure to include him in your replies.
Hal Rosenstock wrote:
> On Wed, 2007-06-27 at 09:17, Mark Seger wrote:
>
>> I had posted something about this some time last year but now actually
>> have some data to present.
>> My problem statement with IB is that there is no efficient way to get
>> time-oriented performance numbers for all types of IB traffic. As far
>> as I know, nothing is available that covers all types of traffic, such
>> as MPI.
>>
>
> Not sure what you mean here. Are you looking for MPI counters?
>
sorry for not being clearer. I'm looking for total aggregate I/O.
>> This is further complicated because IB counters do not wrap; as a
>> result, when the counters are only 32 bits wide, they end up latching
>> in <30 seconds under load.
>>
>
> This is mostly a problem for the data counters. This is what the
> extended counters are for
>
but it's the data counters I'm interested in.
>> The only way I am aware of to do what I want is to run perfquery AND
>> then clear the counters after each request, which by definition
>> prevents anyone else from using the counters, including multiple
>> instances of my program.
>>
>
> Yes, it is _bad_ if there are essentially multiple performance managers
> resetting the counters.
>
I realize it's bad but since the counters don't wrap I have no alternative.
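Just to put a number on that, here's a quick back-of-the-envelope sketch
(python here, even though collectl itself is perl). It assumes the 32-bit
data counters tick once per 4 octets and uses rough, illustrative link
data rates rather than anything measured:

# Rough estimate of how quickly a 32-bit PortXmitData/PortRcvData counter
# latches. The data counters tick once per 4 octets; the link rates below
# are illustrative assumptions, not measurements.
COUNTER_MAX = 2**32 - 1        # 32-bit data counter
BYTES_PER_TICK = 4             # data counters count 4-octet words

def seconds_to_latch(bytes_per_sec):
    """Seconds until the counter saturates at a sustained transfer rate."""
    return (COUNTER_MAX * BYTES_PER_TICK) / bytes_per_sec

for label, rate in (("SDR 4x (~1 GB/s)", 1.0e9), ("DDR 4x (~2 GB/s)", 2.0e9)):
    print("%s: latches after ~%.0f seconds" % (label, seconds_to_latch(rate)))

At roughly 1 GB/s that's well under 30 seconds, which is exactly what I
see in practice.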
> There's now an experimental performance manager which has been discussed
> on the list. The performance data collected can be accessed.
>
Alas, since I use this tool on commercial systems, I can't run it
against experimental code. Perhaps when the experimental code becomes
official I can. I'll try to find the notes in the archives.
>> To give people a better idea of what I'm talking about, below is an
>> extract from a utility I've written called 'collectl' which has been in
>> use on HP systems for about 4 years and which we've now Open Sourced at
>> http://sourceforge.net/projects/collectl [shameless plug]. In the
>> following sample I've requested cpu, network and IB stats (there are
>> actually a whole lot of other things you can examine and you can learn
>> more at http://collectl.sourceforge.net/index.html).
>>
>
> So you are looking for packets/bytes in/out only.
>
That's a good start. Since I'm using perfquery I'm also reporting
aggregate error counts, as you can see in my program output below. The
theory is these should rarely be set and, if they are, their total
should be sufficient to highlight a problem without taking up a lot of
screen real estate.
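For the curious, here's a stripped-down sketch of the sampling loop
(python rather than the perl collectl actually uses). It assumes the
infiniband-diags perfquery with a reset-after-read option (-R on my
build) and counter labels matching what my version prints, so treat the
flag and field names as assumptions:

# Poll perfquery once per interval with reset-after-read, so each sample is
# already the delta for that interval. This is exactly the behavior that
# breaks any other consumer of the same counters.
import re
import subprocess
import time

# Counter labels vary between perfquery versions; these are assumptions.
ERROR_FIELDS = ("SymbolErrors", "LinkDowned", "RcvErrors", "XmtDiscards")

def sample(lid, port):
    """Run perfquery -R against lid/port and return a {counter: value} dict."""
    out = subprocess.check_output(["perfquery", "-R", str(lid), str(port)],
                                  universal_newlines=True)
    return {k: int(v) for k, v in re.findall(r"^(\w+):\.*(\d+)", out, re.M)}

lid, port, interval = 4, 1, 1.0        # hypothetical target port
while True:
    c = sample(lid, port)
    kb_in = c.get("RcvData", 0) * 4 / 1024     # data counters are 4-byte words
    kb_out = c.get("XmtData", 0) * 4 / 1024
    errs = sum(c.get(f, 0) for f in ERROR_FIELDS)
    print("KBin %6d KBOut %6d Errs %d" % (kb_in, kb_out, errs))
    time.sleep(interval)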
>> Anyhow, what
>> you're seeing below is a sample taken every second. At first there is
>> no IB traffic. Then I start a 'netperf' and you can see the IB stats
>> jump. A few seconds later I do a 'ping -f -s50000' to the ib interface
>> and you can now see an increase in the network traffic.
>>
>> #        <--------CPU--------><-----------Network----------><----------InfiniBand---------->
>> #Time    cpu sys inter ctxsw netKBi pkt-in netKBo pkt-out  KBin pktIn  KBOut pktOut Errs
>> 08:48:19   0   0  1046   137      0      4      0       2     0     0      0      0    0
>> 08:48:20   2   2 18659   170      0     10      0       5   925 10767  80478  41636    0
>> 08:48:21  14  14 92368  1882      0      9      1      10  3403 39599 463892 235588    0
>> 08:48:22  14  14 92167  2243      0      8      0       4  3186 37081 471246 238743    0
>> 08:48:23  12  12 92131  2382      0      3      0       2  4456 37323 470766 238488    0
>> 08:48:24  13  13 91708  2691      7    106     12     104  7300 38542 466580 236450    0
>> 08:48:25  14  14 91675  2763     11    175     20     175  7434 38417 463952 235146    0
>> 08:48:26  13  13 91712  2716     11    174     20     175  7486 38464 465195 235767    0
>> 08:48:27  14  14 91755  2742     11    171     19     171  7502 38656 465079 235720    0
>> 08:48:28  13  13 90131  2126     12    178     20     179  8257 44080 424930 217067    0
>> 08:48:29  13  13 89974  2389     13    191     22     191  7801 37094 457082 231523    0
>>
>> here's another display option where you can see just the ipoib traffic
>> along with other network stats
>>
>> # NETWORK STATISTICS (/sec)
>> #        Num   Name InPck InErr OutPck OutErr Mult ICmp OCmp   IKB   OKB
>> 09:04:51   0    lo:     0     0      0      0    0    0    0     0     0
>> 09:04:51   1  eth0:    23     0      4      0    0    0    0     1     0
>> 09:04:51   2  eth1:     0     0      0      0    0    0    0     0     0
>> 09:04:51   3   ib0:   900     0    900      0    0    0    0  1775  1779
>> 09:04:51   4  sit0:     0     0      0      0    0    0    0     0     0
>> 09:04:52   0    lo:     0     0      0      0    0    0    0     0     0
>> 09:04:52   1  eth0:   127     0    126      0    0    0    0     8    15
>> 09:04:52   2  eth1:     0     0      0      0    0    0    0     0     0
>> 09:04:52   3   ib0:  2275     0   2275      0    0    0    0  4488  4497
>> 09:04:52   4  sit0:     0     0      0      0    0    0    0     0     0
>>
>> While this is a relatively lightweight operation (collectl uses <0.1%
>> of the cpu), I still have to call perfquery every second and that does
>> generate a little overhead. Furthermore, since I'm continuously
>> resetting the counters, multiple instances of my tool, or any other
>> tool that relies on these counters, won't work correctly!
>>
>> One solution that had been implemented in the Voltaire stack and worked
>> quite well was a loadable module that read/cleared the HCA counters but
>> exported them as wrapping counters in /proc. That way utilities could
>> access the counters in /proc without stepping on each other's toes.
>>
>
> Once in /proc, how are they all collected up? Via IPoIB or out-of-band
> ethernet?
>
Not sure I understand the question. They're written to /proc via a
module. My tool collects them simply by reading them back and parsing
the returned string, which looks like:
ib0-1: 1 0 1 0x0000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
This is essentially the same data reported by get_pcounter, reformatted
into a single line for easier/faster parsing by collectl.
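Just to illustrate the consumer side of that approach, here's a minimal
sketch in python. The /proc path and the meaning of the individual
fields are hypothetical; only the "device-port: values..." shape of the
line above is real, and I'm assuming the module exports 32-bit wrapping
counters:

# Minimal sketch of the consumer side of the /proc approach: read the one-line
# records, split them, and take wrap-safe deltas so multiple readers can share
# the counters without ever resetting them. The path and field layout are
# assumptions; only the "device-port: values..." shape comes from the sample.
PROC_PATH = "/proc/infiniband/ib_counters"   # hypothetical location
WRAP = 2**32                                 # assuming 32-bit exported counters

def read_counters(path=PROC_PATH):
    """Return {(device, port): [counter values]} from one-line records."""
    stats = {}
    with open(path) as f:
        for line in f:
            name, rest = line.split(":", 1)
            dev, port = name.rsplit("-", 1)
            stats[(dev, int(port))] = [int(v, 0) for v in rest.split()]
    return stats

def delta(new, old):
    """Per-counter difference that tolerates a single wrap between samples."""
    return [(n - o) % WRAP for n, o in zip(new, old)]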
>> While still not the best solution, as long as the counters don't wrap
>> in the HCA, read/clear is the only way to do what I'm trying to do,
>> unless of course someone has a better solution.
>>
>
> Doesn't this have the same problem as doing it the PMA way? Doesn't
> this impact other performance managers?
>
Good point, but I guess I'm between a rock and a hard place. IMHO, as
long as the counters don't wrap this problem will never be solved.
I'm trying to address a specific monitoring scenario, one that collects
data locally for analysis after a system problem occurs. I discovered
long ago that central management solutions may work fine when trying to
assess the health of many systems, but when something goes wrong with
the network, the only data that can tell you what went wrong can't get
back to the management station over the now-broken network. My
philosophy is that if you want to continuously collect reliable
performance metrics you need to use minimal system resources to do so,
and that means no network communications. I guess that means people
need to decide: if they want to use collectl to gather local IB stats,
they have to forgo doing it globally.
So what is the chance of ever seeing wrapping IB counters? Probably
none, right? 8-(
>> I also realize that with 64-bit counters this becomes a non-issue, but
>> I'm trying to solve the more general case.
>>
>
> More devices are supporting these and it should be easier to do so with
> IBA 1.2.1
>
Is there an easy way to tell how wide the counters are via software? Do
any utilities currently report this?
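If nothing reports it directly, one thing I could imagine doing from a
script is simply asking for the extended counters and seeing whether the
query succeeds. This assumes the infiniband-diags perfquery and that its
-x option requests the 64-bit PortExtendedCounters; check your man page
before trusting it:

# Best-effort probe for 64-bit data counters: ask the PMA for the extended
# counter set and fall back to assuming 32-bit if the query is rejected.
# The -x flag is assumed to request PortExtendedCounters on this build.
import subprocess

def has_extended_counters(lid, port):
    """True if a perfquery -x (extended counters) query against lid/port works."""
    result = subprocess.run(["perfquery", "-x", str(lid), str(port)],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

lid, port = 4, 1                       # hypothetical target port
if has_extended_counters(lid, port):
    print("PMA answers extended (64-bit) data counter queries")
else:
    print("falling back to the 32-bit PortCounters set")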
> -- Hal
>
>
>> comments? flames? 8-)
>>
>> -mark
>>