[ofa-general] IB performance stats (revisited)
Mark Seger
Mark.Seger at hp.com
Wed Jul 11 08:56:31 PDT 2007
>Hi Marc,
>
>I wish I had a large enough fabric worth testing collectl on...
>
>
there may be a disconnect here, as collectl collects data locally. On a
typical system, taking 10-second samples for all the different
subsystems it supports (though you can certainly turn up the frequency
if you like) takes about 2MB/day, and the data is retained for a week.
It supports OFED out of the box, using perfquery to read/clear the
counters. Just install it and type:
collectl -scmx -oTm    (lots of other combinations of choices)
and you'll see data for cpu, memory and interconnect with millisecond
timestamps as follows:
#             <--------CPU--------><-----------Memory----------><----------InfiniBand---------->
#Time          cpu sys inter ctxsw free buff cach inac slab  map  KBin pktIn KBOut pktOut Errs
11:55:06.004     0   0   261    44   7G  46M 268M 151M 249M  21M     0     0     0      0    0
11:55:07.004     0   0   275    61   7G  46M 268M 151M 249M  21M     0     0     0      0    0
11:55:08.004     0   0   251    18   7G  46M 268M 151M 249M  21M     0     0     0      0    0
11:55:09.004     0   0   251    23   7G  46M 268M 151M 249M  21M     0     0     0      0    0
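if you're curious what's happening under the covers for the IB columns,
here's a rough sketch of the idea (in python rather than collectl's perl;
the counter names and the 4-byte-word scaling of the *Data counters are
assumptions you'd want to verify against your own perfquery output):

#!/usr/bin/env python
# sketch: sample the local HCA port counters with perfquery, wait an
# interval, sample again and print per-second rates -- essentially what
# the KBin/pktIn/KBOut/pktOut columns above show.
import re
import subprocess
import time

COUNTERS = ("PortXmitData", "PortRcvData", "PortXmitPkts", "PortRcvPkts")

def read_counters():
    # assumption: perfquery with no args queries the local port and prints
    # lines like "PortXmitData:....1234"
    out = subprocess.check_output(["perfquery"]).decode()
    vals = {}
    for name in COUNTERS:
        m = re.search(r"%s:\.*(\d+)" % name, out)
        if m:
            vals[name] = int(m.group(1))
    return vals

def show_rates(interval=1.0):
    before = read_counters()
    time.sleep(interval)
    after = read_counters()
    delta = dict((k, after[k] - before[k]) for k in COUNTERS)
    # assumption: the *Data counters are in 4-byte words, hence the *4
    print("KBin %6.0f pktIn %6.0f KBOut %6.0f pktOut %6.0f" % (
        delta["PortRcvData"] * 4 / 1024.0 / interval,
        delta["PortRcvPkts"] / interval,
        delta["PortXmitData"] * 4 / 1024.0 / interval,
        delta["PortXmitPkts"] / interval))

if __name__ == "__main__":
    show_rates()

collectl itself reads and then clears the counters (the read/clear mode
mentioned above, i.e. perfquery -R) rather than diffing raw values, which
also keeps the narrow data counters from wrapping between samples.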
>I did the math for how much data would be collected for a 10K-node
>cluster. It is ~7MB for each iteration:
>10K ports
>* 6 (3 level fabric * 2 ports on each link)
>* 32 bytes (data/pkts tx/rx) + 22 bytes (err counters) + 64 bytes (cong
>counters) = 116 bytes
>
>Seems reasonable - but adds up to a large amount of data over a day,
>assuming a collection every second:
>24*60*60 * 116 * 10000 * 6 = 6.01344e+11 bytes of storage
>
>
no disagreement. that's why I chose NOT to try to solve the distributed
data collection problem; collectl runs locally with <0.1% cpu overhead.
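fwiw, here's that same arithmetic as a couple of lines of python, using
the per-port byte counts from your estimate above, just to make the
comparison with the ~2MB/day collectl keeps on each node concrete:

# back-of-the-envelope check of the fabric-wide numbers above
nodes           = 10000
ports_per_node  = 6               # 3-level fabric * 2 ports on each link
bytes_per_port  = 32 + 22 + 64    # data/pkt tx/rx + err + congestion counters
seconds_per_day = 24 * 60 * 60

per_iteration = nodes * ports_per_node * bytes_per_port
per_day       = per_iteration * seconds_per_day
print("per iteration: %.1f MB" % (per_iteration / 1e6))    # ~7 MB
print("per day at 1s: %.0f GB" % (per_day / 1e9))          # ~600 GB

that's a bit over half a TB/day if it all gets shipped to one place,
versus a couple of MB/day sitting on each node's local disk.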
-mark
>Eitan Zahavi
>Senior Engineering Director, Software Architect
>Mellanox Technologies LTD
>Tel:+972-4-9097208
>Fax:+92-4-9593245
>P.O. Box 586 Yokneam 20692 ISRAEL
>
>>-----Original Message-----
>>From: Mark Seger [mailto:Mark.Seger at hp.com]
>>Sent: Wednesday, July 11, 2007 5:51 PM
>>To: Eitan Zahavi
>>Cc: Hal Rosenstock; Ira Weiny; general at lists.openfabrics.org;
>>Ed.Finn at FMR.COM
>>Subject: Re: [ofa-general] IB performance stats (revisited)
>>
>>
>>
>>Eitan Zahavi wrote:
>>
>>
>>
>>>Hi Marc,
>>>
>>>I published an RFC and later had discussions regarding the distribution
>>>of query ownership of switch counters.
>>>Making this ownership purely dynamic, semi-dynamic or even static is an
>>>implementation tradeoff.
>>>However, it can be shown that the maximal number of switches a single
>>>compute node would be responsible for is <= number of switch levels. So
>>>no problem to get counters every second...
>>>
>>>The issue is: what do you do with the size of data collected?
>>>This is only relevant if monitoring is run in "profiling mode"
>>>otherwise only link health errors should be reported.
>>>
>>I typically use IB performance data for
>>system/application diagnostics. I run a tool I wrote (see
>>http://sourceforge.net/projects/collectl/) as a service on
>>most systems and it gathers hundreds of performance
>>metrics/counters on everything from cpu load, memory,
>>network, infiniband, disk, etc. The philosophy here is that
>>if something goes wrong, it may be too late to then run some
>>diagnostic. Rather, you need to have already collected the
>>data, especially if the problem is intermittent. When
>>there is no need to look at the data, it just gets purged
>>away after a week.
>>
>>There have been situations where someone reports that a batch
>>program they ran the other day was really slow and they
>>didn't change anything. Being able to pull up a
>>monitoring log and see what the system was doing at the
>>time of the run might reveal that their network was saturated
>>and therefore their MPI job was impacted. You can't very well
>>turn on diagnostics and rerun the application because system
>>conditions have probably changed.
>>
>>Does that help? Why don't you try installing collectl and
>>see what it does...
>>
>>-mark
>>