[ofa-general] IB performance stats (revisited)

Wed Jul 11 07:51:01 PDT 2007

Eitan Zahavi wrote:

>Hi Marc,
>
>I published an RFC and later had discussions regarding the distribution
>of query ownership of switch counters.
>Making this ownership purely dynamic, semi-dynamic or even static is an
>implementation tradeoff.
>However, it can be shown that the maximal number of switches a single
>compute node would be responsible for is <= number of switch levels. So
>no problem to get counters every second...
>
>The issue is: what do you do with the size of data collected?
>This is only relevant if monitoring is run in "profiling mode" otherwise
>only link health errors should be reported.
>  
>
I typically use IB performance data for system/application 
diagnostics.  I run a tool I wrote (see 
http://sourceforge.net/projects/collectl/) as a service on most systems, 
and it gathers hundreds of performance metrics/counters on everything 
from CPU load and memory to network, InfiniBand, and disk.  The 
philosophy here is that if something goes wrong, it may be too late to 
then run a diagnostic.  Rather, you need to have already collected the 
data, especially if the problem is intermittent.  When there is no 
need to look at the data, it just gets purged after a week.

There have been situations where someone reports that a batch program 
they ran the other day was really slow even though they didn't change 
anything.  Pulling up a monitoring log and seeing what the system was 
doing at the time of the run might reveal that their network was 
saturated and their MPI job was therefore impacted.  You can't very 
well turn on diagnostics and rerun the application, because system 
conditions have probably changed.

Does that help?  Why don't you try installing collectl and see what it 
does...

-mark