[ofa-general] IB performance stats (revisited)

Wed Jul 11 07:29:56 PDT 2007

Hi Mark,

I published an RFC and later had discussions regarding the distribution
of query ownership of switch counters.
Whether this ownership is purely dynamic, semi-dynamic or static is an
implementation tradeoff.
However, it can be shown that the maximal number of switches a single
compute node would be responsible for is <= the number of switch levels.
So getting counters every second is not a problem.

The real issue is what to do with the volume of data collected.
This only matters when monitoring runs in "profiling mode"; otherwise
only link health errors should be reported.

My proposal is a reporting algorithm that reports only a "change of
data rate", with "change" defined "adaptively". In other words:

A node should report a change of port activity upstream only if the data
rate changed by a factor of more than X.
Assuming we want a logarithmic scale, X == 2 would work like this:

At the first sample there is no traffic. All counters will need to make
their way to the "master" node.
When traffic starts, the change in data rate is infinite (up from zero),
so every port's new rate X will be sent.
From that moment on, a port is reported again only when its data rate
reaches 2X or 0.5X.

Integration period should be configurable.
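
To make the idea concrete, here is a rough Python sketch of the reporting
rule described above. It is only an illustration, not an existing
implementation: the class name, the method name and the port labels are
made up, and it assumes each new rate is compared against the last rate
that was actually reported (rather than the previous sample), so slow
drift still triggers a report eventually. Computing the rate itself from
counter deltas over the integration period is left out.

```python
class RateChangeReporter:
    """Decide when a port's data rate is worth reporting upstream.

    Hypothetical helper: a rate is reported only if it differs from the
    last *reported* rate by at least `factor` (X in the text above).
    """

    def __init__(self, factor=2.0):
        if factor <= 1.0:
            raise ValueError("factor must be greater than 1")
        self.factor = factor
        self.last_reported = {}  # port id -> last rate sent upstream

    def sample(self, port, rate):
        """Return True if `rate` for `port` should be reported upstream."""
        last = self.last_reported.get(port)
        if last is None:
            # First sample: everything goes to the "master" node once.
            report = True
        elif last == 0.0:
            # Going from zero to any traffic is an "infinite" change.
            report = rate > 0.0
        else:
            # Report when the rate reaches factor*last or last/factor.
            report = (rate >= last * self.factor
                      or rate <= last / self.factor)
        if report:
            self.last_reported[port] = rate
        return report


if __name__ == "__main__":
    reporter = RateChangeReporter(factor=2.0)
    for rate in (0.0, 0.0, 100.0, 150.0, 200.0, 100.0, 0.0):
        print(rate, reporter.sample("switch1/port1", rate))
```

With X == 2 the number of reports per port grows with the logarithm of
the rate swing and not with the number of samples, which is the point of
defining "change" adaptively.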

I wish I had time to implement this ...

Eitan Zahavi
Senior Engineering Director, Software Architect
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL

> -----Original Message-----
> From: Mark Seger [mailto:Mark.Seger at hp.com] 
> Sent: Wednesday, July 11, 2007 5:16 PM
> To: Hal Rosenstock
> Cc: Eitan Zahavi; Ira Weiny; general at lists.openfabrics.org; 
> Ed.Finn at FMR.COM
> Subject: Re: [ofa-general] IB performance stats (revisited)
> 
> My basic philosophy, and I suspect there are those who might 
> disagree, is that you can't use the network to monitor the 
> network, at least not in times of trouble.  That's why I 
> insist on having to query the HCAs directly since I can't 
> always be sure the network is there and/or reliable.  If you 
> are willing to concede that this can indeed happen then the 
> question becomes one of how do you reliably get data from an 
> HCA and that's the basis for my (re)starting this discussion.
> 
> As for querying the switch for counters, what do you do on a 
> very large network, say 10s of thousands of nodes if you want 
> to get performance data every second?  I also realize this is 
> an extreme situation today (the node count not the frequency 
> of monitoring) but I'm sure everyone would agree systems of 
> these sizes are not that far off.
> 
> -mark
> 
> Hal Rosenstock wrote:
> 
> >Hi Eitan,
> >
> >On Wed, 2007-07-11 at 06:51, Eitan Zahavi wrote:
> >  
> >
> >>Hi Ira,
> >>
> >>    
> >>
> >>>Second, I have run some tests querying the fabric of our large 
> >>>clusters here (~500 nodes) and the results were promising for a 
> >>>single node implementation.
> >>>I don't recall the numbers as this was a while ago but it was on the
> >>>order of <2 sec and I think <1 but I don't want to be misquoted.
> >>>      
> >>>
> >>Does PerfMgr query switch ports ?
> >>    
> >>
> >
> >Yes (of course it does).
> >
> >  
> >
> >>If it does I am surprised by the short sweep time you got.
> >>
> >>Does it have >1 query on the wire at a given time?
> >>    
> >>
> >
> >Yes, Default appears to be 500 currently (maybe that needs dialing back
> >a bit) but is settable via perfmgr_max_outstanding_queries in options
> >file.
> >
> >  
> >
> >>If not then I am even more surprised.
> >>
> >>Was the cluster running a job at the time of the query ?
> >>    
> >>
> >
> >Is this question related to VL0 contention ?
> >
> >-- Hal
> >
> >  
> >
> >>Thanks
> >>
> >>Eitan Zahavi
> >>Senior Engineering Director, Software Architect Mellanox 
> Technologies 
> >>LTD
> >>Tel:+972-4-9097208
> >>Fax:+972-4-9593245
> >>P.O. Box 586 Yokneam 20692 ISRAEL
> >>
> >> 
> >>
> >>    
> >>
> >>>-----Original Message-----
> >>>From: Ira Weiny [mailto:weiny2 at llnl.gov]
> >>>Sent: Tuesday, July 10, 2007 7:47 PM
> >>>To: Eitan Zahavi
> >>>Cc: halr at voltaire.com; Mark.Seger at hp.com; 
> >>>general at lists.openfabrics.org; Ed.Finn at FMR.COM
> >>>Subject: Re: [ofa-general] IB performance stats (revisited)
> >>>
> >>>On Thu, 28 Jun 2007 10:24:59 +0300
> >>>"Eitan Zahavi" <eitan at mellanox.co.il> wrote:
> >>>
> >>>      
> >>>
> >>>>>On Wed, 2007-06-27 at 14:23, Eitan Zahavi wrote:
> >>>>>          
> >>>>>
> >>>>>>In the last months it is the second time I hear people
> >>>>>>            
> >>>>>>
> >>>>>complaining the
> >>>>>          
> >>>>>
> >>>>>>current monitoring solution in OFA is  integrated with OpenSM.
> >>>>>>            
> >>>>>>
> >>>>>I must have missed this both times (didn't see this in Mark's
> >>>>>post) and the statement itself is somewhat inaccurate as well.
> >>>>>          
> >>>>>
> >>>>Private talks - I hope they will speak up for themselves now...
> >>>>        
> >>>>
> >>>>>>These people do not use OpenSM but do use OFED.
> >>>>>>            
> >>>>>>
> >>>>>I'm not sure I'm following what you mean here.
> >>>>>
> >>>>>If you mean that some people want to run PerfMgr without
> >>>>>          
> >>>>>
> >>>the SM/SA
> >>>      
> >>>
> >>>>>aspects (so that they can run a vendor based SM), that is
> >>>>>          
> >>>>>
> >>>the next
> >>>      
> >>>
> >>>>>thing we are adding to the implementation.
> >>>>>          
> >>>>>
> >>>>Exactly. OK when is that coming?
> >>>>        
> >>>>
> >>>There is very little which ties the current PerfMgr to OpenSM.  
> >>>Basically it just gets the current fabric topology.
> >>>As Hal has said changes are coming.
> >>>
> >>>      
> >>>
> >>>>>> Another drawback is that
> >>>>>>no naming is provided and the reporting uses GUIDs.
> >>>>>>            
> >>>>>>
> >>>>>Naming is provided via NodeDescription.
> >>>>>          
> >>>>>
> >>>>This might be good for hosts but is not covering  switches ...
> >>>>        
> >>>>
> >>>It does include switches.  However, since most systems have the same
> >>>name for multiple switches this becomes ineffective.
> >>> I have queried Voltaire for a way to change the NodeDescription for
> >>>switches, but at the time I asked, there was no way to do it.
> >>>Perhaps there is now?  What about other vendors?  This is why
> >>>ibnetdiscover and other diags have "switch map" support.  (A
> >>>GUID->name mapping to override the default NodeDescription.) Nothing
> >>>would please me more than to be able to remove that for a more
> >>>"automatic" solution.
> >>>
> >>>      
> >>>
> >>>>>>I also can't hold myself from saying again I think you
> >>>>>>            
> >>>>>>
> >>>are going
> >>>      
> >>>
> >>>>>>to hit the wall with the concept of doing the PMA from
> >>>>>>            
> >>>>>>
> >>>a single node.
> >>>      
> >>>
> >>>>>If you are referring to the fact the PerfMgr is currently not 
> >>>>>distributed, that will be done as has been stated before.
> >>>>>          
> >>>>>
> >>>>Good. When is it expected? Will it be OFED 1.3?
> >>>>        
> >>>>
> >>>When Hal first sent out the PerfMgr design I thought we should jump
> >>>right to the distributed model as well.  But now I am glad we have
> >>>gone the way we did.
> >>>First off, we have something which "works" and from which we can
> >>>expand.
> >>>Second, I have run some tests querying the fabric of our large
> >>>clusters here (~500 nodes) and the results were promising for a
> >>>single node implementation.
> >>>I don't recall the numbers as this was a while ago but it was on the
> >>>order of <2 sec and I think <1 but I don't want to be misquoted.
> >>>
> >>>For sure, a distributed model offers many advantages and we will get
> >>>there.  But for many the current single node approach should work
> >>>just fine.
> >>>
> >>>Thanks,
> >>>Ira
> >>>
> >>>      
> >>>
> >>>>Thanks
> >>>>        
> >>>>
> >>>>>-- Hal
> >>>>>
> >>>>>          
> >>>>>
> >>>>>>Eitan Zahavi
> >>>>>>Senior Engineering Director, Software Architect Mellanox
> >>>>>>            
> >>>>>>
> >>>>>Technologies
> >>>>>          
> >>>>>
> >>>>>>LTD
> >>>>>>Tel:+972-4-9097208
> >>>>>>Fax:+972-4-9593245
> >>>>>>P.O. Box 586 Yokneam 20692 ISRAEL
> >>>>>>
> >>>>>> 
> >>>>>>
> >>>>>>            
> >>>>>>
> >>>>>>>-----Original Message-----
> >>>>>>>From: general-bounces at lists.openfabrics.org
> >>>>>>>[mailto:general-bounces at lists.openfabrics.org] On
> >>>>>>>              
> >>>>>>>
> >>>Behalf Of Hal
> >>>      
> >>>
> >>>>>>>Rosenstock
> >>>>>>>Sent: Wednesday, June 27, 2007 8:12 PM
> >>>>>>>To: Mark Seger
> >>>>>>>Cc: Finn, Ed; general at lists.openfabrics.org
> >>>>>>>Subject: Re: [ofa-general] IB performance stats (revisited)
> >>>>>>>
> >>>>>>>On Wed, 2007-06-27 at 13:07, Mark Seger wrote:
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>The performance managers deal with the counter
> >>>>>>>>>                  
> >>>>>>>>>
> >>>stickiness (by
> >>>      
> >>>
> >>>>>>>>>resetting them when they think they need to). They
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>typically export
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>their data although this is not specified by IBA so it is
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>in a vendor
> >>>>>>>              
> >>>>>>>
> >>>>>>>>>proprietary manner.
> >>>>>>>>> 
> >>>>>>>>>
> >>>>>>>>>                  
> >>>>>>>>>
> >>>>>>>>so I guess these guys are poor citizens as well...
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>Not sure what you mean.
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>>>the real issue as I see it then means nobody can trust
> >>>>>>>>                
> >>>>>>>>
> >>>>>the data if
> >>>>>          
> >>>>>
> >>>>>>>>random tools randomly reset the counters.  a real shame...
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>I consider this to be a real rather than random app for this. 
> >>>>>>>Guess it depends on what one considers random.
> >>>>>>>
> >>>>>>>-- Hal
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>>>>-mark
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>                
> >>>>>>>>
> >>>>>>>_______________________________________________
> >>>>>>>general mailing list
> >>>>>>>general at lists.openfabrics.org
> >>>>>>>http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
> >>>>>>>
> >>>>>>>To unsubscribe, please visit
> >>>>>>>http://openib.org/mailman/listinfo/openib-general
> >>>>>>>
> >>>>>>>              
> >>>>>>>
> >>>>>          
> >>>>>
> 
>