[ofa-general] IB performance stats (revisited)

Jason Gunthorpe jgunthorpe at obsidianresearch.com
Wed Jun 27 15:46:05 PDT 2007


On Wed, Jun 27, 2007 at 05:44:36PM -0400, Hal Rosenstock wrote:
> On Wed, 2007-06-27 at 17:26, Jason Gunthorpe wrote:
> > On Wed, Jun 27, 2007 at 05:13:40PM -0400, Hal Rosenstock wrote:
> > 
> > > > - The kernel periodically fetches the performance stats and aggregates
> > > >   them into a 64 bit wrapping counter. The kernel sends PMA mads into the
> > > >   mellanox firmware to read and reset the counters
> > > > - The new 64 bit stats are exported via sysfs/proc/whatever as
> > > >   wrapping counters
> > > > - When a PMA packet comes in the kernel services it rather than
> > > >   passing it on to the chip firmware.
> > > 
> > > In this way, both 32 and 64 bit counters could be presented by the PMA
> > > but how would it know when a counter has maxed out in terms of the
> > > PMA and how would a remote clear be handled ?
> > 
> > Each time the counter is cleared
> 
> So it doesn't matter whether the clear is local (from Linux) or remote
> (from IB), right ?
> 
> >  the kernel would store the 64 bit
> > value as the 'last PMA counter'. Then the calculation is just
> > 
> > if ((current - stored) >= saturation)
> >   return saturation;
> > return current - stored;
> > 
> > After 2**64 counts the saturation computation will stop working. It
> > would take 24 years of constant maxed out data transfer for a 12x QDR
> > link to wrap a 64 bit dword byte counter.
> 
> Is that even for the 4 octet counts ? (I didn't calculate this out).

Okay, I think a few details of this idea are being missed here.

The 64 bit non-saturating counter is internal to the Linux kernel and
is exported by sysfs/proc/netlink/whatever. Someday, if we feel it is
necessary, we could make it a 128 bit counter without affecting any of
the APIs, wire protocols, etc. 64 bits seems to be the common counter
size for other Linux network performance counters today.

Using that 64 bit counter we can emulate the current IBA PMA
specifications and have it saturate at 32 bits. This means we can
co-opt the PMA interface to the chip's firmware to extract the
counters and provide a new PMA in the Linux kernel that supports:
 1) non-saturating 64 bit counters in proc/etc for userspace
    ** This could be used by an SNMP module to export them off
       the node, or by any number of local utilities.
 2) saturating 32 bit counters for IBA PM MADs
 3) saturating 64 bit counters for new IBA PM MADs

All this would work with at least Mellanox and QLogic hardware. In
future we'd want hardware to provide direct access to non-saturating
32 or 64 bit counters, to avoid the mess of speaking PMA to the chip
firmware.

The 24 years I talked about before is how long it would take for the
algorithm I described to improperly report a non-saturated value if no
PMA counter clears were done. With a timer and an additional flag you
could make it perfect. By my math a 32 bit dword counter will reach
saturation on a 12x QDR link in 1.4 seconds, and on a 4x SDR link in
17 seconds.

Actually, I see I was off: I was counting bits, not bytes. It will take
192 years, not 24, to improperly report non-saturation at 100 gigabits (!)
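For anyone who wants to redo the arithmetic, here is a small sketch. It
assumes the counter counts dwords (32 bits each) and that the usable
data rates are 96 Gbit/s for 12x QDR and 8 Gbit/s for 4x SDR (signaling
rate less 8b/10b encoding overhead); the function name is made up.

```c
#include <stdint.h>

/* Seconds for a counter 'counter_bits' wide, counting units of
 * 'unit_bits' bits each, to reach 2^counter_bits at 'rate_bps'. */
static double seconds_to_fill(unsigned counter_bits, unsigned unit_bits,
                              double rate_bps)
{
    double max_count = 1.0;
    unsigned i;

    /* Build 2^counter_bits as a double; exact for powers of two. */
    for (i = 0; i < counter_bits; i++)
        max_count *= 2.0;

    return max_count * unit_bits / rate_bps;
}
```

seconds_to_fill(32, 32, 96e9) and seconds_to_fill(32, 32, 8e9) cover the
two 32 bit cases, and seconds_to_fill(64, 32, 100e9) divided by the
number of seconds in a year gives the 64 bit figure, which lands on the
order of 190 years.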

> The question may now be how to get from where we are today to this
> model.

Someone has to code it ;> The QLogic driver already has a lot of a PMA
in it, so factoring that into common code and requiring a new data
collection callback from the drivers seems like a reasonable start.
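To make the callback idea a bit more concrete, here is one possible
shape for it. Everything here is hypothetical (pma_collect_fn, struct
pma_agg, pma_agg_poll, demo_collect); it is not taken from any existing
driver. It only shows common code accumulating read-and-reset deltas
from a driver into the 64 bit wrapping counter.

```c
#include <stdint.h>

/* Hypothetical driver hook: returns the count accumulated since the
 * previous call and resets the hardware counter (read-and-reset). */
typedef uint32_t (*pma_collect_fn)(void *drv_priv);

/* Hypothetical common-code state for one counter on one port. */
struct pma_agg {
    pma_collect_fn collect;
    void *drv_priv;
    uint64_t total; /* 64 bit wrapping aggregate */
};

/* Meant to run from a periodic timer, often enough that the hardware
 * counter cannot saturate between two polls. */
static void pma_agg_poll(struct pma_agg *a)
{
    a->total += a->collect(a->drv_priv);
}

/* Stand-in "driver" for illustration: treats drv_priv as a pointer to
 * a plain 32 bit counter in memory. */
static uint32_t demo_collect(void *drv_priv)
{
    uint32_t *hw = drv_priv;
    uint32_t val = *hw;

    *hw = 0;
    return val;
}
```

A real driver would replace demo_collect with whatever it takes to read
and reset the counter, for example sending a PMA MAD to the firmware.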

-- 
Jason Gunthorpe <jgunthorpe at obsidianresearch.com>        (780)4406067x832
Chief Technology Officer, Obsidian Research Corp         Edmonton, Canada


