[ofw][patch][ND provider] Improving latency of ms-mpi

Tzachi Dar tzachid at mellanox.co.il
Sun Aug 9 02:07:44 PDT 2009


One more issue to note:
Using inline or not also depends on the system that is being used. For
example, we have noticed that on Nehalem systems inline is more important
than on previous Intel processors. The amount of memory on the system, as
well as the cluster size, also influences the choice of max inline size.
To understand this, think of a cluster of N machines with M cores each,
where each core talks to all the others, so on each machine I'll open
roughly N*M^2 QPs. I guess that if my cluster is small I can live with a
large inline size, but the bigger it is, the more memory those
inline-sized send queues will consume.
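
Just to make the scale concrete, here is a back-of-the-envelope sketch;
every number in it (machine count, core count, queue depth, WQE sizes) is
an assumed example for illustration, not something the driver actually
allocates:

/* Back-of-the-envelope only: all numbers are assumed examples. */
#include <stdio.h>

int main(void)
{
    const unsigned long long machines = 64;   /* N (assumed)                */
    const unsigned long long cores    = 8;    /* M (assumed)                */
    const unsigned long long sq_depth = 64;   /* send WQEs per QP (assumed) */

    /* each of the M cores talks to roughly N*M peers => ~N*M^2 QPs/machine */
    unsigned long long qps = machines * cores * cores;

    const unsigned wqe_sizes[] = { 64, 256 }; /* e.g. no inline vs. inline 160 */
    for (int i = 0; i < 2; i++) {
        unsigned long long bytes = qps * sq_depth * wqe_sizes[i];
        printf("WQE %u B -> ~%llu MB of send queues per machine\n",
               wqe_sizes[i], bytes >> 20);
    }
    return 0;
}

With those made-up numbers the send queues grow from ~16 MB to ~64 MB per
machine just by moving to the larger WQE, which is why the cluster size
matters.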

As a result it is important to let the end user have some control over
the max inline size.
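
For comparison, here is what such per-application control looks like with
the public libibverbs API, where max_inline_data is likewise only a hint
and the driver reports back what it actually granted. This is just an
illustrative sketch of the idea, not the ND SPI that Leonid describes
below:

/* Illustrative sketch using the public libibverbs API. */
#include <infiniband/verbs.h>
#include <stdio.h>

struct ibv_qp *create_qp_with_inline_hint(struct ibv_pd *pd,
                                          struct ibv_cq *cq,
                                          uint32_t inline_hint)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr     = 64,
            .max_recv_wr     = 64,
            .max_send_sge    = 1,
            .max_recv_sge    = 1,
            .max_inline_data = inline_hint,  /* e.g. 160: just a hint */
        },
        .qp_type = IBV_QPT_RC,
    };

    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (qp)
        /* drivers typically update the cap struct with what was really
           granted, e.g. 192 for a 160-byte hint */
        printf("granted max_inline_data = %u\n", attr.cap.max_inline_data);
    return qp;
}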

Thanks
Tzachi

> -----Original Message-----
> From: Leonid Keller 
> Sent: Saturday, August 08, 2009 9:31 PM
> To: Sean Hefty; 'Fab Tillier'; Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw][patch][ND provider] Improving latency of ms-mpi
> 
> I've decreased the default value from 400 to 160 bytes to fit
> into the Blue Flame page of our cards.
> Blue Flame works only with inline data and improves latency
> by about 200 ns.
> Please note that the consumer can change this system-wide
> default value by defining an environment variable.
> Also, every application can adjust this parameter for its
> needs via the MaxInline parameter of the CreateEndpoint method.
> This latter parameter should be IN OUT, because the driver
> takes its value as a hint.
> It actually recalculates it, trying to maximize it within the
> limits of the WQE size.
> Since the WQE size is always a power of 2, a MaxInline of
> 160 will cause the WQE to be of size 256, and the returned
> MaxInlineDataSize will be 192 or more, depending on the QP type.
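> 
> As a worked example of that rounding (the 64-byte per-WQE overhead assumed
> below is only an illustration that happens to reproduce the 160 -> 256 -> 192
> numbers; the real overhead depends on the QP type):
> 
> /* Illustrative only: per-WQE control/header overhead assumed = 64 bytes. */
> unsigned round_up_inline(unsigned hint)
> {
>     const unsigned overhead = 64;      /* assumed ctrl + segment headers   */
>     unsigned wqe = 64;                 /* assumed minimal WQE stride       */
>     while (wqe < hint + overhead)      /* next power of 2 that fits hint   */
>         wqe <<= 1;
>     return wqe - overhead;             /* usable inline bytes: 160 -> 192  */
> }
> 
> (With these assumptions a 160-byte hint and a 190-byte hint both end up
> with the same 256-byte WQE and 192 usable inline bytes.)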
> 
> Using INLINE DATA spares the HCA a second DMA access to system
> memory, which improves latency noticeably (especially for short messages).
> On the other hand, too large an InlineSize value increases the WQE
> size, which slows the card down a bit (it reads the whole WQE) and
> increases memory usage.
> So we've decided that it is the application, and not the driver,
> that should decide whether or not to use the INLINE DATA facility.
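> 
> In verbs terms, that per-send decision looks roughly like the sketch below
> (libibverbs is used purely for illustration; the ND provider makes the
> equivalent choice through its own interfaces):
> 
> /* Illustrative sketch using the public libibverbs API. */
> #include <infiniband/verbs.h>
> 
> static int post_send(struct ibv_qp *qp, struct ibv_sge *sge,
>                      uint32_t max_inline)  /* value returned at QP creation */
> {
>     struct ibv_send_wr wr = {
>         .sg_list    = sge,
>         .num_sge    = 1,
>         .opcode     = IBV_WR_SEND,
>         .send_flags = IBV_SEND_SIGNALED,
>     };
>     struct ibv_send_wr *bad_wr;
> 
>     if (sge->length <= max_inline)
>         wr.send_flags |= IBV_SEND_INLINE;  /* copy payload into the WQE,
>                                               skip the extra data DMA */
> 
>     return ibv_post_send(qp, &wr, &bad_wr);
> }
> 
> (An inline send also does not require the payload to come from a
> registered buffer, which is the "avoiding a registration" side that Sean
> brings up below.)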
> 
> > -----Original Message-----
> > From: Sean Hefty [mailto:sean.hefty at intel.com]
> > Sent: Friday, August 07, 2009 12:44 AM
> > To: 'Fab Tillier'; Leonid Keller; Tzachi Dar
> > Cc: ofw at lists.openfabrics.org
> > Subject: RE: [ofw][patch][ND provider] Improving latency of ms-mpi
> > 
> > >Why would all sends have to be the same size?  The inline tradeoff is
> > >between writing a 16-byte data segment, and then doing a DMA of the
> > >data, vs. copying the data direct to the SGE.  It shouldn't matter if
> > >the sends are all the same size.  There's a point where doing the copy
> > >is more efficient than setting up the data segment.
> > 
> > If you set max inline to 400, but do 16 byte transfers, that's worse
> > than setting max inline to 16.  There's more to the cost of having a
> > larger max inline value than copying versus registering memory.  This
> > is a property of the application, not the hardware.
> > 
> > You need separate values for placing the data directly into the SGL,
> > versus avoiding a registration.
> > 
> > >It's like passing data by value or by reference.
> > 
> > If you add that the function should always take 100 parameters, then
> > I'll agree.
> > :)
> > 
> > - Sean
> > 
> > 


