[ofw][patch][ND provider] Improving latency of ms-mpi

Leonid Keller leonid at mellanox.co.il
Sat Aug 8 11:30:32 PDT 2009


I've decreased the default max inline value from 400 to 160 bytes so
that it fits into the Blue Flame page of our cards.
Blue Flame works only with inline data and improves latency by about
200 ns.
Please note that the consumer can change this system-wide default
value by defining an environment variable.
Also, every application can adjust this parameter to its needs by
setting the MaxInline parameter of the CreateEndpoint method.
This latter parameter should be IN OUT, because the driver takes its
value only as a hint.
The driver actually recalculates it, trying to maximize it within the
limits of the WQE size.
Since the WQE size is always a power of 2, a MaxInline of 160 will
result in a WQE size of 256, and the returned MaxInlineDataSize will
be 192 or more, depending on the QP type.
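
To make the arithmetic above concrete, here is a minimal sketch in C of
how a requested MaxInline hint could turn into the returned value. It
is not the driver's code. The function names and the 64-byte per-WQE
overhead are assumptions, chosen only so that the numbers match the
160 -> 256 -> 192 example; the real overhead depends on the QP type.

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the nearest power of two (1 <= v <= 2^31). */
static uint32_t round_up_pow2(uint32_t v)
{
    uint32_t p = 1;

    assert(v > 0 && v <= 0x80000000u);
    while (p < v)
        p <<= 1;
    return p;
}

/*
 * Hypothetical helper: treat 'requested' as a hint, size the WQE to the
 * next power of two that holds the inline data plus 'overhead' bytes of
 * non-data segments, and report how much inline data that WQE can hold.
 * 'overhead' is an assumed figure, not a value taken from the driver.
 */
static uint32_t effective_max_inline(uint32_t requested, uint32_t overhead,
                                     uint32_t *wqe_size)
{
    uint32_t size = round_up_pow2(requested + overhead);

    if (wqe_size)
        *wqe_size = size;
    return size - overhead;
}
```

With an assumed overhead of 64 bytes, effective_max_inline(160, 64, &wqe)
sets wqe to 256 and returns 192. A smaller overhead would return a
larger value, which is consistent with "192 or more".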

Using INLINE DATA spares the HCA a second DMA access to system memory,
which improves latency noticeably (esp. for short messages).
On the other hand, too large an InlineSize value increases the WQE
size, which slightly slows down the card (it reads the whole WQE) and
increases memory usage.
So we've decided that it is the application, not the driver, that
should decide whether to use the INLINE DATA facility.
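
As a rough illustration of that application-side decision, the sketch
below (C, with hypothetical names that belong to no real API) shows the
two quantities an application would weigh: whether a given message fits
within the MaxInline value the driver returned, and how much send queue
memory a given WQE size costs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-endpoint state kept by the application. */
struct endpoint_caps {
    size_t max_inline;  /* MaxInline value returned by the driver */
};

/* A message can go inline only if it is non-empty and fits the limit. */
static bool should_send_inline(const struct endpoint_caps *caps,
                               size_t msg_len)
{
    assert(caps != NULL);
    return msg_len > 0 && msg_len <= caps->max_inline;
}

/* Memory taken by a send queue of 'depth' WQEs of 'wqe_size' bytes. */
static size_t send_queue_bytes(size_t depth, size_t wqe_size)
{
    return depth * wqe_size;
}
```

For example, with max_inline = 192 a 16-byte send qualifies for inline
and a 193-byte send does not, while a queue of 1024 WQEs costs 256 KB at
a WQE size of 256 and 512 KB at a WQE size of 512.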

> -----Original Message-----
> From: Sean Hefty [mailto:sean.hefty at intel.com] 
> Sent: Friday, August 07, 2009 12:44 AM
> To: 'Fab Tillier'; Leonid Keller; Tzachi Dar
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [ofw][patch][ND provider] Improving latency of ms-mpi
> 
> >Why would all sends have to be the same size?  The inline tradeoff is
> >between writing a 16-byte data segment, and then doing a DMA of the
> >data, vs. copying the data direct to the SGE.  It shouldn't matter if
> >the sends are all the same size.  There's a point where doing the copy
> >is more efficient than setting up the data segment.
> 
> If you set max inline to 400, but do 16 byte transfers, 
> that's worse than setting max inline to 16.  There's more to 
> the cost of having a larger max inline value than copying 
> versus registering memory.  This is a property of the 
> application, not the hardware.
> 
> You need separate values for placing the data directly into 
> the SGL, versus avoiding a registration.
> 
> >It's like passing data by value or by reference.
> 
> If you add that the function should always take 100 
> parameters, then I'll agree.
> :)
> 
> - Sean
> 
> 


