[ofw][patch][ND provider] Improving latency of ms-mpi

Fab Tillier ftillier at microsoft.com
Thu Aug 6 14:36:27 PDT 2009


> From: Sean Hefty [mailto:sean.hefty at intel.com]
> Sent: Thursday, August 06, 2009 2:27 PM
>
>> Max inline data is not an in/out parameter to the
>> INDConnector::CreateEndpoint method.  I don't know if it makes sense to
>> have it be an input parameter. Aren't the proper tuning points
>> dependent on the HCA, rather than the app?
>
> This should be an app controlled parameter.  Using the max inline is not
> guaranteed to be faster unless all of your sends happen to be that exact
> size.

Why would all sends have to be the same size?  The inline tradeoff is between writing a 16-byte data segment and then having the HCA DMA the data, vs. copying the data directly into the WQE in place of the SGE.  It shouldn't matter whether the sends are all the same size.  There's a size below which doing the copy is more efficient than setting up the data segment.

It's like passing data by value or by reference.
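
To make the by-value/by-reference comparison concrete, here is a minimal sketch in C.  The struct layouts, the names (data_seg_t, send_wqe_t, build_send), the 64-byte inline capacity, and the inline_threshold parameter are all hypothetical and do not match any real HCA's WQE format or the ND provider's code.  The only figure taken from the discussion is the 16-byte data segment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte data segment: the "pass by reference" form. */
typedef struct {
    uint32_t byte_count;
    uint32_t lkey;
    uint64_t addr;
} data_seg_t;

/* Assumed inline capacity of this toy WQE, chosen only for illustration. */
#define WQE_INLINE_MAX 64

/* Hypothetical send WQE holding either a data segment or inline bytes. */
typedef struct {
    int      is_inline;
    uint32_t length;
    union {
        data_seg_t seg;                         /* HCA will DMA from addr */
        uint8_t    inline_data[WQE_INLINE_MAX]; /* payload copied by the CPU */
    } u;
} send_wqe_t;

/*
 * Build a send WQE.  Payloads at or below inline_threshold are copied
 * into the WQE ("by value"); larger ones are described by a data
 * segment ("by reference").  Returns 1 if the send was inlined.
 */
static int build_send(send_wqe_t *wqe, const void *buf, uint32_t len,
                      uint32_t lkey, uint32_t inline_threshold)
{
    wqe->length = len;
    if (len <= inline_threshold && len <= WQE_INLINE_MAX) {
        wqe->is_inline = 1;
        memcpy(wqe->u.inline_data, buf, len);
    } else {
        wqe->is_inline = 0;
        wqe->u.seg.byte_count = len;
        wqe->u.seg.lkey = lkey;
        wqe->u.seg.addr = (uint64_t)(uintptr_t)buf;
    }
    return wqe->is_inline;
}
```

The decision is made per send, from the size of that send alone, which is why a mix of send sizes doesn't change the argument.  Where the threshold should sit is the open question in this thread.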

> Can someone provide a list of the drawbacks of using a larger max inline
> size? I believe it increases the amount of memory required by the QP,
> and increases the size of the transfers to the HCA across the PCI bus.
> Latency benchmarks may not care, but real application performance should
> be affected.

I believe it increases the size of the send WQEs, which increases the stride between WQEs.  I have no idea how that affects things, though.

It should decrease the amount of data transferred across the PCI bus by 12 bytes (you still have a 4-byte header in the WQE to identify it as inline).  The data has to get transferred either way; you just save transferring an 8-byte address and a 4-byte LKEY.
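
The 12-byte figure can be checked with a small C sketch.  It counts only the bytes named above (the 16-byte data segment, the 4-byte inline header, and the payload).  The function names are made up for illustration, and the accounting deliberately ignores the rest of the WQE, any alignment padding, and PCI transaction overhead, all of which vary by HCA.

```c
#include <stddef.h>

enum {
    DATA_SEG_BYTES   = 16, /* 4-byte length + 4-byte LKEY + 8-byte address */
    INLINE_HDR_BYTES = 4   /* header marking the segment as inline */
};

/* Bytes moved when the WQE references the payload and the HCA DMAs it. */
static size_t bytes_by_reference(size_t payload_len)
{
    return DATA_SEG_BYTES + payload_len;
}

/* Bytes moved when the payload is carried inside the WQE. */
static size_t bytes_inline(size_t payload_len)
{
    return INLINE_HDR_BYTES + payload_len;
}
```

Under these assumptions the difference is 12 bytes at every payload size, because the payload itself crosses the bus in both cases.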

>> Assuming that the tuning points are specific to the HCA models, does it
>> make sense to always allocate 400 bytes?  Is it always faster to inline
>> 400 bytes than to DMA the data for all HCAs (InfiniHost 3 LX, EX,
>> ConnectX, etc?)  It seems to me that having the inline data controlled
>> by the HCA driver rather than the ND provider would make more sense,
>> and allow the HCA driver to optimize the sweet spot.
>
> At the very least this should be tunable, but you really want this per
> application, not as a system wide setting.

I'm not convinced.  I think inline data is a parameter of the hardware, not the app.

-Fab


