[openib-general] [PATCH for-2.6.18] IB/cma: option to limit MTU to 1K

Michael S. Tsirkin mst at mellanox.co.il
Wed Sep 13 15:01:03 PDT 2006


Quoting r. Rimmer, Todd <trimmer at silverstorm.com>:
> Subject: RE: [openib-general] [PATCH for-2.6.18] IB/cma: option to limitMTU to 1K
> 
> > From: Sean Hefty
> > Sent: Wednesday, September 13, 2006 5:23 PM
> > To: Michael S. Tsirkin
> > Cc: openib-general at openib.org
> > Subject: Re: [openib-general] [PATCH for-2.6.18] IB/cma: option to
> > limitMTU to 1K
> > 
> > Michael S. Tsirkin wrote:
> > >>Although, I don't like the idea of the CMA changing every path to
> use an
> > MTU of
> > >>1k.
> > >
> > > Well, that's why it's off by default.
> > > So, Ack?
> > 
> > I'd like to find a way to support a 1k MTU to tavor HCAs without
> making
> > the MTU
> > 1k to other HCAs, in case we're dealing with a heterogeneous
> environment.
> > 
> > Is this really the responsibility of the querying node or the SA?
> > 
> > - Sean
> > 
> 
> The real issue here is how to handle "optimization" tricks for selected
> models of HCAs.  While Tavor supports a 2K MTU and works with it, it has
> been found to offer better MPI bandwidth when running 1K MTU.  For many
> other ULPs no difference in performance is observable (because many
> other ULPs don't stress the HCA the way MPI bandwidth benchmarks do).
> 
> Another dimension to this problem is that it's not clear what the best
> optimization will be in heterogeneous environments, such as a Tavor HCA
> talking to a Sinai, Arbel, or other TCA-based device using a non-MPI
> protocol (such as a storage target).  In those environments a 2K MTU may
> perform the same (or, depending on the storage target, perhaps even
> better).

If Tavor is involved at either end, 1K MTU is better than 2K MTU.

> At this point I would suggest this is a subtle performance issue
> specific to MPI 

This is not specific to MPI. All ULPs experience this issue.

> and MPI libraries can appropriately provide options to
> tune the maximum MTU MPI to use or request (which is only one of dozens
> of MPI tunables needed to fine tune MPI).  MPI environments will tend to
> be more homogeneous which also simplifies the solution.
> 
> Pushing these types of ULP and source/destination specific issues into
> the core stack or SM will get very complex very quick.

It's actually relatively simple.

> Given the issue
> on the table (Tavor performance) is specific to an older HCA model, it
> may not even be that critical since the highest performance customers
> have long since moved toward PCIe and DDR fabrics, neither of which are
> supported by Tavor.

All the more reason to put the simple logic in one place
and not expect all applications to optimize for this hardware.
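The "simple logic in one place" argument amounts to a single off-by-default
knob in the CM that caps the negotiated path MTU. A minimal userspace sketch
of that clamping, with hypothetical names (limit_mtu_to_1k, clamp_path_mtu
are illustrative, not the identifiers from the actual patch):

```c
#include <assert.h>

/* IB MTU encodings as defined by the InfiniBand spec (values 1..5). */
enum ib_mtu {
	IB_MTU_256  = 1,
	IB_MTU_512  = 2,
	IB_MTU_1024 = 3,
	IB_MTU_2048 = 4,
	IB_MTU_4096 = 5
};

/* Would be a module parameter in the real kernel code; off by default,
 * matching the behavior described in the thread. */
static int limit_mtu_to_1k = 0;

/* Clamp the path-record MTU in one place instead of teaching every ULP
 * about Tavor's preference for a 1K MTU. */
static enum ib_mtu clamp_path_mtu(enum ib_mtu path_mtu)
{
	if (limit_mtu_to_1k && path_mtu > IB_MTU_1024)
		return IB_MTU_1024;
	return path_mtu;
}
```

With the knob off, paths are untouched; with it on, only MTUs above 1K are
reduced, so smaller MTUs negotiated for other reasons are left alone.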

-- 
MST
