[openib-general] [PATCH] osm: PathRecord prefer 1K MTU for MT23108 devices

Rimmer, Todd trimmer at silverstorm.com
Mon Sep 18 08:52:18 PDT 2006


> From: Eitan Zahavi [mailto:eitan at mellanox.co.il]
> Sent: Monday, September 18, 2006 11:20 AM
> To: Rimmer, Todd
> Cc: Or Gerlitz; Michael S. Tsirkin; OPENIB
> Subject: Re: [openib-general] [PATCH] osm: PathRecord prefer 1K MTU
> for MT23108 devices
> 
> Hi Todd,
> 
> Seems like your knowledge about the specific MTU best for the
> application (MPI) you are running is good enough that you will be able
> to include the MTU in the PathRecord request, and thus the patch
> described here will not affect your MPI at all.
> The patch only applies if your request does not provide any MTU & MTU
> SEL comp_mask.

Eitan,

The question is not about "our MPI"; rather, it is to ensure that the
OpenFabrics and OFED-included MPIs and ULPs are capable of being tuned
for optimal performance.  When a fabric runs more than one application,
it is necessary to be able to tune this at the MPI, SDP, etc. level,
not at the SM level.
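
Roughly, an explicit per-path MTU request from a kernel-side ULP could
look like the sketch below, along the lines Eitan describes (supplying
MTU and MTU SEL in the comp_mask).  It is a sketch only: field and mask
names follow <rdma/ib_sa.h>, the surrounding ib_sa_path_rec_get() call
and all error handling are left out, and mtu_selector value 2 means
"exactly this MTU" per the IBA spec.

/*
 * Sketch only: build a PathRecord query that asks the SA for an exact
 * 1K MTU path.  Names follow <rdma/ib_sa.h>; the surrounding
 * ib_sa_path_rec_get() call and all error handling are left out.
 */
#include <linux/string.h>
#include <rdma/ib_verbs.h>
#include <rdma/ib_sa.h>

static void build_1k_mtu_path_query(struct ib_sa_path_rec *rec,
                                    ib_sa_comp_mask *mask,
                                    union ib_gid *sgid, union ib_gid *dgid,
                                    __be16 pkey)
{
        memset(rec, 0, sizeof *rec);
        rec->sgid         = *sgid;
        rec->dgid         = *dgid;
        rec->pkey         = pkey;
        rec->numb_path    = 1;
        rec->mtu          = IB_MTU_1024;   /* IBA MTU encoding */
        rec->mtu_selector = 2;             /* 2 == "exactly this MTU" */

        *mask = IB_SA_PATH_REC_SGID | IB_SA_PATH_REC_DGID |
                IB_SA_PATH_REC_PKEY | IB_SA_PATH_REC_NUMB_PATH |
                IB_SA_PATH_REC_MTU  | IB_SA_PATH_REC_MTU_SELECTOR;
}

When a request carries these mask bits, the SM-side default never comes
into play for that path, which is exactly the point above.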

This patch turns on a non-standard behaviour in the SM for the entire
fabric, such that some applications will have better performance while
others will suffer.  To be complete, this patch would need to include
ULP-level tunability in all the relevant ULPs (MPI, SDP, uDAPL, etc.)
to select the "MAX MTU" to use or to request.
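
As a rough sketch of what such a knob could look like at the verbs
level (the max_mtu parameter below is hypothetical, not an option that
exists in any ULP today):

#include <rdma/ib_verbs.h>

/*
 * Hypothetical per-ULP tunable: clamp whatever MTU the PathRecord
 * returned to a configured maximum before moving the QP to RTR.
 * "max_mtu" would come from a module parameter or an MPI environment
 * variable; it does not exist in today's ULPs.
 */
static enum ib_mtu clamp_path_mtu(enum ib_mtu path_mtu, enum ib_mtu max_mtu)
{
        return path_mtu > max_mtu ? max_mtu : path_mtu;
}

static int set_rtr_mtu(struct ib_qp *qp, struct ib_qp_attr *attr,
                       enum ib_mtu path_mtu, enum ib_mtu max_mtu)
{
        attr->qp_state = IB_QPS_RTR;
        attr->path_mtu = clamp_path_mtu(path_mtu, max_mtu);
        /* dest_qp_num, rq_psn, ah_attr, etc. must also be set for RTR */
        return ib_modify_qp(qp, attr, IB_QP_STATE | IB_QP_PATH_MTU
                            /* | remaining RTR mask bits */);
}

With a tunable like this, the SM could be left alone and only the
workloads that actually benefit would request or clamp to a 1K MTU.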

This raises the question: if proper tuning requires all the ULPs to
have a configurable MAX MTU, why should the SA need to implement the
quirk at all?
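
To put rough numbers on the trade-off using the data quoted below: an
1800-byte MPI message plus roughly 50 bytes of header is about 1850
bytes of IB message, which fits in a single 2K-MTU packet but needs two
packets at 1K MTU, so 2K wins there.  A 4096-byte MPI message is about
4146 bytes of IB message, or 3 packets at 2K MTU versus 5 at 1K MTU,
yet 1K wins, presumably because the Tavor is one credit short of truly
double buffering at 2K (as noted below) and cannot keep the wire
continuously busy.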

Todd Rimmer

> >
> >Putting this in the SM alone and making it a fabric wide setting is
> >inappropriate.  The performance difference depends on application
> >message size.  Application message size can vary per ULP and/or per
> >application itself.  For example one MPI application may send mostly
> >large messages while another may send mostly small messages.  The same
> >could be true of applications for other ULPs such as uDAPL and SDP, etc.
> >
> >The root issue is that the Tavor HCA has one too few credits to truly
> >double buffer at 2K MTU.  However, at message sizes > 1K but < 2K the
> >2K MTU performs better.
> >
> >Here are some MPI bandwidth results:
> >Tavor w/ 2K MTU:
> >512             140.394173
> >1024            310.553002
> >1500            407.003858
> >1800            435.538752
> >2048            392.831026
> >4096            417.592991
> >
> >Tavor w/ 1K MTU:
> >512             140.261964
> >1024            300.789425
> >1500            379.746835
> >1800            416.726957
> >2048            425.227096
> >4096            501.442289
> >
> >Note that the message sizes shown on the left do not include MPI
> >headers.  Hence the actual IB message size is approx 50 bytes larger.
> >
> >So we see that at IB message sizes < 1024 (MPI 512 message),
> >performance is the same.
> >At IB message sizes > 1024 and < 2048 (MPI 1024-1800 messages),
> >performance is best with 2K MTU.
> >At IB message sizes > 2048 (MPI 2048-4096 messages above),
> >performance is best with 1K MTU.
> >At larger IB message sizes (MPI 4096 message), performance starts to
> >take off, and ultimately at 128K message size (not shown) the 50%
> >difference between 1K and 2K MTU reaches its peak.
> >
> >Todd Rimmer