[openib-general] [PATCH] osm: PathRecord prefer 1K MTU for MT23108 devices

Eitan Zahavi eitan at mellanox.co.il
Mon Sep 18 08:20:07 PDT 2006


Hi Todd,

It seems your knowledge of the specific MTU that is best for the
application (MPI) you are running is good enough that you can include
the MTU in the PathRecord request, so the patch described here will not
affect your MPI at all. The patch only applies if the request does not
provide an MTU and MTU_SEL comp_mask.
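For illustration, the selection rule described above could be sketched as follows. This is a hypothetical helper, not actual OpenSM code: the comp_mask flag names are placeholders (the real definitions live in OpenSM's headers), while the MTU encodings follow the IBA specification.

```c
#include <stdint.h>

/* IBA-standard path MTU encodings */
enum { IB_MTU_256 = 1, IB_MTU_512, IB_MTU_1024, IB_MTU_2048, IB_MTU_4096 };

/* Placeholder comp_mask bits, for illustration only */
#define PR_COMPMASK_MTUSELEC (1ULL << 0)
#define PR_COMPMASK_MTU      (1ULL << 1)

/* Sketch of the rule: honor an explicit MTU + selector from the requester;
 * otherwise cap the path MTU at 1K when an endpoint is an MT23108 (Tavor). */
static uint8_t select_path_mtu(uint64_t comp_mask, uint8_t requested_mtu,
                               uint8_t path_max_mtu, int endpoint_is_tavor)
{
    if ((comp_mask & PR_COMPMASK_MTUSELEC) && (comp_mask & PR_COMPMASK_MTU))
        return requested_mtu;   /* requester specified the MTU: leave it alone */
    if (endpoint_is_tavor && path_max_mtu > IB_MTU_1024)
        return IB_MTU_1024;     /* no MTU in the request: prefer 1K on Tavor */
    return path_max_mtu;        /* otherwise use the largest supported MTU */
}
```

So an MPI that knows its preferred MTU and sets both mask bits is untouched, while a request with no MTU component falls through to the 1K preference on Tavor paths.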

EZ

Rimmer, Todd wrote:

>>From: Or Gerlitz
>>Sent: Monday, September 18, 2006 5:45 AM
>>To: Michael S. Tsirkin
>>Cc: OPENIB
>>Subject: Re: [openib-general] [PATCH] osm: PathRecord prefer 1K MTU for
>>MT23108 devices
>>
>>Michael S. Tsirkin wrote:
>>
>>>Quoting r. Or Gerlitz <ogerlitz at voltaire.com>:
>>>
>>>>Eitan Zahavi wrote:
>>>>
>>>>>The following patch solves an issue with OpenSM preferring the largest
>>>>>MTU for PathRecord/MultiPathRecord for paths going to or from MT23108
>>>>>(Tavor) devices instead of using a 1K MTU, which is best for this device.
>>>>
>>>>Doesn't the 2K MTU issue with Tavor come into play only under RC QP?
>>>
>>>I don't think so, no. Tavor supports 2K MTU, but it has better
>>>performance with 1K MTU than with 2K MTU. QP type should not matter.
>>
>>Can you double-check that, please? As far as I know there is something
>>like a 40-50% BW drop with Tavor/RC/2048 vs Tavor/RC/1024, but the BW
>>with Tavor/UD/2048 is **no less** than with Tavor/UD/1024.
>>
>>So it's very common for IPoIB net device implementations to expose a 2044
>>or 1500 byte MTU to the OS, e.g. to cope with Ethernet and reduce IP
>>fragmentation/reassembly of UDP/TCP traffic.
>
>Putting this in the SM alone and making it a fabric-wide setting is
>inappropriate.  The performance difference depends on application
>message size, which can vary per ULP and/or per application.  For
>example, one MPI application may send mostly large messages while
>another may send mostly small ones.  The same can be true of
>applications over other ULPs such as uDAPL, SDP, etc.
>
>The root issue is that the Tavor HCA has one credit too few to truly
>double-buffer at 2K MTU.  However, at message sizes > 1K but < 2K the
>2K MTU performs better.
>
>Here are some MPI bandwidth results (message size in bytes on the left,
>measured bandwidth on the right):
>Tavor w/ 2K MTU:
>512             140.394173
>1024            310.553002
>1500            407.003858
>1800            435.538752
>2048            392.831026
>4096            417.592991
>
>Tavor w/ 1K MTU:
>512             140.261964
>1024            300.789425
>1500            379.746835
>1800            416.726957
>2048            425.227096
>4096            501.442289
>
>Note that the message sizes shown on the left do not include MPI
>headers; hence the actual IB message size is approximately 50 bytes
>larger.
>
>So we see that at IB message sizes < 1024 (the MPI 512 message),
>performance is the same.
>At IB message sizes > 1024 and < 2048 (MPI 1024-1800 messages),
>performance is best with 2K MTU.
>At IB message sizes > 2048 (MPI 2048-4096 messages above), performance
>is best with 1K MTU.
>At larger IB message sizes (from the MPI 4096 message up), the gap
>keeps growing, and ultimately at 128K message size (not shown) the
>difference between 1K and 2K MTU peaks at about 50%.
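The crossover in Todd's numbers is consistent with simple packet-count arithmetic: a message just over 1K needs two packets at 1K MTU but only one at 2K MTU, while at large sizes every transfer is multi-packet either way and Tavor's 2K credit shortfall dominates. A minimal sketch of the packet-count side (illustrative helper, not from the thread):

```c
/* Number of IB packets needed to carry one message at a given path MTU */
static unsigned packets_needed(unsigned msg_bytes, unsigned mtu_bytes)
{
    return (msg_bytes + mtu_bytes - 1) / mtu_bytes;  /* ceiling division */
}
```

For example, the MPI 1500-byte message (~1550 bytes on the wire with the ~50-byte header) takes two packets at 1K MTU but only one at 2K, matching the 2K advantage in the 1024-1800 range; at 4096 and above, both MTUs need multiple packets and the 1K MTU's better credit pipelining wins.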
>
>Todd Rimmer
>
>_______________________________________________
>openib-general mailing list
>openib-general at openib.org
>http://openib.org/mailman/listinfo/openib-general
>
>To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general