[ofa-general] Problems using ofed 1.4.2 and Infinipath cards

Ralph Campbell ralph.campbell at qlogic.com
Wed Aug 26 12:01:27 PDT 2009


Is your switch configured for a 4K MTU?
The default Open MPI parameter for QLogic is to use a 4K MTU.
Try using a 2K MTU with:
"mpirun -mca btl_openib_mtu 4 ..." and see if that works.


On Wed, 2009-08-26 at 02:09 -0700, Ole Widar Saastad wrote:
> I am experiencing problems using the Infinipath cards and the OFED
> stack. (details are given below). 
> 
> It seems to be a problem somewhere when the MPI packet size grows above 2k.
> That is, as I recall, where the changeover from one transport mechanism to
> another happens?
> 
> The test is easy to build and run; it is just a bandwidth program.
> (I got far better latency using the PathScale stack than the OFED one. Is this
> something that will be looked into in the newer releases?)
> 
> Two nodes in node.txt file compute-1-0 and compute-1-1. They are connected
> to a SilverStorm switch.
> 
> [olews at login-0-2 bandwidth]$ mpirun -np 2 -machinefile ./nodes.txt ./bandwidth.openmpi.x -b o
> Resolution (usec): 2.145767
> Benchmark ping-pong
> ===================
>         length     iterations   elapsed time  transfer rate        latency
>        (bytes)        (count)      (seconds)     (Mbytes/s)         (usec)
> --------------------------------------------------------------------------
>              0          10046          0.121          0.000          6.011
>              1          10261          0.124          0.166          6.026
> <cut a few lines>
>           1024           7695          0.140        112.615          9.093
>           1536           6260          0.133        144.469         10.632
>           2048           5275          0.128        168.420         12.160
> [0,1,0][btl_openib_component.c:1375:btl_openib_component_progress] from compute-1-0 to: compute-1-1 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 278309104 opcode 1
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.  
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> 
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10).  The actual timeout value used is calculated as:
> 
>      4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> --------------------------------------------------------------------------
> mpirun noticed that job rank 1 with PID 9184 on node compute-1-1 exited on signal 15 (Terminated). 
> [olews at login-0-2 bandwidth]$ 
> 
> 
> Background information :
> 
> 
> 07:00.0 InfiniBand: QLogic, Corp. InfiniPath PE-800 (rev 02)
>         Subsystem: QLogic, Corp. InfiniPath PE-800
>         Flags: bus master, fast devsel, latency 0, IRQ 66
>         Memory at fde00000 (64-bit, non-prefetchable) [size=2M]
>         Capabilities: [40] Power Management version 2
>         Capabilities: [50] Message Signalled Interrupts: 64bit+
> Queue=0/0 Enable+
>         Capabilities: [70] Express Endpoint IRQ 0
> 
> compute-1-0.local# uname -a
> Linux compute-1-0.local 2.6.18-92.1.13.el5 #1 SMP Wed Sep 24 19:32:05
> EDT 2008 x86_64 x86_64 x86_64 GNU/Linux
> compute-1-0.local# 
> 
> 
> compute-1-0.local# rpm -qa| grep ofed
> libibverbs-utils-1.1.2-1.ofed1.4.2
> librdmacm-utils-1.0.8-1.ofed1.4.2
> libcxgb3-1.2.2-1.ofed1.4.2
> ofed-scripts-1.4.2-0
> libmlx4-1.0-1.ofed1.4.2
> libibverbs-devel-1.1.2-1.ofed1.4.2
> ofed-docs-1.4.2-0
> ibvexdmtools-0.0.1-1.ofed1.4.2
> libmthca-1.0.5-1.ofed1.4.2
> libipathverbs-1.1-1.ofed1.4.2
> mstflint-1.4-1.ofed1.4.2
> libibumad-1.2.3_20090314-1.ofed1.4.2
> libnes-0.6-1.ofed1.4.2
> libibcommon-1.1.2_20090314-1.ofed1.4.2
> libibverbs-1.1.2-1.ofed1.4.2
> librdmacm-1.0.8-1.ofed1.4.2
> qlgc_vnic_daemon-0.0.1-1.ofed1.4.2
> compute-1-0.local# 
> 
> OpenMPI is :
> openmpi-1.2.8 compiled for gcc.
> 



