[ewg] 200m cable results in slower rdma read performance? [ CC Anti-Virus checked ]

Richard Croucher Richard.croucher at informatix-sol.com
Fri Oct 21 12:08:52 PDT 2011


I would expect throughput to fall off with distance once the amount of
data in flight exceeds the available buffer credits.  InfiniBand uses
buffer credits in much the same way as FibreChannel, and because most
clusters sit in close proximity the hardware provides only very limited
buffering.  I can well believe that this is measurable at 200 metres
at 40G.

Because of the speed of InfiniBand, these buffers are integrated into
the ASICs, so there is no option to enlarge them.  I believe they are
allocated in 64-byte chunks to match the transmission unit, since
InfiniBand switching is cut-through.
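
As a rough back-of-envelope illustration (my own figures, not measured):
signals propagate at roughly 5 ns per metre in fibre, so 200 m adds
about 1 us each way, or roughly 2 us before a credit comes back.  At
QDR's ~32 Gbit/s data rate that round trip holds

    32e9 bit/s x 2e-6 s  =  64,000 bits  =  ~8 KB

in flight.  If the receive buffering behind a virtual lane is smaller
than that, the sender stalls waiting for credits and the link never
reaches full rate.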

There are long-haul devices available for InfiniBand, as there are for
FibreChannel, which provide the additional buffering necessary to
support longer distances.  I've successfully run at full line rate over
hundreds of kilometres using these.
-- 

Richard Croucher
www.informatix-sol.com
+44-7802-213901 

On Fri, 2011-10-21 at 09:56 +0200, koen_segers at computacenter.com wrote:
> Hi Rupert,
> 
>  
> 
> The firmware of the HCAs has been updated to the latest stable
> version. We are still seeing the same issue.
> 
> Updating the OFED library will be more difficult. Do you really think
> that is the cause?
> 
>  
> 
> With Ethernet it is common practice to enlarge the TCP buffers for
> high-throughput or high-latency networks.
> 
> Is there something similar in InfiniBand?
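> 
> For comparison, the Ethernet-side tuning I mean is the usual Linux
> sysctl socket-buffer limits; the values below are only an example,
> not a recommendation for this fabric:
> 
>     # example only: raise the TCP socket buffer ceilings
>     sysctl -w net.core.rmem_max=16777216
>     sysctl -w net.core.wmem_max=16777216
>     sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
>     sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"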
> 
>  
> 
> Best regards,
> 
>  
> 
> 
>  
> 
> Koen Segers
> 
> Enterprise Consultant
> 
>  
> 
> Computacenter
> 
> Services & Solutions
> 
>  
> 
> Ikaroslaan 31
> 
> B-1930 Zaventem
> 
> Belgium
> 
>  
> 
> Tel: +32 2 704 94 67
> 
> Fax: +32 2 704 95 95
> 
> Mob: +32 497 909353
> 
> koen_segers at computacenter.com
> 
> www.computacenter.com/benelux
> 
> 
> 
>  
> 
> 
> From: Rupert Dance <rsdance at soft-forge.com>
> Sent: 17 October 2011 16:26
> To: <koen_segers at computacenter.com>; <ewg at lists.openfabrics.org>
> Subject: RE: [ewg] 200m cable results in slower rdma read performance?
> [ CC Anti-Virus checked ]
> 
> 
> 
>  
> 
> Koen,
> 
>  
> 
> Can you try running ibdiagnet -P all=1 -ls 10 -lw 4x
> 
>  
> 
> This will tell us whether any links are not running at a link speed
> of 10 (QDR) and a link width of 4X.
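> 
> If your infiniband-diags build includes iblinkinfo, that also gives a
> quick per-link summary (output format varies between versions):
> 
>     iblinkinfo    # check that every port reports 4X width and 10.0 Gbps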
> 
>  
> 
> You may also want to suggest upgrading OFED to 1.5.3.2 GA. There have
> been major improvements in the stack since 1.4.2. Also, please be sure
> to update the firmware on all hardware for the same reason.
> 
>  
> 
> Thanks
> 
>  
> 
> Rupert
> 
>  
> 
> 
> From: koen_segers at computacenter.com
> Sent: Monday, October 17, 2011 9:22 AM
> To: rsdance at soft-forge.com; ewg at lists.openfabrics.org
> Cc: koen_segers at computacenter.com
> Subject: RE: [ewg] 200m cable results in slower rdma read performance?
> [ CC Anti-Virus checked ]
> 
> 
> 
>  
> 
> Rupert,
> 
>  
> 
> Thanks for replying. 
> 
>  
> 
> Below is the output of the ibdiagnet command.
> 
> I don’t see any issues here. Just tell me if you need more info.
> 
>  
> 
> I forgot to mention that we are using the following switch version:
> 
> edgeprod1# version show
> 
>         version: 3.6.0
> 
>         date:    Jun 07 2011 11:19:33 AM
> 
>         build Id:857
> 
>  
> 
> And the default SLES 11 SP1 OFED build: ofed-1.4.2-0.9.6
> 
>  
> 
> Best regards,
> 
>  
> 
> 15:00:28|root at gpfsprod1n1:~ 0 # ibdiagnet
> 
> Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
> 
> -W- Topology file is not specified.
> 
>     Reports regarding cluster links will use direct routes.
> 
> Loading IBDM from: /usr/lib64/ibdm1.2
> 
> -W- A few ports of local device are up.
> 
>     Since port-num was not specified (-p option), port 1 of device 1
> will be
> 
>     used as the local port.
> 
> -I- Discovering ... 39 nodes (6 Switches & 33 CA-s) discovered.
> 
>  
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- Bad Guids/LIDs Info
> 
> -I---------------------------------------------------
> 
> -I- No bad Guids were found
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- Links With Logical State = INIT
> 
> -I---------------------------------------------------
> 
> -I- No bad Links (with logical state = INIT) were found
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- PM Counters Info
> 
> -I---------------------------------------------------
> 
> -I- No illegal PM counters values were found
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts
> list)
> 
> -I---------------------------------------------------
> 
> -I-    PKey:0x7fff Hosts:65 full:65 partial:0
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- IPoIB Subnets Check
> 
> -I---------------------------------------------------
> 
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps
> SL:0x00
> 
> -W- Suboptimal rate for group. Lowest member rate:40Gbps >
> group-rate:10Gbps
> 
>  
> 
> -I---------------------------------------------------
> 
> -I- Bad Links Info
> 
> -I- No bad link were found
> 
> -I---------------------------------------------------
> 
> ----------------------------------------------------------------
> 
> -I- Stages Status Report:
> 
>     STAGE                                    Errors Warnings
> 
>     Bad GUIDs/LIDs Check                     0      0
> 
>     Link State Active Check                  0      0
> 
>     Performance Counters Report              0      0
> 
>     Partitions Check                         0      0
> 
>     IPoIB Subnets Check                      0      1
> 
>  
> 
> Please see /tmp/ibdiagnet.log for complete log
> 
> ----------------------------------------------------------------
> 
>  
> 
> -I- Done. Run time was 5 seconds.
> 
>  
> 
> 
>  
> 
> This is the kind of information given in ibdiagnet.lst:
> 
>  
> 
> { SW Ports:24 SystemGUID:0008f10500108fa9 NodeGUID:0008f10500108fa8
>   PortGUID:0008f10500108fa8 VenID:000008F1 DevID:5A5A0000 Rev:000000A1
>   { Voltaire 4036 # edgeprod3 } LID:0001 PN:05 }
> { CA Ports:02 SystemGUID:0002c903004ab175 NodeGUID:0002c903004ab172
>   PortGUID:0002c903004ab173 VenID:000002C9 DevID:673C0000 Rev:000000B0
>   { HCA-1 } LID:001A PN:01 }
> PHY=4x LOG=ACT SPD=10
> 
>  
> 
>  
> 
>  
> 
>  
> 
> Koen Segers
> 
> Enterprise Consultant
> 
>  
> 
> Computacenter
> 
> Services & Solutions
> 
>  
> 
> Ikaroslaan 31
> 
> B-1930 Zaventem
> 
> Belgium
> 
>  
> 
> Tel: +32 2 704 94 67
> 
> Fax: +32 2 704 95 95
> 
> Mob: +32 497 909353
> 
> koen_segers at computacenter.com
> 
> www.computacenter.com/benelux
> 
> 
> 
>  
> 
> 
> From: Rupert Dance <rsdance at soft-forge.com>
> Sent: 17 October 2011 13:46
> To: <koen_segers at computacenter.com>; <ewg at lists.openfabrics.org>
> Subject: RE: [ewg] 200m cable results in slower rdma read performance?
> [ CC Anti-Virus checked ]
> 
> 
> 
>  
> 
> Hi,
> 
>  
> 
> Have you run ibdiagnet to verify that your link width and speed are
> what you expect on all links?
> 
>  
> 
> Thanks
> 
>  
> 
> Rupert Dance
> 
>  
> 
> Software Forge
> 
>  
> 
> 
> From: ewg-bounces at lists.openfabrics.org On Behalf Of
> koen_segers at computacenter.com
> Sent: Monday, October 17, 2011 3:22 AM
> To: ewg at lists.openfabrics.org
> Subject: [ewg] 200m cable results in slower rdma read performance?
> [ CC Anti-Virus checked ]
> 
> 
> 
>  
> 
> Hi, 
> 
>  
> 
> In my test setup I have 3 servers, 2 of which reside in Datacenter1
> and the other in Datacenter2.
> 
> If I run an RDMA test between the datacenters, I get much lower
> performance than when I run the same test between servers in the same
> datacenter.
> 
>  
> 
>  
> 
> DC1: gpfsprod1n1, gpfsprod1n3
> 
> DC2: gpfsprod1n2
> 
>  
> 
> 08:54:48|root at gpfsprod1n1:~ 0 # qperf -t 5  cic-gpfsprod1n2
> rc_rdma_write_bw
> 
> rc_rdma_write_bw:
> 
>     bw  =  1.9 GB/sec
> 
> 08:54:59|root at gpfsprod1n1:~ 0 # qperf -t 5  cic-gpfsprod1n3
> rc_rdma_write_bw
> 
> rc_rdma_write_bw:
> 
>     bw  =  3.39 GB/sec
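> 
> (It might also be worth comparing the latency of the two paths and
> the read side of RDMA with qperf's rc_lat and rc_rdma_read_bw tests,
> to see whether the extra round-trip delay of the 200m link lines up
> with the bandwidth drop; test names as listed in the qperf man page.)
> 
>     qperf -t 5 cic-gpfsprod1n2 rc_lat rc_rdma_read_bw
>     qperf -t 5 cic-gpfsprod1n3 rc_lat rc_rdma_read_bw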
> 
>  
> 
>  
> 
> The setup contains two pairs of edge switches (one pair in each
> datacenter) and two spine switches (one in each datacenter),
> configured as a non-blocking fat tree.
> 
> So:
> 
> The servers are connected to the edge switches.
> 
> The spine switches are connected to all edge switches.
> 
>  
> 
> These are the cables we are using:
> 
> - Length 5m   Vendor Name: WLGORE  Code: QSFP+  Vendor PN: 498385-B24   Vendor Rev: D  Vendor SN xxxx
> - Length 200m Vendor Name: MOLEX   Code: QSFP+  Vendor PN: 106410-1200  Vendor Rev: A  Vendor SN xxxx
> 
>  
> 
>  
> 
> Can someone tell me why this happens, and how I might solve it?
> 
>  
> 
> Best regards,
> 
>  
> 
>  
> 
> Koen Segers
> 
> Enterprise Consultant
> 
>  
> 
> Computacenter
> 
> Services & Solutions
> 
>  
> 
> Ikaroslaan 31
> 
> B-1930 Zaventem
> 
> Belgium
> 
>  
> 
> Tel: +32 2 704 94 67
> 
> Fax: +32 2 704 95 95
> 
> Mob: +32 497 909353
> 
> koen_segers at computacenter.com
> 
> www.computacenter.com/benelux
> 
>  
> 
> 
> 
> 
> 
> _______________________________________________
> ewg mailing list
> ewg at lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg