[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2?

Sébastien Dugué sebastien.dugue at bull.net
Wed Jun 25 07:48:55 PDT 2014


  Hi Txema,

  it's a problem with perftest, which changed its handshake mechanism
between the two versions.

  Try building the OFED 1.5 perftest and running it on the OFED 2.2 nodes
(or the other way around), and everything should work fine.
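
  To see why the two binaries cannot talk to each other: before any RDMA
traffic flows, perftest exchanges the connection parameters (LID, QPN,
PSN, RKey, VAddr) over a plain TCP socket (that is the "Data ex.
method : Ethernet" line in your output). If the two ends disagree on the
size or layout of that message, the exchange fails with "Couldn't read
remote address". A rough sketch of the idea, with a made-up wire format,
not the actual perftest code:

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Made-up wire format; the real format changed between perftest versions. */
#define MSG_FMT  "%04x:%06x:%06x:%08x:%016llx"
#define MSG_SIZE 45   /* formatted length plus terminating NUL */

struct pingpong_dest {
    unsigned int       lid;    /* local identifier assigned by the SM */
    unsigned int       qpn;    /* queue pair number                   */
    unsigned int       psn;    /* initial packet sequence number      */
    unsigned int       rkey;   /* remote key for RDMA access          */
    unsigned long long vaddr;  /* address of the registered buffer    */
};

static int send_dest(int sockfd, const struct pingpong_dest *d)
{
    char msg[MSG_SIZE];

    snprintf(msg, sizeof msg, MSG_FMT,
             d->lid, d->qpn, d->psn, d->rkey, d->vaddr);
    return write(sockfd, msg, sizeof msg) == (ssize_t)sizeof msg ? 0 : -1;
}

static int recv_dest(int sockfd, struct pingpong_dest *d)
{
    char msg[MSG_SIZE];

    /* A peer built from a different perftest version sends a message of
     * a different size and layout, so this read or the parse below
     * fails, hence "Couldn't read remote address". */
    if (read(sockfd, msg, sizeof msg) != (ssize_t)sizeof msg)
        return -1;
    if (sscanf(msg, MSG_FMT, &d->lid, &d->qpn, &d->psn,
               &d->rkey, &d->vaddr) != 5)
        return -1;
    return 0;
}

int main(void)
{
    /* Demo over a socketpair instead of a real TCP connection; the
     * values mirror the ones in your ib_read_bw output. */
    struct pingpong_dest me = { 0x26, 0x7b, 0x884509,
                                0xc0002300, 0x7fb0fddb0000ULL };
    struct pingpong_dest peer;
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv))
        return 1;
    if (send_dest(sv[0], &me) || recv_dest(sv[1], &peer))
        return 1;
    printf("remote address: LID 0x%02x QPN 0x%04x PSN 0x%06x\n",
           peer.lid, peer.qpn, peer.psn);
    return 0;
}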

  Hope this helps.

  Sébastien.

On Wed, 25 Jun 2014 16:37:29 +0200
Txema Heredia <txema.llistes at gmail.com> wrote:

> Hi all,
> 
> We have a Rocks 6.1 (RHEL 6.3) cluster (4 GPFS servers) and a Rocks 6.0
> (CentOS 6.2) cluster (28 GPFS clients), both using OFED 1.5.3 (installed
> through the Mellanox installer), and they work perfectly.
> 
> Now we are building a new cluster (Rocks 6.1.1 - CentOS 6.5, all GPFS
> clients) and we are forced to install OFED 2.2, because the Mellanox
> OFED 1.5.3 installer only supports up to CentOS/RHEL 6.4. I have been
> doing some testing with 3 nodes (not connected to GPFS yet), but I am
> running into some problems:
> 
> ibping seems to work fine:
> 
> ofed-1.5.3 emitter vs 2.2 receiver:
> [root at compute-1-11 stress]# ibping -G 0x0002c9030055a559
> Pong from compute-2-0.local.(none) (Lid 38): time 0.107 ms
> Pong from compute-2-0.local.(none) (Lid 38): time 0.099 ms
> Pong from compute-2-0.local.(none) (Lid 38): time 0.097 ms
> 
> ofed-2.2 emitter vs 2.2 receiver:
> [root at compute-2-1 stress]# ibping -G 0x0002c9030055a559
> Pong from compute-2-0.local.(none) (Lid 38): time 0.103 ms
> Pong from compute-2-0.local.(none) (Lid 38): time 0.098 ms
> Pong from compute-2-0.local.(none) (Lid 38): time 0.080 ms
> 
> (CPU load is ~12. If I raise it to ~15, the latency goes down to ~0.035
> ms in both cases.)
> 
> 
> But problems appear when trying to run ib_read_bw (or any other 
> ib_read/write_bw/lat):
> 
> ofed-2.2 emitter vs 2.2 receiver (both with high CPU-load to avoid 
> cpu-throttling):
> server:
> [root at compute-2-0 stress]# ib_read_bw
> 
> ************************************
> * Waiting for client to connect... *
> ************************************
> ---------------------------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Dual-port       : OFF          Device         : mlx4_0
>   Number of qps   : 1            Transport type : IB
>   Connection type : RC           Using SRQ      : OFF
>   CQ Moderation   : 100
>   Mtu             : 4096[B]
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>   local address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
>   remote address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>   65536      1000           3228.29            3228.26              0.051652
> ---------------------------------------------------------------------------------------
> 
> client:
> [root at compute-2-1 ~]# ib_read_bw 192.168.0.100
> ---------------------------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Dual-port       : OFF          Device         : mlx4_0
>   Number of qps   : 1            Transport type : IB
>   Connection type : RC           Using SRQ      : OFF
>   TX depth        : 128
>   CQ Moderation   : 100
>   Mtu             : 4096[B]
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>   local address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
>   remote address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>   65536      1000           3228.29            3228.26              0.051652
> ---------------------------------------------------------------------------------------
> 
> 
> 
> ofed-1.5.3 emitter vs 2.2 receiver:
> server:
> [root at compute-2-0 stress]# ib_read_bw
> 
> ************************************
> * Waiting for client to connect... *
> ************************************
> ---------------------------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Dual-port       : OFF          Device         : mlx4_0
>   Number of qps   : 1            Transport type : IB
>   Connection type : RC           Using SRQ      : OFF
>   CQ Moderation   : 100
>   Mtu             : 4096[B]
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>   local address: LID 0x26 QPN 0x007d PSN 0xb9ca21 OUT 0x10 RKey 0xc8002300 VAddr 0x007f938bdf0000
> ethernet_read_keys: Couldn't read remote address
>   Unable to read to socket/rdam_cm
> Failed to exchange data between server and clients
> 
> client:
> [root at compute-1-11 stress]# ib_read_bw 192.168.0.100
> ------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Number of qps   : 1
>   Connection type : RC
>   TX depth        : 300
>   CQ Moderation   : 50
>   Mtu             : 2048B
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ------------------------------------------------------------------
>   local address: LID 0x25 QPN 0x6c0063 PSN 0x38032d OUT 0x10 RKey 0x18002794 VAddr 0x007f5fc52ad000
> pp_read_keys: Success
> Couldn't read remote address
>   Unable to read from socket/rdam_cm
> Failed to exchange date between server and clients
> 
> As you can see, when using 1.5.3 against 2.2, neither the client nor
> the server is able to get the LID of the other node.
> That "unable to read from socket/rdam_cm" message also appears when
> trying to run ib_read_bw with both nodes on OFED 2.2 but with
> cpu-throttling active, so it seems to be the generic "I cannot do that"
> error message:
> 
> 
> 
> server:
> [root at compute-2-0 stress]# ib_read_bw
> 
> ************************************
> * Waiting for client to connect... *
> ************************************
> ---------------------------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Dual-port       : OFF          Device         : mlx4_0
>   Number of qps   : 1            Transport type : IB
>   Connection type : RC           Using SRQ      : OFF
>   CQ Moderation   : 100
>   Mtu             : 4096[B]
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>   local address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
>   remote address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> ethernet_read_keys: Couldn't read remote address
>   Unable to read to socket/rdam_cm
>   Failed to exchange data between server and clients
> 
> 
> client:
> 
> [root at compute-2-1 ~]# ib_read_bw 192.168.0.100
> ---------------------------------------------------------------------------------------
>                      RDMA_Read BW Test
>   Dual-port       : OFF          Device         : mlx4_0
>   Number of qps   : 1            Transport type : IB
>   Connection type : RC           Using SRQ      : OFF
>   TX depth        : 128
>   CQ Moderation   : 100
>   Mtu             : 4096[B]
>   Link type       : IB
>   Outstand reads  : 16
>   rdma_cm QPs     : OFF
>   Data ex. method : Ethernet
> ---------------------------------------------------------------------------------------
>   local address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
>   remote address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> Conflicting CPU frequency values detected: 1596.000000 != 2394.000000
> Can't produce a report
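> 
> (I assume perftest does something like the following to turn raw CPU
> cycle counts into times; this is a rough sketch on my part, not the
> actual perftest code. If different cores report different "cpu MHz"
> values, i.e. frequency scaling is active, cycles cannot be converted
> to seconds, so no report can be produced.)
> 
> #include <stdio.h>
> 
> /* Sketch: read the CPU frequency from /proc/cpuinfo and refuse to
>  * time anything if the cores disagree about it. */
> static double get_cpu_mhz(void)
> {
>     FILE *f = fopen("/proc/cpuinfo", "r");
>     char line[256];
>     double mhz = 0.0, m;
> 
>     if (!f)
>         return 0.0;
>     while (fgets(line, sizeof line, f)) {
>         if (sscanf(line, "cpu MHz : %lf", &m) != 1)
>             continue;
>         if (mhz != 0.0 && mhz != m) {
>             /* Cores disagree: cycle counts are meaningless. */
>             printf("Conflicting CPU frequency values detected: "
>                    "%f != %f\n", mhz, m);
>             fclose(f);
>             return 0.0;
>         }
>         mhz = m;
>     }
>     fclose(f);
>     return mhz;   /* seconds = cycles / (mhz * 1e6) */
> }
> 
> int main(void)
> {
>     if (get_cpu_mhz() == 0.0) {
>         printf("Can't produce a report\n");
>         return 1;
>     }
>     return 0;
> }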
> 
> 
> 
> 
> 
> Is there any reported compatibility problem between nodes running OFED
> 1.5.3 and OFED 2.2? Can they coexist on the same InfiniBand network? Can
> they communicate properly? Or is this just a problem of different
> versions of the testing binaries (perftest-2.2-0.14 vs perftest-1.3.0-0.56)?
> Is there any other test I can run that is unaffected by this?
> 
> Thanks in advance,
> 
> Txema
> 
> PS: I am not very InfiniBand-savvy, so I am probably misusing some terms.
> _______________________________________________
> Users mailing list
> Users at lists.openfabrics.org
> http://lists.openfabrics.org/mailman/listinfo/users


