[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?
Txema Heredia
txema.llistes at gmail.com
Wed Jun 25 07:37:29 PDT 2014
Hi all,
We have a Rocks 6.1 (RHEL 6.3) cluster (4 GPFS servers) and a Rocks 6.0
(CentOS 6.2) cluster (28 GPFS clients), both running OFED 1.5.3
(installed through the Mellanox installer), and they work perfectly.
Now we are building a new cluster (Rocks 6.1.1 - CentOS 6.5, all GPFS
clients) and are forced to install OFED 2.2, because the Mellanox
OFED 1.5.3 installer only supports up to CentOS/RHEL 6.4. I have been
doing some testing with 3 nodes (not connected to GPFS yet), but I am
running into some problems:
ibping seems to work fine:
ofed-1.5.3 emitter vs 2.2 receiver:
[root at compute-1-11 stress]# ibping -G 0x0002c9030055a559
Pong from compute-2-0.local.(none) (Lid 38): time 0.107 ms
Pong from compute-2-0.local.(none) (Lid 38): time 0.099 ms
Pong from compute-2-0.local.(none) (Lid 38): time 0.097 ms
ofed-2.2 emitter vs 2.2 receiver:
[root at compute-2-1 stress]# ibping -G 0x0002c9030055a559
Pong from compute-2-0.local.(none) (Lid 38): time 0.103 ms
Pong from compute-2-0.local.(none) (Lid 38): time 0.098 ms
Pong from compute-2-0.local.(none) (Lid 38): time 0.080 ms
(CPU load is ~12. If I raise it to ~15, the latency drops to ~0.035 ms
in both cases.)
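For completeness, ibping only answers if its responder is running on the pong node; a quick sanity-check sketch, assuming the stock infiniband-diags ibping (the 38 below is just the decimal LID that the Pong replies above report for compute-2-0):

```shell
# On the receiving node: start the ibping responder (runs until killed)
ibping -S

# On the emitting node: ping by LID instead of port GUID as a cross-check
ibping 38
```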
But problems appear when I try to run ib_read_bw (or any other
ib_read/write_bw/lat tool):
ofed-2.2 emitter vs 2.2 receiver (both under high CPU load to avoid
CPU frequency throttling):
server:
[root at compute-2-0 stress]# ib_read_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
remote address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
---------------------------------------------------------------------------------------
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
65536       1000         3228.29           3228.26           0.051652
---------------------------------------------------------------------------------------
client:
[root at compute-2-1 ~]# ib_read_bw 192.168.0.100
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
remote address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
---------------------------------------------------------------------------------------
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
65536       1000         3228.29           3228.26           0.051652
---------------------------------------------------------------------------------------
ofed-1.5.3 emitter vs 2.2 receiver:
server:
[root at compute-2-0 stress]# ib_read_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x26 QPN 0x007d PSN 0xb9ca21 OUT 0x10 RKey 0xc8002300 VAddr 0x007f938bdf0000
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdam_cm
Failed to exchange data between server and clients
client:
[root at compute-1-11 stress]# ib_read_bw 192.168.0.100
------------------------------------------------------------------
RDMA_Read BW Test
Number of qps : 1
Connection type : RC
TX depth : 300
CQ Moderation : 50
Mtu : 2048B
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
------------------------------------------------------------------
local address: LID 0x25 QPN 0x6c0063 PSN 0x38032d OUT 0x10 RKey 0x18002794 VAddr 0x007f5fc52ad000
pp_read_keys: Success
Couldn't read remote address
Unable to read from socket/rdam_cm
Failed to exchange date between server and clients
As you can see, when mixing 1.5.3 and 2.2, neither the client nor the
server is able to read the remote connection parameters (LID, QPN, PSN)
of the other node.
That "Unable to read from socket/rdam_cm" message also appears when
running ib_read_bw with both nodes on OFED 2.2 but with CPU throttling
active, so it seems to be the tool's generic failure message:
server:
[root at compute-2-0 stress]# ib_read_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
remote address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
---------------------------------------------------------------------------------------
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdam_cm
Failed to exchange data between server and clients
client:
[root at compute-2-1 ~]# ib_read_bw 192.168.0.100
---------------------------------------------------------------------------------------
RDMA_Read BW Test
Dual-port : OFF Device : mlx4_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
TX depth : 128
CQ Moderation : 100
Mtu : 4096[B]
Link type : IB
Outstand reads : 16
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
remote address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
---------------------------------------------------------------------------------------
#bytes  #iterations  BW peak[MB/sec]  BW average[MB/sec]  MsgRate[Mpps]
Conflicting CPU frequency values detected: 1596.000000 != 2394.000000
Can't produce a report
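The "Conflicting CPU frequency values" error comes from perftest timing transfers with CPU cycle counters, which breaks if cores change frequency mid-run. A sketch of two common workarounds, assuming the cpufreq sysfs interface is available on these CentOS 6 nodes (perftest's -F flag tells it not to abort on frequency conflicts):

```shell
# Pin all cores to the "performance" governor so cycle-counter timing
# is not skewed by frequency scaling (needs root):
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Alternatively, tell ib_read_bw to ignore the frequency conflict:
ib_read_bw -F                  # server side
ib_read_bw -F 192.168.0.100    # client side
```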
Is there any reported compatibility problem between nodes running OFED
1.5.3 and OFED 2.2? Can they coexist on the same InfiniBand network? Can
they communicate properly? Or is this just a problem of different
versions of the testing binaries (perftest-2.2-0.14 vs perftest-1.3.0-0.56)?
Is there any other test I can run that is unaffected by this?
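For reference, the "Data ex. method : Ethernet" line suggests the two perftest generations exchange their connection parameters over a plain TCP side channel, and my (unverified) guess is that they use incompatible formats, so the handshake fails before any RDMA traffic starts. A couple of alternative checks that do not depend on the perftest handshake, assuming qperf and the libibverbs examples are installed on both ends:

```shell
# libibverbs example: simple RC ping-pong over verbs
ibv_rc_pingpong                   # on the server
ibv_rc_pingpong 192.168.0.100     # on the client

# qperf: a single binary that manages its own control channel
qperf                             # on the server
qperf 192.168.0.100 rc_bw rc_lat  # on the client
```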
Thanks in advance,
Txema
PS: I am not very InfiniBand-savvy, so I am probably misusing some terms.