[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?

Txema Heredia txema.llistes at gmail.com
Thu Jun 26 07:01:59 PDT 2014


Thanks Sébastien!

I was worried because, when some of my colleagues tried to add some nodes 
running OFED 2.2 to the GPFS cluster, something went wrong and the whole 
InfiniBand network collapsed. That's why I was wary of the change.
Right now I am adding a couple of 2.2 nodes to the GPFS cluster to check 
whether the problem was due to OFED or to some other misconfiguration. 
I'll report back if I detect any problem.


As for ofed, I have a couple of questions:

- Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my 
GPFS servers? Should I wait? Should I keep them on 1.5.3? Would keeping 
them on 1.5.3 cause problems in the future?

- Why is the Mellanox OFED installer changing my ib0 MAC address? When 
kickstarting the node, the ib0 MAC address is 
80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after 
installing the Mellanox drivers it changes to 
a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. As a result, 
the node does not bring up the ib0 interface and starts the GPFS service 
over Ethernet until you manually run "ifup ib0" on the node.

- Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3 
installation generates the following file:

# cat /etc/modprobe.d/mlx4_en.conf
install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en
# Configure Flow Control
# pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
# pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
options mlx4_core pfctx=0 pfcrx=0

(whose last line we later change to "options mlx4_core pfctx=0 pfcrx=0 
log_num_mtt=20 log_mtts_per_seg=4" for GPFS memory considerations)
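
To make the memory consideration concrete, here is a small sketch of the 
arithmetic behind those two parameters. It uses the formula commonly quoted 
for mlx4 (for example in the Open MPI FAQ): registrable memory = 
2^log_num_mtt * 2^log_mtts_per_seg * page size. The 4 KiB page size and the 
function name are assumptions made for illustration, not something taken 
from the installer.

```python
def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registrable memory for the given mlx4_core options.

    page_size defaults to 4 KiB, which is an assumption about the host.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


GIB = 1 << 30

# Values from the modified "options mlx4_core ..." line.
limit = max_registrable_bytes(log_num_mtt=20, log_mtts_per_seg=4)
print(limit // GIB, "GiB")
```

Under those assumptions, log_num_mtt=20 with log_mtts_per_seg=4 allows up 
to 64 GiB of registered memory.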


But OFED 2.2 generates this file instead:

# cat /etc/modprobe.d/mlnx.conf
# Module parameters for MLNX_OFED kernel modules
blacklist mlx4_core
blacklist mlx4_en
blacklist mlx5_core
blacklist mlx5_ib

Should I add the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20 
log_mtts_per_seg=4" line here? Or should I add it for mlx5_core? Aren't 
those modules blacklisted?
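
Coming back to the ib0 address in the second question: an IPoIB hardware 
address is 20 bytes, where the first byte carries flags, the next three 
the queue pair number (QPN) and the last 16 the port GID (see RFC 4391). 
The sketch below splits the two addresses quoted above along those 
boundaries. The helper name split_ipoib_hwaddr is made up for this 
example.

```python
def split_ipoib_hwaddr(addr):
    """Split a 20-octet IPoIB hardware address into (flags, qpn, gid_hex)."""
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError("expected 20 octets, got %d" % len(octets))
    flags = octets[0]                          # 1 byte of flags
    qpn = int.from_bytes(octets[1:4], "big")   # 24-bit queue pair number
    gid = octets[4:].hex()                     # 16-byte port GID
    return flags, qpn, gid


before = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9"
after = "a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9"

for label, addr in (("kickstart", before), ("MLNX_OFED", after)):
    flags, qpn, gid = split_ipoib_hwaddr(addr)
    print("%-10s flags=0x%02x qpn=0x%06x gid=%s" % (label, flags, qpn, gid))
```

Only the flags and QPN differ between the two addresses; the GID part is 
identical. If the interface configuration matches on the full 20-byte 
address, that difference alone could stop it from matching, but that is a 
guess and not something verified here.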


Thanks in advance,

Txema

PS: As stated before, I apologize for not "speaking" infiniband, nor 
modules.




On 25/06/14 16:48, Sébastien Dugué wrote:
>    Hi Txema,
>
>    it's a problem with perftest, whose handshake mechanism changed between
> the two versions.
>
>    Try building an OFED 1.5 perftest to run on OFED 2.2, or the other
> way around, and everything works fine.
>
>    Hope this helps.
>
>    Sébastien.
>
> On Wed, 25 Jun 2014 16:37:29 +0200
> Txema Heredia <txema.llistes at gmail.com> wrote:
>
>> Hi all,
>>
>> We have a rocks-cluster 6.1 (RHEL 6.3) cluster (4 GPFS servers) and a
>> rocks-cluster 6.0 (CentOS 6.2) cluster (28 GPFS clients), both using
>> OFED 1.5.3 (installed through mellanox installer) and they work perfectly.
>>
>> Now, we are building a new cluster (rocks 6.1.1 - CentOS 6.5, all GPFS
>> clients) and we are forced to install OFED 2.2, because the mellanox
>> OFED 1.5.3 installer supports only up to CentOS/RHEL 6.4. I have been
>> doing some testing with 3 nodes (not connected to GPFS yet), but I am
>> having some problems:
>>
>> ibping seems to work fine:
>>
>> ofed-1.5.3 emitter vs 2.2 receiver:
>> [root@compute-1-11 stress]# ibping -G 0x0002c9030055a559
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.107 ms
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.099 ms
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.097 ms
>>
>> ofed-2.2 emitter vs 2.2 receiver:
>> [root@compute-2-1 stress]# ibping -G 0x0002c9030055a559
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.103 ms
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.098 ms
>> Pong from compute-2-0.local.(none) (Lid 38): time 0.080 ms
>>
>> (cpu load is ~12. If I raise it to ~15, the latency goes down to ~0.035
>> ms in both cases)
>>
>>
>> But problems appear when trying to run ib_read_bw (or any other
>> ib_read/write_bw/lat):
>>
>> ofed-2.2 emitter vs 2.2 receiver (both with high CPU-load to avoid
>> cpu-throttling):
>> server:
>> [root@compute-2-0 stress]# ib_read_bw
>>
>> ************************************
>> * Waiting for client to connect... *
>> ************************************
>> ---------------------------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Dual-port       : OFF          Device         : mlx4_0
>>    Number of qps   : 1            Transport type : IB
>>    Connection type : RC           Using SRQ      : OFF
>>    CQ Moderation   : 100
>>    Mtu             : 4096[B]
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>    local address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
>>    remote address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
>> ---------------------------------------------------------------------------------------
>>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>>    65536      1000           3228.29            3228.26              0.051652
>> ---------------------------------------------------------------------------------------
>>
>> client:
>> [root@compute-2-1 ~]# ib_read_bw 192.168.0.100
>> ---------------------------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Dual-port       : OFF          Device         : mlx4_0
>>    Number of qps   : 1            Transport type : IB
>>    Connection type : RC           Using SRQ      : OFF
>>    TX depth        : 128
>>    CQ Moderation   : 100
>>    Mtu             : 4096[B]
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>    local address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
>>    remote address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
>> ---------------------------------------------------------------------------------------
>>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>>    65536      1000           3228.29            3228.26              0.051652
>> ---------------------------------------------------------------------------------------
>>
>>
>>
>> ofed-1.5.3 emitter vs 2.2 receiver
>> server
>> [root@compute-2-0 stress]# ib_read_bw
>>
>> ************************************
>> * Waiting for client to connect... *
>> ************************************
>> ---------------------------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Dual-port       : OFF          Device         : mlx4_0
>>    Number of qps   : 1            Transport type : IB
>>    Connection type : RC           Using SRQ      : OFF
>>    CQ Moderation   : 100
>>    Mtu             : 4096[B]
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>    local address: LID 0x26 QPN 0x007d PSN 0xb9ca21 OUT 0x10 RKey 0xc8002300 VAddr 0x007f938bdf0000
>> ethernet_read_keys: Couldn't read remote address
>>    Unable to read to socket/rdam_cm
>> Failed to exchange data between server and clients
>>
>> client:
>> [root@compute-1-11 stress]# ib_read_bw 192.168.0.100
>> ------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Number of qps   : 1
>>    Connection type : RC
>>    TX depth        : 300
>>    CQ Moderation   : 50
>>    Mtu             : 2048B
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ------------------------------------------------------------------
>>    local address: LID 0x25 QPN 0x6c0063 PSN 0x38032d OUT 0x10 RKey 0x18002794 VAddr 0x007f5fc52ad000
>> pp_read_keys: Success
>> Couldn't read remote address
>>    Unable to read from socket/rdam_cm
>> Failed to exchange date between server and clients
>>
>> As you can see, with 1.5.3 vs 2.2, neither the client nor the
>> server is able to get the LID of the other node.
>> That "unable to read from socket/rdam_cm" message also appears when
>> running ib_read_bw between two OFED 2.2 nodes with cpu-throttling
>> enabled, so it seems to be the default "I cannot do that" message:
>>
>>
>>
>> server
>> [root@compute-2-0 stress]# ib_read_bw
>>
>> ************************************
>> * Waiting for client to connect... *
>> ************************************
>> ---------------------------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Dual-port       : OFF          Device         : mlx4_0
>>    Number of qps   : 1            Transport type : IB
>>    Connection type : RC           Using SRQ      : OFF
>>    CQ Moderation   : 100
>>    Mtu             : 4096[B]
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>    local address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
>>    remote address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
>> ---------------------------------------------------------------------------------------
>>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>> ethernet_read_keys: Couldn't read remote address
>>    Unable to read to socket/rdam_cm
>>    Failed to exchange data between server and clients
>>
>>
>> client:
>>
>> [root@compute-2-1 ~]# ib_read_bw 192.168.0.100
>> ---------------------------------------------------------------------------------------
>>                       RDMA_Read BW Test
>>    Dual-port       : OFF          Device         : mlx4_0
>>    Number of qps   : 1            Transport type : IB
>>    Connection type : RC           Using SRQ      : OFF
>>    TX depth        : 128
>>    CQ Moderation   : 100
>>    Mtu             : 4096[B]
>>    Link type       : IB
>>    Outstand reads  : 16
>>    rdma_cm QPs     : OFF
>>    Data ex. method : Ethernet
>> ---------------------------------------------------------------------------------------
>>    local address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
>>    remote address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
>> ---------------------------------------------------------------------------------------
>>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>> Conflicting CPU frequency values detected: 1596.000000 != 2394.000000
>> Can't produce a report
>>
>>
>>
>>
>>
>> Is there any reported compatibility problem between nodes with OFED
>> 1.5.3 and OFED 2.2? Can they coexist in the same InfiniBand network? Can
>> they communicate properly? Or is this just a problem of different
>> versions of the testing binaries (perftest-2.2-0.14 vs perftest-1.3.0-0.56)?
>> Is there any other test I can run that is not affected by this?
>>
>> Thanks in advance,
>>
>> Txema
>>
>> PS: I am not very infiniband-savvy, so probably I am misusing some terms.



