[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?

Txema Heredia txema.llistes at gmail.com
Fri Jun 27 06:28:18 PDT 2014


On 27/06/14 11:16, Sébastien Dugué wrote:
> Hi Txema,
>
> On Thu, 26 Jun 2014 16:01:59 +0200
> Txema Heredia <txema.llistes at gmail.com> wrote:
>
>> Thanks Sébastien!
>>
>> I was worried because, when some of my colleagues tried to add to the
>> GPFS cluster some nodes using ofed 2.2, something went wrong and the
>> whole infiniband network collapsed. That's why I was wary of the change.
>> Right now I am adding a couple of 2.2 nodes to the GPFS to check if the
>> problem was due to ofed or some other misconfiguration. I'll report back
>> if I detect any problem.
>>
>>
>> As for ofed, I have a couple of questions:
>>
>> - Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my
>> gpfs servers? Should I wait? Should I keep them on 1.5.3? Would that
>> cause problems in the future?
>    I can only speak concerning the community OFED. In fact we're migrating from
> 1.5.4.1 to 3.12 which are very roughly equivalent to mlnx 1.5.3 and mlnx 2.2 and
> so far, it's only a matter of un-installing the old one and installing the
> new one with a few kernel module parameter changes.
>
>> - Why is the mellanox ofed installer changing my ib0 mac address?? When
>> kickstarting the node, the ib0 mac address is
>> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after
>> installing the mellanox drivers it changes to
>> a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. This
>> prevents the node from bringing up the ib0 interface, so the GPFS service
>> starts over ethernet until you manually "ifup ib0" on the node.
>    It's not the installer that changes the HW address.
>
>    An IPoIB address is constructed as follows:
>
>    80		flags (bit 7 = Connected Mode)
>    00:00:48	QP number (which may change if the ipoib module is reloaded)
>    fe:80...	Port GID
>
>    Concerning the change of the flags from 80 to a0, I've no idea what flags bit 5
> (0x20) means (this is defined in the standard ipoib driver).
>
>    As to the QP number, you must be prepared for it to possibly change if you
> reload the ipoib driver after the system has booted and some QPs have been created.
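To make the layout described above concrete, here is a minimal sketch that splits the two ib0 addresses from this thread into flags, QP number and port GID. The function name parse_ipoib_hwaddr is made up for illustration; the field boundaries follow the description above (1 octet of flags, 3 octets of QP number, 16 octets of GID).

```python
def parse_ipoib_hwaddr(addr: str) -> dict:
    """Split a 20-octet IPoIB hardware address into flags, QPN and port GID."""
    octets = [int(part, 16) for part in addr.split(":")]
    if len(octets) != 20:
        raise ValueError(f"expected 20 octets, got {len(octets)}")
    return {
        "flags": octets[0],                                # first octet
        "qpn": int.from_bytes(bytes(octets[1:4]), "big"),  # next 3 octets
        "gid": bytes(octets[4:]).hex(),                    # remaining 16 octets
    }


# The two addresses quoted in this thread (before and after the driver install).
before = parse_ipoib_hwaddr(
    "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")
after = parse_ipoib_hwaddr(
    "a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")

print(hex(before["flags"] ^ after["flags"]))   # flag bits that differ: 0x20
print(hex(before["qpn"]), hex(after["qpn"]))   # 0x48 0x100
print(before["gid"] == after["gid"])           # True: the port GID is unchanged
```

So only the flags octet and the QP number differ between the two addresses; the GID part, which identifies the port, is the same.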
>
>> - Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3
>> installation generates the following file:
>>
>> # cat /etc/modprobe.d/mlx4_en.conf
>> install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq
>> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
>> mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
>> "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
>> modprobe mlx4_en; fi; else modprobe mlx4_en; fi
>> install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq
>> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
>> mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
>> "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
>> /sbin/sysctl_perf_tuning load; fi; fi
>> remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r
>> --ignore-remove mlx4_en
>> # Configure Flow Control
>> # pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit
>> mask (uint)
>> # pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit
>> mask (uint)
>> options mlx4_core pfctx=0 pfcrx=0
>>
>> (whose last line we later modify to  "options mlx4_core pfctx=0 pfcrx=0
>> log_num_mtt=20 log_mtts_per_seg=4" for gpfs memory considerations)
>    The pfctx and pfcrx are MLNX OFED specific and I have no idea what they do.
> On the other hand, the log_xxx parameters make sense to allow registering lots of
> memory. However, with the newest OFED, log_num_mtt no longer exists,
> as the tuning is done automatically in the driver to allow registering twice
> the size of the physical memory (if I remember correctly).
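As a rough illustration of why those log_xxx values matter, here is a small sketch using the commonly cited rule of thumb for mlx4 (an assumption, it is not stated in this thread): registrable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size. The helper name and the 4096-byte page size are also assumptions.

```python
def max_registerable_bytes(log_num_mtt: int, log_mtts_per_seg: int,
                           page_size: int = 4096) -> int:
    """Rule-of-thumb upper bound on registrable memory (assumed formula)."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


# Values from the options line used on the 1.5.3 nodes in this thread.
limit = max_registerable_bytes(log_num_mtt=20, log_mtts_per_seg=4)
print(limit // 2**30, "GiB")  # 64 GiB
```

Under that assumption, log_num_mtt=20 with log_mtts_per_seg=4 corresponds to 64 GiB of registrable memory with 4 KiB pages.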
>
>>
>> But ofed-2.2 leaves the file like this:
>>
>> # cat /etc/modprobe.d/mlnx.conf
>> # Module parameters for MLNX_OFED kernel modules
>> blacklist mlx4_core
>> blacklist mlx4_en
>> blacklist mlx5_core
>> blacklist mlx5_ib
>>
>> Should I add here the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20
>> log_mtts_per_seg=4" line? Or should I add it to mlx5_core? Aren't they
>> blacklisted?
>    First, mlx4 is for ConnectX[1-3] devices and mlx5 is for Connect-IB devices. From
> your description I suppose you have ConnectX devices, so you can forget about mlx5.
>
>    Do not add log_num_mtt (it will prevent the driver from loading); you can keep
> log_mtts_per_seg if it helps. However, I have no idea concerning pfctx=0 pfcrx=0.
>
>    What does 'modinfo mlx4_core' give? If those pfctx and pfcrx are listed then
> you can probably keep them.
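The check suggested above can be sketched as a small filter: given an "options" line and the set of parameter names the installed module accepts, keep only the parameters that are supported. The function name and the example set of supported names below are hypothetical; the real list has to come from the parameters reported by modinfo for mlx4_core on the node.

```python
def filter_options(options_line: str, supported: set) -> tuple:
    """Return (cleaned options line, dropped parameters)."""
    tokens = options_line.split()
    if len(tokens) < 2 or tokens[0] != "options":
        raise ValueError("expected a line of the form: options <module> k=v ...")
    head, params = tokens[:2], tokens[2:]
    kept = [p for p in params if p.split("=", 1)[0] in supported]
    dropped = [p for p in params if p.split("=", 1)[0] not in supported]
    return " ".join(head + kept), dropped


# Hypothetical set of parameter names; replace with what modinfo reports.
supported_params = {"pfctx", "pfcrx", "log_mtts_per_seg"}

line = "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20 log_mtts_per_seg=4"
cleaned, dropped = filter_options(line, supported_params)
print(cleaned)   # options mlx4_core pfctx=0 pfcrx=0 log_mtts_per_seg=4
print(dropped)   # ['log_num_mtt=20']
```

Dropping unsupported parameters before writing the file matters here because, as noted above, an unknown parameter prevents the driver from loading.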
>
>    Hope this helps,
>
>    Sébastien.
>
>>
>> Thanks in advance,
>>
>> Txema
>>
>> PS: As stated before, I apologize for not "speaking" infiniband, nor
>> modules.
>>
>>
>>
>>
>> On 25/06/14 16:48, Sébastien Dugué wrote:
>>>     Hi Txema,
>>>
>>>     the problem is with perftest, which changed its handshake mechanism between
>>> the 2 versions.
>>>
>>>     If you build an OFED 1.5 perftest to run on the 2.2 OFED, or the other
>>> way around, everything works fine.
>>>
>>>     Hope this helps.
>>>
>>>     Sébastien.
>>>
>>> On Wed, 25 Jun 2014 16:37:29 +0200
>>> Txema Heredia <txema.llistes at gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have a rocks-cluster 6.1 (RHEL 6.3) cluster (4 GPFS servers) and a
>>>> rocks-cluster 6.0 (CentOS 6.2) cluster (28 GPFS clients), both using
>>>> OFED 1.5.3 (installed through mellanox installer) and they work perfectly.
>>>>
>>>> Now, we are building a new cluster (rocks 6.1.1 - CentOS 6.5, all GPFS
>>>> clients) and we are forced to install OFED 2.2, because the mellanox
>>>> OFED 1.5.3 installer supports only up to CentOS/RHEL 6.4. I have been
>>>> doing some testing with 3 nodes (not connected to GPFS yet), but I am
>>>> having some problems:
>>>>
>>>> ibping seems to work fine:
>>>>
>>>> ofed-1.5.3 emitter vs 2.2 receiver:
>>>> [root at compute-1-11 stress]# ibping -G 0x0002c9030055a559
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.107 ms
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.099 ms
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.097 ms
>>>>
>>>> ofed-2.2 emitter vs 2.2 receiver:
>>>> [root at compute-2-1 stress]# ibping -G 0x0002c9030055a559
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.103 ms
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.098 ms
>>>> Pong from compute-2-0.local.(none) (Lid 38): time 0.080 ms
>>>>
>>>> (cpu load is ~12. If I raise it to ~15, the latency goes down to ~0.035
>>>> ms in both cases)
>>>>
>>>>
>>>> But problems appear when trying to run ib_read_bw (or any other
>>>> ib_read/write_bw/lat):
>>>>
>>>> ofed-2.2 emitter vs 2.2 receiver (both with high CPU-load to avoid
>>>> cpu-throttling):
>>>> server:
>>>> [root at compute-2-0 stress]# ib_read_bw
>>>>
>>>> ************************************
>>>> * Waiting for client to connect... *
>>>> ************************************
>>>> ---------------------------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Dual-port       : OFF          Device         : mlx4_0
>>>>     Number of qps   : 1            Transport type : IB
>>>>     Connection type : RC           Using SRQ      : OFF
>>>>     CQ Moderation   : 100
>>>>     Mtu             : 4096[B]
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ---------------------------------------------------------------------------------------
>>>>     local address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey
>>>> 0xc0002300 VAddr 0x007fb0fddb0000
>>>>     remote address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey
>>>> 0x78002300 VAddr 0x007f4f8bc10000
>>>> ---------------------------------------------------------------------------------------
>>>>     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
>>>> MsgRate[Mpps]
>>>>     65536      1000           3228.29            3228.26 0.051652
>>>> ---------------------------------------------------------------------------------------
>>>>
>>>> client:
>>>> [root at compute-2-1 ~]# ib_read_bw 192.168.0.100
>>>> ---------------------------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Dual-port       : OFF          Device         : mlx4_0
>>>>     Number of qps   : 1            Transport type : IB
>>>>     Connection type : RC           Using SRQ      : OFF
>>>>     TX depth        : 128
>>>>     CQ Moderation   : 100
>>>>     Mtu             : 4096[B]
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ---------------------------------------------------------------------------------------
>>>>     local address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey
>>>> 0x78002300 VAddr 0x007f4f8bc10000
>>>>     remote address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey
>>>> 0xc0002300 VAddr 0x007fb0fddb0000
>>>> ---------------------------------------------------------------------------------------
>>>>     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
>>>> MsgRate[Mpps]
>>>>     65536      1000           3228.29            3228.26 0.051652
>>>> ---------------------------------------------------------------------------------------
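As a sanity check on the report above, the MsgRate column can be reproduced from the average bandwidth and the message size. This sketch assumes that "MB" in the perftest output means 2^20 bytes, which is the interpretation that makes the two columns agree; the function name is made up for illustration.

```python
def msg_rate_mpps(bw_mb_per_sec: float, msg_bytes: int, mb: int = 2**20) -> float:
    """Messages per second, in millions, for a given bandwidth and message size."""
    return bw_mb_per_sec * mb / msg_bytes / 1e6


# Values from the 2.2 vs 2.2 run: 65536-byte messages at 3228.26 MB/sec average.
rate = msg_rate_mpps(3228.26, 65536)
print(f"{rate:.6f}")  # 0.051652
```

With MB taken as 10^6 bytes the result would be about 0.049259 instead, which does not match the reported 0.051652.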
>>>>
>>>>
>>>>
>>>> ofed-1.5.3 emitter vs 2.2 receiver
>>>> server
>>>> [root at compute-2-0 stress]# ib_read_bw
>>>>
>>>> ************************************
>>>> * Waiting for client to connect... *
>>>> ************************************
>>>> ---------------------------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Dual-port       : OFF          Device         : mlx4_0
>>>>     Number of qps   : 1            Transport type : IB
>>>>     Connection type : RC           Using SRQ      : OFF
>>>>     CQ Moderation   : 100
>>>>     Mtu             : 4096[B]
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ---------------------------------------------------------------------------------------
>>>>     local address: LID 0x26 QPN 0x007d PSN 0xb9ca21 OUT 0x10 RKey
>>>> 0xc8002300 VAddr 0x007f938bdf0000
>>>> ethernet_read_keys: Couldn't read remote address
>>>>     Unable to read to socket/rdam_cm
>>>> Failed to exchange data between server and clients
>>>>
>>>> client:
>>>> [root at compute-1-11 stress]# ib_read_bw 192.168.0.100
>>>> ------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Number of qps   : 1
>>>>     Connection type : RC
>>>>     TX depth        : 300
>>>>     CQ Moderation   : 50
>>>>     Mtu             : 2048B
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ------------------------------------------------------------------
>>>>     local address: LID 0x25 QPN 0x6c0063 PSN 0x38032d OUT 0x10 RKey
>>>> 0x18002794 VAddr 0x007f5fc52ad000
>>>> pp_read_keys: Success
>>>> Couldn't read remote address
>>>>     Unable to read from socket/rdam_cm
>>>> Failed to exchange date between server and clients
>>>>
>>>> As you can see, when using 1.5.3 vs 2.2, neither the client nor the
>>>> server is able to get the LID of the other node.
>>>> That "unable to read from socket/rdam_cm" message also appears when
>>>> trying to run ib_read_bw with both nodes on ofed 2.2 but with
>>>> cpu-throttling, so it seems to be the default "I cannot do that" message:
>>>>
>>>>
>>>>
>>>> server
>>>> [root at compute-2-0 stress]# ib_read_bw
>>>>
>>>> ************************************
>>>> * Waiting for client to connect... *
>>>> ************************************
>>>> ---------------------------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Dual-port       : OFF          Device         : mlx4_0
>>>>     Number of qps   : 1            Transport type : IB
>>>>     Connection type : RC           Using SRQ      : OFF
>>>>     CQ Moderation   : 100
>>>>     Mtu             : 4096[B]
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ---------------------------------------------------------------------------------------
>>>>     local address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey
>>>> 0xd0002300 VAddr 0x007f2b0f290000
>>>>     remote address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey
>>>> 0x80002300 VAddr 0x007f3945810000
>>>> ---------------------------------------------------------------------------------------
>>>>     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
>>>> MsgRate[Mpps]
>>>> ethernet_read_keys: Couldn't read remote address
>>>>     Unable to read to socket/rdam_cm
>>>>     Failed to exchange data between server and clients
>>>>
>>>>
>>>> client:
>>>>
>>>> [root at compute-2-1 ~]# ib_read_bw 192.168.0.100
>>>> ---------------------------------------------------------------------------------------
>>>>                        RDMA_Read BW Test
>>>>     Dual-port       : OFF          Device         : mlx4_0
>>>>     Number of qps   : 1            Transport type : IB
>>>>     Connection type : RC           Using SRQ      : OFF
>>>>     TX depth        : 128
>>>>     CQ Moderation   : 100
>>>>     Mtu             : 4096[B]
>>>>     Link type       : IB
>>>>     Outstand reads  : 16
>>>>     rdma_cm QPs     : OFF
>>>>     Data ex. method : Ethernet
>>>> ---------------------------------------------------------------------------------------
>>>>     local address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey
>>>> 0x80002300 VAddr 0x007f3945810000
>>>>     remote address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey
>>>> 0xd0002300 VAddr 0x007f2b0f290000
>>>> ---------------------------------------------------------------------------------------
>>>>     #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
>>>> MsgRate[Mpps]
>>>> Conflicting CPU frequency values detected: 1596.000000 != 2394.000000
>>>> Can't produce a report
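The "Conflicting CPU frequency values" message above suggests the cores were reporting different clock speeds at that moment. A minimal sketch of a similar check, run on text in /proc/cpuinfo format, is shown here. The 1% tolerance and both function names are assumptions for illustration, not perftest's actual logic, and the sample text is made up to mirror the two values in the message.

```python
def cpu_mhz_values(cpuinfo_text: str) -> list:
    """Collect every 'cpu MHz' value from /proc/cpuinfo-style text."""
    values = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "cpu MHz":
            values.append(float(value))
    return values


def frequencies_conflict(values: list, tolerance: float = 0.01) -> bool:
    """True if the spread between cores exceeds the (assumed) tolerance."""
    if not values:
        return False
    lo, hi = min(values), max(values)
    return (hi - lo) / hi > tolerance


# Made-up sample mirroring the two frequencies in the error message.
sample = (
    "processor\t: 0\ncpu MHz\t\t: 1596.000\n"
    "processor\t: 1\ncpu MHz\t\t: 2394.000\n"
)
print(cpu_mhz_values(sample))                        # [1596.0, 2394.0]
print(frequencies_conflict(cpu_mhz_values(sample)))  # True
```

On a real node the text would come from reading /proc/cpuinfo; running the check while the benchmark is idle and again under load shows whether frequency scaling is still active.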
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Is there any reported compatibility problem between nodes with OFED
>>>> 1.5.3 and OFED 2.2? Can they coexist in the same infiniband network? Can
>>>> they communicate properly? Or is this just a problem of different
>>>> versions of the testing binaries (perftest-2.2-0.14 vs perftest-1.3.0-0.56)?
>>>> Is there any other test I can run impervious to this?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Txema
>>>>
>>>> PS: I am not very infiniband-savvy, so probably I am misusing some terms.