[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?

Sébastien Dugué sebastien.dugue at bull.net
Fri Jun 27 02:16:35 PDT 2014


Hi Txema,

On Thu, 26 Jun 2014 16:01:59 +0200
Txema Heredia <txema.llistes at gmail.com> wrote:

> Thanks Sébastien!
> 
> I was worried because, when some of my colleagues tried to add to the 
> GPFS cluster some nodes using ofed 2.2, something went wrong and the 
> whole infiniband network collapsed. That's why I was wary of the change.
> Right now I am adding a couple of 2.2 nodes to the GPFS to check if the 
> problem was due to ofed or some other misconfiguration. I'll report back 
> if I detect any problem.
> 
> 
> As for ofed, I have a couple of questions:
> 
> - Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my 
> gpfs servers? Should I wait? Should I keep them on 1.5.3? Would that 
> cause problems in the future?

  I can only speak for the community OFED. We are currently migrating from
1.5.4.1 to 3.12, which are very roughly equivalent to mlnx 1.5.3 and mlnx 2.2.
So far it has only been a matter of uninstalling the old release and installing
the new one, with a few kernel module parameter changes.

> 
> - Why is the mellanox ofed installer changing my ib0 mac address?? When 
> kickstarting the node, the ib0 mac address is 
> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after 
> installing the mellanox drivers it changes to 
> a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. As a result, 
> the node does not bring up the ib0 interface and starts the GPFS service 
> over ethernet until you manually run "ifup ib0" on the node.

  It's not the installer that changes the HW address.

  An IPoIB address is constructed as follows:

  80		flags (bit 7 = Connected Mode)
  00:00:48	QP number (which may change if the ipoib module is reloaded)
  fe:80...	Port GID

  Concerning the change of the flags from 80 to a0, I've no idea what flags bit 5
(0x20) means (the flags are defined in the standard ipoib driver).

  As to the QP number, be prepared for it to change if you reload the ipoib
driver after the system has booted and some QPs have been created.
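
  To make the layout above concrete, here is a minimal Python sketch that splits
a 20-byte IPoIB hardware address into those three fields. The function name and
the dictionary keys are made up for illustration, and the only flag bit it
interprets is the 0x80 connected-mode bit mentioned above; anything else in the
flags byte is simply reported as unknown.

```python
# Sketch only: field names and the 0x80 "connected mode" mask follow the
# description in this mail; nothing here is read from the ipoib driver.

def parse_ipoib_hwaddr(addr: str) -> dict:
    """Split a colon-separated 20-byte IPoIB address into flags, QPN and GID."""
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError(f"expected 20 octets, got {len(octets)}")
    flags = octets[0]
    return {
        "flags": flags,
        "connected_mode": bool(flags & 0x80),
        # Bits other than 0x80 are not interpreted, just reported.
        "unknown_flag_bits": flags & ~0x80 & 0xFF,
        "qpn": int.from_bytes(octets[1:4], "big"),
        "gid": octets[4:].hex(),
    }


# The two addresses quoted earlier in this thread.
before = parse_ipoib_hwaddr(
    "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")
after = parse_ipoib_hwaddr(
    "a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")

for name, fields in (("before", before), ("after", after)):
    print(name, hex(fields["flags"]), hex(fields["qpn"]), fields["gid"])
```

  Run against the two addresses quoted above, it shows that the Port GID is
identical and that only the flags byte (80 -> a0) and the QP number
(0x48 -> 0x100) differ.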

> 
> - Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3 
> installation generates the following file:
> 
> # cat /etc/modprobe.d/mlx4_en.conf
> install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq 
> "^allow_unsupported_modules") && echo '--allow-unsupported-modules') 
> mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q 
> "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then 
> modprobe mlx4_en; fi; else modprobe mlx4_en; fi
> install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq 
> "^allow_unsupported_modules") && echo '--allow-unsupported-modules') 
> mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q 
> "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then 
> /sbin/sysctl_perf_tuning load; fi; fi
> remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r 
> --ignore-remove mlx4_en
> # Configure Flow Control
> # pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit 
> mask (uint)
> # pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit 
> mask (uint)
> options mlx4_core pfctx=0 pfcrx=0
> 
> (whose last line we later modify to  "options mlx4_core pfctx=0 pfcrx=0 
> log_num_mtt=20 log_mtts_per_seg=4" for gpfs memory considerations)

  The pfctx and pfcrx parameters are MLNX OFED specific and I have no idea what
they do. On the other hand, the log_xxx parameters make sense to allow
registering lots of memory. However, with the newest OFED, log_num_mtt no longer
exists, as the tuning is done automatically in the driver to allow registering
twice the size of the physical memory (if I remember correctly).
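
  For the sizing itself, a commonly quoted rule of thumb for mlx4 (an assumption
here, it is not stated anywhere in this thread) is that the registerable memory
is 2^log_num_mtt * 2^log_mtts_per_seg * page size. A small Python sketch of that
arithmetic, assuming 4 KiB pages and a made-up function name:

```python
# Sketch of the rule of thumb above; the formula and the 4 KiB page size
# are assumptions, not values taken from the driver source.

def max_registerable_bytes(log_num_mtt: int,
                           log_mtts_per_seg: int,
                           page_size: int = 4096) -> int:
    """Upper bound on registerable memory under the assumed formula."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


limit = max_registerable_bytes(log_num_mtt=20, log_mtts_per_seg=4)
print(limit // 2**30, "GiB")
```

  With the values from your 1.5.3 file (log_num_mtt=20, log_mtts_per_seg=4) that
rule gives 64 GiB, which you can compare with the physical memory of the node.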

> 
> 
> But ofed-2.2 leaves the file like this:
> 
> # cat /etc/modprobe.d/mlnx.conf
> # Module parameters for MLNX_OFED kernel modules
> blacklist mlx4_core
> blacklist mlx4_en
> blacklist mlx5_core
> blacklist mlx5_ib
> 
> Should I add here the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20 
> log_mtts_per_seg=4" line? Or should I add it to mlx5_core? Aren't they 
> blacklisted?

  First, mlx4 is for ConnectX[1-3] devices and mlx5 is for Connect-IB devices.
From your description I suppose you have ConnectX devices, so you can forget
about mlx5.

  Do not add log_num_mtt (it will prevent the driver from loading); you can keep
log_mtts_per_seg if it helps. However, I have no idea concerning pfctx=0 pfcrx=0.

  What does 'modinfo mlx4_core' give? If pfctx and pfcrx are listed, then
you can probably keep them.
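
  One way to do that check mechanically is sketched below in Python: it takes the
text printed by 'modinfo -p mlx4_core' (one 'name:description' line per
parameter) and splits an options line into parameters the module lists and ones
it does not. The helper names and the sample text are hypothetical; the
pfctx/pfcrx lines are copied from the comments in your 1.5.3 file and the
log_mtts_per_seg line is a placeholder, so run modinfo on a real node rather
than trusting this sample.

```python
def supported_params(modinfo_p_output: str) -> set:
    """Collect parameter names from 'modinfo -p <module>' style text."""
    names = set()
    for line in modinfo_p_output.splitlines():
        name, sep, _description = line.partition(":")
        if sep and name.strip():
            names.add(name.strip())
    return names


def split_options(options: str, supported: set):
    """Return (keep, drop) lists of 'name=value' tokens."""
    keep, drop = [], []
    for token in options.split():
        name = token.split("=", 1)[0]
        (keep if name in supported else drop).append(token)
    return keep, drop


# Hypothetical sample text; on a real node this would be the output of
# running 'modinfo -p mlx4_core'.
sample = """\
pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
log_mtts_per_seg:placeholder description (int)
"""

keep, drop = split_options(
    "pfctx=0 pfcrx=0 log_num_mtt=20 log_mtts_per_seg=4",
    supported_params(sample))
print("keep:", " ".join(keep))
print("drop:", " ".join(drop))
```

  Anything that ends up in the "drop" list should be left out of the options
line.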

  Hope this helps,

  Sébastien.

> 
> 
> Thanks in advance,
> 
> Txema
> 
> PS: As stated before, I apologize for not "speaking" infiniband, nor 
> modules.
> 
> 
> 
> 
> El 25/06/14 16:48, Sébastien Dugué escribió:
> >    Hi Txema,
> >
> >    it's a problem with perftest, whose handshake mechanism changed between
> > the 2 versions.
> >
> >    Try building the OFED 1.5 perftest to run on the OFED 2.2 nodes, or the
> > other way around, and everything works fine.
> >
> >    Hope this helps.
> >
> >    Sébastien.
> >
> > On Wed, 25 Jun 2014 16:37:29 +0200
> > Txema Heredia <txema.llistes at gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> We have a rocks-cluster 6.1 (RHEL 6.3) cluster (4 GPFS servers) and a
> >> rocks-cluster 6.0 (CentOS 6.2) cluster (28 GPFS clients), both using
> >> OFED 1.5.3 (installed through mellanox installer) and they work perfectly.
> >>
> >> Now, we are building a new cluster (rocks 6.1.1 - CentOS 6.5, all GPFS
> >> clients) and we are forced to install OFED 2.2, because the mellanox
> >> OFED 1.5.3 installer supports only up to CentOS/RHEL 6.4. I have been
> >> doing some testing with 3 nodes (not connected to GPFS yet), but I am
> >> having some problems:
> >>
> >> ibping seems to work fine:
> >>
> >> ofed-1.5.3 emitter vs 2.2 receiver:
> >> [root@compute-1-11 stress]# ibping -G 0x0002c9030055a559
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.107 ms
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.099 ms
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.097 ms
> >>
> >> ofed-2.2 emitter vs 2.2 receiver:
> >> [root@compute-2-1 stress]# ibping -G 0x0002c9030055a559
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.103 ms
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.098 ms
> >> Pong from compute-2-0.local.(none) (Lid 38): time 0.080 ms
> >>
> >> (cpu load is ~12. If I raise it to ~15, the latency goes down to ~0.035
> >> ms in both cases)
> >>
> >>
> >> But problems appear when trying to run ib_read_bw (or any other
> >> ib_read/write_bw/lat):
> >>
> >> ofed-2.2 emitter vs 2.2 receiver (both with high CPU-load to avoid
> >> cpu-throttling):
> >> server:
> >> [root@compute-2-0 stress]# ib_read_bw
> >>
> >> ************************************
> >> * Waiting for client to connect... *
> >> ************************************
> >> ---------------------------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Dual-port       : OFF          Device         : mlx4_0
> >>    Number of qps   : 1            Transport type : IB
> >>    Connection type : RC           Using SRQ      : OFF
> >>    CQ Moderation   : 100
> >>    Mtu             : 4096[B]
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ---------------------------------------------------------------------------------------
> >>    local address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
> >>    remote address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
> >> ---------------------------------------------------------------------------------------
> >>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >>    65536      1000           3228.29            3228.26               0.051652
> >> ---------------------------------------------------------------------------------------
> >>
> >> client:
> >> [root@compute-2-1 ~]# ib_read_bw 192.168.0.100
> >> ---------------------------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Dual-port       : OFF          Device         : mlx4_0
> >>    Number of qps   : 1            Transport type : IB
> >>    Connection type : RC           Using SRQ      : OFF
> >>    TX depth        : 128
> >>    CQ Moderation   : 100
> >>    Mtu             : 4096[B]
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ---------------------------------------------------------------------------------------
> >>    local address: LID 0x27 QPN 0x0078 PSN 0x3c5ae6 OUT 0x10 RKey 0x78002300 VAddr 0x007f4f8bc10000
> >>    remote address: LID 0x26 QPN 0x007b PSN 0x884509 OUT 0x10 RKey 0xc0002300 VAddr 0x007fb0fddb0000
> >> ---------------------------------------------------------------------------------------
> >>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >>    65536      1000           3228.29            3228.26               0.051652
> >> ---------------------------------------------------------------------------------------
> >>
> >>
> >>
> >> ofed-1.5.3 emitter vs 2.2 receiver
> >> server
> >> [root@compute-2-0 stress]# ib_read_bw
> >>
> >> ************************************
> >> * Waiting for client to connect... *
> >> ************************************
> >> ---------------------------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Dual-port       : OFF          Device         : mlx4_0
> >>    Number of qps   : 1            Transport type : IB
> >>    Connection type : RC           Using SRQ      : OFF
> >>    CQ Moderation   : 100
> >>    Mtu             : 4096[B]
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ---------------------------------------------------------------------------------------
> >>    local address: LID 0x26 QPN 0x007d PSN 0xb9ca21 OUT 0x10 RKey 0xc8002300 VAddr 0x007f938bdf0000
> >> ethernet_read_keys: Couldn't read remote address
> >>    Unable to read to socket/rdam_cm
> >> Failed to exchange data between server and clients
> >>
> >> client:
> >> [root@compute-1-11 stress]# ib_read_bw 192.168.0.100
> >> ------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Number of qps   : 1
> >>    Connection type : RC
> >>    TX depth        : 300
> >>    CQ Moderation   : 50
> >>    Mtu             : 2048B
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ------------------------------------------------------------------
> >>    local address: LID 0x25 QPN 0x6c0063 PSN 0x38032d OUT 0x10 RKey 0x18002794 VAddr 0x007f5fc52ad000
> >> pp_read_keys: Success
> >> Couldn't read remote address
> >>    Unable to read from socket/rdam_cm
> >> Failed to exchange date between server and clients
> >>
> >> As you can see, when using 1.5.3 vs 2.2, neither the client nor the
> >> server is able to get the LID of the other node.
> >> That "unable to read from socket/rdam_cm" message also appears when
> >> trying to run ib_read_bw using both nodes ofed 2.2, but with
> >> cpu-throttling, so it seems the default "I cannot do that" message:
> >>
> >>
> >>
> >> server
> >> [root@compute-2-0 stress]# ib_read_bw
> >>
> >> ************************************
> >> * Waiting for client to connect... *
> >> ************************************
> >> ---------------------------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Dual-port       : OFF          Device         : mlx4_0
> >>    Number of qps   : 1            Transport type : IB
> >>    Connection type : RC           Using SRQ      : OFF
> >>    CQ Moderation   : 100
> >>    Mtu             : 4096[B]
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ---------------------------------------------------------------------------------------
> >>    local address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
> >>    remote address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
> >> ---------------------------------------------------------------------------------------
> >>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >> ethernet_read_keys: Couldn't read remote address
> >>    Unable to read to socket/rdam_cm
> >>    Failed to exchange data between server and clients
> >>
> >>
> >> client:
> >>
> >> [root@compute-2-1 ~]# ib_read_bw 192.168.0.100
> >> ---------------------------------------------------------------------------------------
> >>                       RDMA_Read BW Test
> >>    Dual-port       : OFF          Device         : mlx4_0
> >>    Number of qps   : 1            Transport type : IB
> >>    Connection type : RC           Using SRQ      : OFF
> >>    TX depth        : 128
> >>    CQ Moderation   : 100
> >>    Mtu             : 4096[B]
> >>    Link type       : IB
> >>    Outstand reads  : 16
> >>    rdma_cm QPs     : OFF
> >>    Data ex. method : Ethernet
> >> ---------------------------------------------------------------------------------------
> >>    local address: LID 0x27 QPN 0x0079 PSN 0xa56976 OUT 0x10 RKey 0x80002300 VAddr 0x007f3945810000
> >>    remote address: LID 0x26 QPN 0x007e PSN 0x5a12d8 OUT 0x10 RKey 0xd0002300 VAddr 0x007f2b0f290000
> >> ---------------------------------------------------------------------------------------
> >>    #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]    MsgRate[Mpps]
> >> Conflicting CPU frequency values detected: 1596.000000 != 2394.000000
> >> Can't produce a report
> >>
> >>
> >>
> >>
> >>
> >> Is there any reported compatibility problem between nodes with OFED
> >> 1.5.3 and OFED 2.2? Can they coexist in the same infiniband network? Can
> >> they communicate properly? Or is this just a problem of different
> >> versions of the testing binaries (perftest-2.2-0.14 vs perftest-1.3.0-0.56)?
> >> Is there any other test I can run impervious to this?
> >>
> >> Thanks in advance,
> >>
> >> Txema
> >>
> >> PS: I am not very infiniband-savvy, so probably I am misusing some terms.
> 


