[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?

Sébastien Dugué sebastien.dugue at bull.net
Fri Jun 27 09:29:23 PDT 2014


On Fri, 27 Jun 2014 16:08:07 +0200
Txema Heredia <txema.llistes at gmail.com> wrote:

> On 27/06/14 11:16, Sébastien Dugué wrote:
> > Hi Txema,
> >
> > On Thu, 26 Jun 2014 16:01:59 +0200
> > Txema Heredia <txema.llistes at gmail.com> wrote:
> >
> >> Thanks Sébastien!
> >>
> >> I was worried because, when some of my colleagues tried to add some
> >> nodes using ofed 2.2 to the GPFS cluster, something went wrong and the
> >> whole infiniband network collapsed. That's why I was wary of the change.
> >> Right now I am adding a couple of 2.2 nodes to the GPFS to check if the
> >> problem was due to ofed or some other misconfiguration. I'll report back
> >> if I detect any problem.
> >>
> >>
> >> As for ofed, I have a couple of questions:
> >>
> >> - Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my
> >> gpfs servers? Should I wait? Should I keep them on 1.5.3? Would that
> >> cause problems in the future?
> >    I can only speak for the community OFED. We're migrating from
> > 1.5.4.1 to 3.12, which are very roughly equivalent to mlnx 1.5.3 and mlnx 2.2,
> > and so far it's only a matter of uninstalling the old one and installing the
> > new one with a few kernel module parameter changes.
> >
> >> - Why is the mellanox ofed installer changing my ib0 mac address?? When
> >> kickstarting the node, the ib0 mac address is
> >> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after
> >> installing the mellanox drivers it changes to
> >> a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. As a result
> >> the node does not bring up the ib0 interface and starts the GPFS service
> >> over ethernet, until you manually "ifup ib0" on the node.
> >    It's not the installer that changes the HW address.
> >
> >    An IPoIB address is constructed as follows:
> >
> >    80		flags (bit 7 = Connected Mode)
> >    00:00:48	QP number (which may change if the ipoib module is reloaded)
> >    fe:80...	Port GID
> >
> >    Concerning the change of the flags from 80 to a0, I've no idea what flags bit 2
> > means (this is defined in the standard ipoib driver).
> >
> >    As to the QP number, be prepared for it to change if you
> > reload the ipoib driver after the system has booted and some QPs have been created.
> >
> >> - Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3
> >> installation generates the following file:
> >>
> >> # cat /etc/modprobe.d/mlx4_en.conf
> >> install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq
> >> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
> >> mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
> >> "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
> >> modprobe mlx4_en; fi; else modprobe mlx4_en; fi
> >> install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq
> >> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
> >> mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
> >> "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
> >> /sbin/sysctl_perf_tuning load; fi; fi
> >> remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r
> >> --ignore-remove mlx4_en
> >> # Configure Flow Control
> >> # pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit
> >> mask (uint)
> >> # pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit
> >> mask (uint)
> >> options mlx4_core pfctx=0 pfcrx=0
> >>
> >> (whose last line we later modify to  "options mlx4_core pfctx=0 pfcrx=0
> >> log_num_mtt=20 log_mtts_per_seg=4" for gpfs memory considerations)
> >    The pfctx and pfcrx parameters are MLNX OFED specific and I have no idea what they do.
> > On the other hand, the log_xxx parameters make sense to allow registering lots of
> > memory. However, with the newest OFED, log_num_mtt no longer exists,
> > as the tuning is done automatically in the driver to allow registering twice
> > the size of the physical memory (if I remember correctly).
> >
> >>
> >> But ofed-2.2 leaves the file like this:
> >>
> >> # cat /etc/modprobe.d/mlnx.conf
> >> # Module parameters for MLNX_OFED kernel modules
> >> blacklist mlx4_core
> >> blacklist mlx4_en
> >> blacklist mlx5_core
> >> blacklist mlx5_ib
> >>
> >> Should I add here the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20
> >> log_mtts_per_seg=4" line? Or should I add it to mlx5_core? Aren't they
> >> blacklisted?
> >    First, mlx4 is for ConnectX[1-3] devices and mlx5 is for Connect-IB devices. From
> > your description I suppose you have ConnectX devices, so you can forget about mlx5.
> >
> >    Do not add log_num_mtt (it will prevent the driver from loading); you can keep
> > log_mtts_per_seg if it helps. I have no idea concerning pfctx=0 pfcrx=0, though.
> >
> >    What does 'modinfo mlx4_core' give? If pfctx and pfcrx are listed, then
> > you can probably keep them.
> >
> >    Hope this helps,
> >
> >    Sébastien.
> 
> Hi Sébastien,
> 
> It seems lots of things are going on here.
> 
> First of all, the modinfos:
> 
> On the mellanox ofed 1.5.3 node:
> 
> 
> [root@compute-1-11 ~]# modinfo mlx4_en
> filename: 
> /lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_en.ko
> version:        1.5.7 (Nov 2011)
> license:        Dual BSD/GPL
> description:    Mellanox ConnectX HCA Ethernet driver
> author:         Liran Liss, Yevgeny Petrilin
> srcversion:     52D43E38AA89B6F12BDB95F
> alias:          pci:v000015B3d0000100Fsv*sd*bc*sc*i*
> ...
> alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
> depends:        mlx4_core
> vermagic:       2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
> parm:           inline_thold:treshold for using inline data (int)
> parm:           num_rx_rings:Total number of RX Rings (default 16, range 
> 1-16, power of 2) (uint)
> parm:           udp_rss:Enable RSS for incomming UDP traffic or disabled 
> (0) (bool)
> parm:           num_lro:Number of LRO sessions per ring or disabled (0) 
> (uint)
> parm:           use_tx_polling:Use polling for TX processing (default 1) 
> (bool)
> parm:           enable_sys_tune:Tune the cpu's for better performance 
> (default 0) (bool)
> 
> 
> [root@compute-1-11 ~]# modinfo mlx4_core
> filename: 
> /lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_core.ko
> version:        1.0-mlnx_ofed1.5.3
> license:        Dual BSD/GPL
> description:    Mellanox ConnectX HCA low-level driver
> author:         Roland Dreier
> srcversion:     B261CBCA522DDF6A81AA2D6
> alias:          pci:v000015B3d0000100Fsv*sd*bc*sc*i*
> ...
> alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
> depends:
> vermagic:       2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
> parm:           set_4k_mtu:attempt to set 4K MTU to all ConnectX ports (int)
> parm:           pfctx:Priority based Flow Control policy on TX[7:0]. Per 
> priority bit mask (uint)
> parm:           pfcrx:Priority based Flow Control policy on RX[7:0]. Per 
> priority bit mask (uint)
> parm:           debug_level:Enable debug tracing if > 0 (int)
> parm:           block_loopback:Block multicast loopback packets if > 0 (int)
> parm:           msi_x:attempt to use MSI-X if nonzero (int)
> parm:           high_rate_steer:Enable steering mode for higher packet 
> rate (default off) (int)
> parm:           sr_iov:enable #sr_iov functions if sr_iov > 0 (int)
> parm:           probe_vf:number of vfs to probe by pf driver (sr_iov > 
> 0) (int)
> parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
> parm:           use_prio:Enable steering by VLAN priority on ETH ports 
> (0/1, default 0) (bool)
> parm:           fast_drop:Enable fast packet drop when no recieve WQEs 
> are posted (int)
> parm:           log_num_qp:log maximum number of QPs per HCA (int)
> parm:           log_num_srq:log maximum number of SRQs per HCA (int)
> parm:           log_rdmarc_per_qp:log number of RDMARC buffers per QP (int)
> parm:           log_num_cq:log maximum number of CQs per HCA (int)
> parm:           log_num_mcg:log maximum number of multicast groups per 
> HCA (int)
> parm:           log_num_mpt:log maximum number of memory protection 
> table entries per HCA (int)
> parm:           log_num_mtt:log maximum number of memory translation 
> table segments per HCA (int)
> parm:           log_mtts_per_seg:Log2 number of MTT entries per segment 
> (0-7) (int)
> parm:           enable_qos:Enable Quality of Service support in the HCA 
> (default: off) (bool)
> parm:           enable_pre_t11_mode:For FCoXX, enable pre-t11 mode if 
> non-zero (default: 0) (int)
> parm:           internal_err_reset:Reset device on internal errors if 
> non-zero (default 1) (int)
> 
> 
> 
> 
> And the mellanox ofed 2.2 node:
> 
> [root@compute-2-1 ~]# modinfo mlx4_en
> filename: 
> /lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
> version:        2.2-1.0.0 (Jun 23 2014)
> license:        Dual BSD/GPL
> description:    Mellanox ConnectX HCA Ethernet driver
> author:         Liran Liss, Yevgeny Petrilin
> srcversion:     D7067BE4EB268A8A2D19B64
> ??? NO ALIAS HERE ???
> depends:        mlx4_core,compat,ptp
> vermagic:       2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
> parm:           inline_thold:threshold for using inline data (uint)
> parm:           udp_rss:Enable RSS for incoming UDP traffic (uint)
> parm:           num_lro:Dummy module parameter to prevent loading issues 
> (uint)
> parm:           pfctx:Priority based Flow Control policy on TX[7:0]. Per 
> priority bit mask (uint)
> parm:           pfcrx:Priority based Flow Control policy on RX[7:0]. Per 
> priority bit mask (uint)
> 
> [root@compute-2-1 ~]# modinfo mlx4_core
> filename: 
> /lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
> version:        1.1
> license:        Dual BSD/GPL
> description:    Mellanox ConnectX HCA low-level driver
> author:         Roland Dreier
> srcversion:     9A90DAE92A2E75BF5F67A24
> alias:          pci:v000015B3d00001010sv*sd*bc*sc*i*
> ...
> alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
> depends:        compat
> vermagic:       2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
> parm:           set_4k_mtu:(Obsolete) attempt to set 4K MTU to all 
> ConnectX ports (int)
> parm:           debug_level:Enable debug tracing if > 0 (int)
> parm:           msi_x:0 - don't use MSI-X, 1 - use MSI-X, >1 - limit 
> number of MSI-X irqs to msi_x (non-SRIOV only) (int)
> parm:           enable_sys_tune:Tune the cpu's for better performance 
> (default 0) (int)
> parm:           block_loopback:Block multicast loopback packets if > 0 
> (default: 1) (int)
> parm:           num_vfs:Either single value (e.g. '5') or triplet (e.g. 
> '10,11,12') to define uniform num_vfs value for all devices functions.
>                  If a single value is given, this value will be used in 
> order to define <num_vfs> dual ports virtual functions.
>                  If a triplet <a,b,c> is given, <a> single port virtual 
> functions are defined on port1, <b> single port
>                  virtual functions are defined on port2 and <c> dual 
> port virtual functions are defined.
>                  Alternatively, a string to map device function numbers 
> to their num_vfs values
>                   (e.g. '0000:04:00.0-5,002b:1c:0b.a-15;2;4') could be 
> given.
>                  Hexadecimal digits for the device function (e.g. 
> 002b:1c:0b.a) and decimal or triplet for num_vfs value
>                  (e.g. 15 or 1;2;3). (string)
> parm:           probe_vf:Either single value (e.g. '3') or triplet (e.g 
> '1,2,3') to define uniform number of VFs to probe by the pf
>                  driver for all devices functions.
>                  If a single value is given, this value will be used in 
> order to define <probe_vf> probed dual ports virtual
>                  functions. If a triplet <a,b,c> is given, <a> single 
> port virtual functions are probed on port1, <b> single port
>                  virtual functions are probed on port2 and <c> dual port 
> virtual functions are probed.
>                  Alternatively, a string to map device function numbers 
> to their probe_vf values
>                  (e.g. '0000:04:00.0-3,002b:1c:0b.a-13;12;11') could be 
> given.
>                  Hexadecimal digits for the device function (e.g. 
> 002b:1c:0b.a) and decimal for probe_vf value (e.g. 13 or 1;2;3). (string)
> parm:           log_num_mgm_entry_size:log mgm size, that defines the 
> num of qp per mcg, for example: 10 gives 248.range: 7 <= 
> log_num_mgm_entry_size <= 12. To activate device managed flow steering 
> when available, set to -1 (int)
> parm:           high_rate_steer:Enable steering mode for higher packet 
> rate (default off) (int)
> parm:           fast_drop:Enable fast packet drop when no recieve WQEs 
> are posted (int)
> parm:           enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the the 
> FW supports this if non-zero (default: 1) (int)
> parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
> parm:           log_num_vlan:(Obsolete) Log2 max number of VLANs per ETH 
> port (0-7) (int)
> parm:           log_mtts_per_seg:Log2 number of MTT entries per segment 
> (0-7) (default: 0) (int)
> parm:           port_type_array:Either pair of values (e.g. '1,2') to 
> define uniform port1/port2 types configuration for all devices functions
>                  or a string to map device function numbers to their 
> pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1').
>                  Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A
>                  In case that only one port is available use the N/A 
> port type for port2 (e.g '1,4'). (string)
> parm:           log_num_qp:log maximum number of QPs per HCA (default: 
> 19) (int)
> parm:           log_num_srq:log maximum number of SRQs per HCA (default: 
> 16) (int)
> parm:           log_rdmarc_per_qp:log number of RDMARC buffers per QP 
> (default: 4) (int)
> parm:           log_num_cq:log maximum number of CQs per HCA (default: 
> 16) (int)
> parm:           log_num_mcg:log maximum number of multicast groups per 
> HCA (default: 13) (int)
> parm:           log_num_mpt:log maximum number of memory protection 
> table entries per HCA (default: 19) (int)
> parm:           log_num_mtt:log maximum number of memory translation 
> table segments per HCA (default: max(20, 2*MTTs for register all of the 
> host memory limited to 30)) (int)
> parm:           enable_qos:Enable Quality of Service support in the HCA 
> (default: off) (bool)
> parm:           internal_err_reset:Reset device on internal errors if 
> non-zero (default 0) (int)
> 
> 
> 
> 
> 
> So, it seems that, on mellanox-ofed-1.5.3, all 4 of those parameters
> (pfctx, pfcrx, log_num_mtt and log_mtts_per_seg) were on mlx4_core.
> But in mellanox-ofed-2.2, log_num_mtt and log_mtts_per_seg stayed in 
> mlx4_core while pfctx and pfcrx moved to mlx4_en.
> 
> Yes, we were told to add those 2 parameters (log_num_mtt and 
> log_mtts_per_seg) to allow GPFS to use up to 6GB of RAM as cache. The 
> other 2 (pfctx and pfcrx) were set by default in the modprobe.d file. It 
> seems that log_num_mtt still exists in mellanox-2.2.

  First, why bother with mlx4_en if you're running an IB cluster?
Just don't load that module.
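
  On the log_num_mtt and log_mtts_per_seg values themselves: the formula
usually quoted for mlx4 is

   max registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size

That formula is not stated anywhere in this thread, so treat it as an
assumption. A minimal sketch of the arithmetic, assuming 4 KiB pages (the
function name is made up for illustration):

```python
def max_registerable_bytes(log_num_mtt, log_mtts_per_seg, page_size=4096):
    """Upper bound on registerable memory implied by the two mlx4_core
    parameters, under the commonly quoted formula (an assumption here):
    2**log_num_mtt * 2**log_mtts_per_seg * page_size."""
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


# Values from the "options mlx4_core ..." line quoted in this thread.
print(max_registerable_bytes(20, 4) / 2**30)  # 64.0 (GiB)
```

Under those assumptions, log_num_mtt=20 with log_mtts_per_seg=4 allows
registering up to 64 GiB.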

> 
> 
> Should we add them in our modprobe.d like this???
> 
> [root@compute-2-1 ~]# cat /etc/modprobe.d/mlnx.conf
> # Module parameters for MLNX_OFED kernel modules
> blacklist mlx4_core
> blacklist mlx4_en
> blacklist mlx5_core
> blacklist mlx5_ib
> options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
> options mlx4_en pfctx=0 pfcrx=0

  Looks OK to me, but you can drop the last line if you have no use for mlx4_en.
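
  If you want to check an "options" line against what a given driver build
actually accepts before rebooting a node, something along these lines might
help. It is only a sketch: it takes the text printed by 'modinfo <module>'
and reports any option name that has no matching "parm:" line. The function
names and the abbreviated sample output are mine, not part of any OFED tool.

```python
import re


def modinfo_params(modinfo_text):
    """Collect parameter names from the 'parm:' lines of modinfo output."""
    return set(re.findall(r"^parm:\s+(\w+):", modinfo_text, flags=re.MULTILINE))


def unknown_options(options_line, modinfo_text):
    """Return option names from an 'options <module> k=v ...' line that
    the module does not list as parameters."""
    fields = options_line.split()
    if len(fields) < 2 or fields[0] != "options":
        raise ValueError("expected a line of the form 'options <module> k=v ...'")
    known = modinfo_params(modinfo_text)
    names = [field.split("=", 1)[0] for field in fields[2:]]
    return [name for name in names if name not in known]


# Abbreviated sample in the shape of the 2.2 mlx4_core output quoted above
# (descriptions shortened, most parameters omitted).
SAMPLE_MODINFO = """\
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (0-7) (int)
parm:           log_num_mtt:log maximum number of memory translation table segments per HCA (int)
"""

old_line = "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20 log_mtts_per_seg=4"
print(unknown_options(old_line, SAMPLE_MODINFO))  # ['pfctx', 'pfcrx']
```

An empty list means every option on the line is known to that module build.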

> 
> 
> 
> 
> 
> 
> Besides that, there are also the following differences between 
> mellanox-1.5.3 and 2.2 (we never manually modified them):
> 
> 1.5.3:
> [root@compute-1-11 ~]# cat /etc/modprobe.conf
> alias ib0 ib_ipoib
> 
> [root@compute-1-11 ~]# cat /etc/modprobe.d/ib_ipoib.conf
> # install ib_ipoib modprobe --ignore-install ib_ipoib && 
> /sbin/ib_ipoib_sysctl load
> # remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r 
> --ignore-remove ib_ipoib
> options ib_ipoib lro=1
> alias ib0 ib_ipoib
> alias ib1 ib_ipoib
> 
> 
> 2.2:
> [root@compute-2-1 ~]# cat /etc/modprobe.d/ib_ipoib.conf
> # install ib_ipoib modprobe --ignore-install ib_ipoib && 
> /sbin/sysctl_perf_tuning load
> # remove ib_ipoib /sbin/sysctl_perf_tuning unload ; modprobe -r 
> --ignore-remove ib_ipoib
> 
> alias netdev-ib0 ib_ipoib
> alias netdev-ib1 ib_ipoib
> alias netdev-ib2 ib_ipoib
> alias netdev-ib3 ib_ipoib
> alias netdev-ib4 ib_ipoib
> alias netdev-ib5 ib_ipoib
> 
> 
> 
> Could this explain the mac address issue??

  No, this renaming is needed for kernels after 2.6.32 to avoid spamming
the logs with:

   "Loading kernel module for a network device with CAP_SYS_MODULE (deprecated).
    Use CAP_NET_ADMIN and alias netdev-ib0 instead"

entries, and has nothing to do with the address change.

  After looking at the ipoib module sources from MLNX OFED 2.2, it looks like
flags bit 2 of the HW address (mask 0x20, the difference between 80 and a0) means
that the ipoib interface supports TSS.

  The difference in the QP number part of the HW address possibly means that more
QPs have been reserved for the driver's use, and the first client (ipoib) QP number
gets shifted.
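
  To see exactly which fields differ between the two addresses you posted,
the 20 octets can be split along the layout described earlier in this thread
(1 octet of flags, 3 octets of QP number, 16 octets of port GID). A small
sketch; the function name is only for illustration:

```python
def parse_ipoib_hwaddr(addr):
    """Split a colon-separated 20-octet IPoIB hardware address into
    flags (1 octet), QP number (3 octets) and port GID (16 octets)."""
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError("an IPoIB hardware address has 20 octets")
    return {
        "flags": octets[0],
        "qpn": int.from_bytes(octets[1:4], "big"),
        "gid": ":".join(f"{b:02x}" for b in octets[4:]),
    }


# The two addresses reported in this thread.
before = parse_ipoib_hwaddr("80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")
after = parse_ipoib_hwaddr("a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9")

print(hex(before["flags"] ^ after["flags"]))  # 0x20
print(before["qpn"], after["qpn"])            # 72 256
print(before["gid"] == after["gid"])          # True
```

So only the flags octet and the QP number differ; the port GID part is
identical in both addresses.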

> 
> 
> 
> There are also differences in two files under /etc/infiniband:
> 
> # diff /etc/infiniband/info-1.5.3 /etc/infiniband/info-2.2
> 4c4
> < echo Kernel=2.6.32-220.13.1.el6.x86_64
> ---
>  > echo Kernel=2.6.32-431.11.2.el6.x86_64
> 6c6
> < echo "Configure options: --with-core-mod --with-user_mad-mod 
> --with-user_access-mod --with-addr_trans-mod --with-mthca-mod 
> --with-mlx4-mod --with-mlx4_en-mod --with-mlx4_ib-mod 
> --with-mlx4_vnic-mod --with-nes-mod --with-qib-mod --with-ipoib-mod 
> --with-iser-mod --with-sdp-mod --with-srp-mod --with-rds-mod"
> ---
>  > echo "Configure options: --with-core-mod --with-user_mad-mod 
> --with-user_access-mod --with-addr_trans-mod --with-mthca-mod 
> --with-mlx4-mod --with-mlx5-mod --with-mlx4_en-mod --with-mlx4_vnic-mod 
> --with-cxgb3-mod --with-cxgb4-mod --with-nes-mod --with-qib-mod 
> --with-ipoib-mod --with-iser-mod --with-e_ipoib-mod --with-srp-mod 
> --with-rds-mod --with-nfsrdma-mod"

  Just more drivers built, nothing to worry about here.

> 
> 
> # diff /etc/infiniband/openib.conf-1.5.3 /etc/infiniband/openib.conf-2.2
> 8c8,20
> < NODE_DESC_TIME_BEFORE_UPDATE=10
> ---
>  > NODE_DESC_TIME_BEFORE_UPDATE=20

  This one is OK.


>  >
>  > # Set rx_channels/tx_channels to 1 to disable IPoIB RSS/TSS
>  > SET_IPOIB_CHANNELS=no

  Going by the comment, 'yes' seems to disable IPoIB RSS/TSS; 'no' leaves the channels alone.

>  >
>  > # Run /usr/sbin/mlnx_affinity
>  > RUN_AFFINITY_TUNER=no

  Don't know what it is

>  >
>  > # Load UMAD module
>  > UMAD_LOAD=yes

  OK if you want to run OpenSM or infiniband diags on the node

>  >
>  > # Load UVERBS module
>  > UVERBS_LOAD=yes

  Needed for MPI and such.

> 11c23
> < UCM_LOAD=no
> ---
>  > UCM_LOAD=yes

  You need that.

> 26c38
> < MTHCA_LOAD=yes
> ---
>  > MTHCA_LOAD=no

  Set to 'no' unless you still have old InfiniHost devices.

> 33a46,48
>  > # Load MLX5 modules
>  > MLX5_LOAD=yes

  Set to 'no' unless you have Connect-IB devices. I tend to avoid having
unneeded modules loaded.

>  >
> 39a55,60
>  > # Load CXGB3 modules
>  > CXGB3_LOAD=no
>  >
>  > # Load CXGB4 modules
>  > CXGB4_LOAD=no

  Leave it at 'no' if you don't have Chelsio devices.

>  >
> 41c62
> < NES_LOAD=yes
> ---
>  > NES_LOAD=no

  Leave it at 'no'; the iWarp nes driver from Mellanox is an empty stub anyway.

> 47c68,71
> < SET_IPOIB_CM=yes
> ---
>  > SET_IPOIB_CM=auto

  If you want IPoIB Connected Mode, set it to 'yes'. With 'auto', connected mode
will only be enabled on Connect-IB devices, not on ConnectX devices.

>  >
>  > # Load E_IPoIB
>  > E_IPOIB_LOAD=no
> 49,50d72
> < # Load SDP module
> < SDP_LOAD=no
> 55,57d76
> < # Load ISER module
> < ISER_LOAD=no

  Leave as is if it was not enabled on the previous OFED.

> <
> 
> 
> Could any of this matter? Maybe the SET_IPOIB_CM=yes/auto? Or the 
> MLX5_LOAD=yes?

  Not for the address change.
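
  If you would rather compare the two openib.conf files setting by setting
than by diff line numbers, a rough sketch like this could do it. It assumes
the files contain only KEY=value lines and '#' comments, as in the excerpts
above; the function names and the shortened sample contents are made up.

```python
def parse_settings(text):
    """Parse KEY=value lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def compare_settings(old_text, new_text):
    """Report settings that changed, were added, or were removed."""
    old, new = parse_settings(old_text), parse_settings(new_text)
    return {
        "changed": {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]},
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
    }


# Shortened samples built from a few of the settings in the diff above.
OLD_CONF = "UCM_LOAD=no\nMTHCA_LOAD=yes\nSET_IPOIB_CM=yes\n# Load SDP module\nSDP_LOAD=no\n"
NEW_CONF = "UCM_LOAD=yes\nMTHCA_LOAD=no\nSET_IPOIB_CM=auto\n# Load MLX5 modules\nMLX5_LOAD=yes\n"

result = compare_settings(OLD_CONF, NEW_CONF)
for section in ("changed", "added", "removed"):
    print(section, sorted(result[section].items()))
```

On real files you would read both with open(...).read() and pass the two
strings in.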

  Sébastien.

> 
> 
> Thanks,
> 
> Txema
> 
> 
> 


