[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?

Txema Heredia txema.llistes at gmail.com
Fri Jun 27 07:08:07 PDT 2014


On 27/06/14 11:16, Sébastien Dugué wrote:
> Hi Txema,
>
> On Thu, 26 Jun 2014 16:01:59 +0200
> Txema Heredia <txema.llistes at gmail.com> wrote:
>
>> Thanks Sébastien!
>>
>> I was worried because, when some of my colleagues tried to add to the
>> GPFS cluster some nodes using ofed 2.2, something went wrong and the
>> whole infiniband network collapsed. That's why I was wary of the change.
>> Right now I am adding a couple of 2.2 nodes to the GPFS to check if the
>> problem was due to ofed or some other misconfiguration. I'll report back
>> if I detect any problem.
>>
>>
>> As for ofed, I have a couple of questions:
>>
>> - Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my
>> gpfs servers? Should I wait? Should I keep them on 1.5.3? Would that
>> cause problems in the future?
>    I can only speak for the community OFED. In fact we're migrating from
> 1.5.4.1 to 3.12, which are very roughly equivalent to mlnx 1.5.3 and mlnx 2.2,
> and so far it has only been a matter of uninstalling the old one and installing
> the new one, with a few kernel module parameter changes.
>
>> - Why is the mellanox ofed installer changing my ib0 mac address?? When
>> kickstarting the node, the ib0 mac address is
>> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after
>> installing the mellanox drivers it changes to
>> a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. This
>> prevents the node from bringing up the ib0 interface, so the GPFS service
>> starts over ethernet until you manually "ifup ib0" on the node.
>    It's not the installer that changes the HW address.
>
>    An IPoIB address is constructed as follows:
>
>    80		flags (bit 7 = Connected Mode)
>    00:00:48	QP number (which may change if the ipoib module is reloaded)
>    fe:80...	Port GID
>
>    Concerning the change of the flags from 80 to a0, I've no idea what flags
> bit 5 (0x20) means (this is not defined in the standard ipoib driver).
>
>    As to the QP number, you must be prepared for it to possibly change if you
> reload the ipoib driver after the system has booted and some QPs have been created.
>
>> - Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3
>> installation generates the following file:
>>
>> # cat /etc/modprobe.d/mlx4_en.conf
>> install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then modprobe mlx4_en; fi; else modprobe mlx4_en; fi
>> install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq "^allow_unsupported_modules") && echo '--allow-unsupported-modules') mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then /sbin/sysctl_perf_tuning load; fi; fi
>> remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove mlx4_en
>> # Configure Flow Control
>> # pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
>> # pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
>> options mlx4_core pfctx=0 pfcrx=0
>>
>> (whose last line we later modify to  "options mlx4_core pfctx=0 pfcrx=0
>> log_num_mtt=20 log_mtts_per_seg=4" for gpfs memory considerations)
>    The pfctx and pfcrx parameters are MLNX OFED specific and I have no idea
> what they do. On the other hand, the log_xxx parameters make sense to allow
> registering lots of memory. However, with the newest OFED, log_num_mtt no
> longer exists, as the tuning is done automatically in the driver to allow
> registering twice the size of the physical memory (if I remember correctly).
>
>>
>> But ofed-2.2 leaves the file like this:
>>
>> # cat /etc/modprobe.d/mlnx.conf
>> # Module parameters for MLNX_OFED kernel modules
>> blacklist mlx4_core
>> blacklist mlx4_en
>> blacklist mlx5_core
>> blacklist mlx5_ib
>>
>> Should I add the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20
>> log_mtts_per_seg=4" line here? Or should I add it to mlx5_core? Aren't they
>> blacklisted?
>    First, mlx4 is for ConnectX[1-3] devices and mlx5 for Connect-IB devices. From
> your description I suppose you have ConnectX devices, so you can forget about mlx5.
>
>    Do not add log_num_mtt (it will prevent the driver from loading); you can keep
> log_mtts_per_seg if it helps. However, I have no idea concerning pfctx=0 pfcrx=0.
>
>    What does 'modinfo mlx4_core' give? If pfctx and pfcrx are listed, then
> you can probably keep them.
>
>    Hope this helps,
>
>    Sébastien.

Hi Sébastien,

It seems lots of things are going on here.

First of all, the modinfos:

On the mellanox ofed 1.5.3 node:


[root@compute-1-11 ~]# modinfo mlx4_en
filename:       /lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_en.ko
version:        1.5.7 (Nov 2011)
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA Ethernet driver
author:         Liran Liss, Yevgeny Petrilin
srcversion:     52D43E38AA89B6F12BDB95F
alias:          pci:v000015B3d0000100Fsv*sd*bc*sc*i*
...
alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
depends:        mlx4_core
vermagic:       2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
parm:           inline_thold:treshold for using inline data (int)
parm:           num_rx_rings:Total number of RX Rings (default 16, range 1-16, power of 2) (uint)
parm:           udp_rss:Enable RSS for incomming UDP traffic or disabled (0) (bool)
parm:           num_lro:Number of LRO sessions per ring or disabled (0) (uint)
parm:           use_tx_polling:Use polling for TX processing (default 1) (bool)
parm:           enable_sys_tune:Tune the cpu's for better performance (default 0) (bool)


[root@compute-1-11 ~]# modinfo mlx4_core
filename:       /lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_core.ko
version:        1.0-mlnx_ofed1.5.3
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA low-level driver
author:         Roland Dreier
srcversion:     B261CBCA522DDF6A81AA2D6
alias:          pci:v000015B3d0000100Fsv*sd*bc*sc*i*
...
alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
depends:
vermagic:       2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
parm:           set_4k_mtu:attempt to set 4K MTU to all ConnectX ports (int)
parm:           pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
parm:           pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           block_loopback:Block multicast loopback packets if > 0 (int)
parm:           msi_x:attempt to use MSI-X if nonzero (int)
parm:           high_rate_steer:Enable steering mode for higher packet rate (default off) (int)
parm:           sr_iov:enable #sr_iov functions if sr_iov > 0 (int)
parm:           probe_vf:number of vfs to probe by pf driver (sr_iov > 0) (int)
parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
parm:           use_prio:Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
parm:           fast_drop:Enable fast packet drop when no recieve WQEs are posted (int)
parm:           log_num_qp:log maximum number of QPs per HCA (int)
parm:           log_num_srq:log maximum number of SRQs per HCA (int)
parm:           log_rdmarc_per_qp:log number of RDMARC buffers per QP (int)
parm:           log_num_cq:log maximum number of CQs per HCA (int)
parm:           log_num_mcg:log maximum number of multicast groups per HCA (int)
parm:           log_num_mpt:log maximum number of memory protection table entries per HCA (int)
parm:           log_num_mtt:log maximum number of memory translation table segments per HCA (int)
parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (0-7) (int)
parm:           enable_qos:Enable Quality of Service support in the HCA (default: off) (bool)
parm:           enable_pre_t11_mode:For FCoXX, enable pre-t11 mode if non-zero (default: 0) (int)
parm:           internal_err_reset:Reset device on internal errors if non-zero (default 1) (int)




And the mellanox ofed 2.2 node:

[root@compute-2-1 ~]# modinfo mlx4_en
filename:       /lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version:        2.2-1.0.0 (Jun 23 2014)
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA Ethernet driver
author:         Liran Liss, Yevgeny Petrilin
srcversion:     D7067BE4EB268A8A2D19B64
¿¿¿NO ALIAS HERE???
depends:        mlx4_core,compat,ptp
vermagic:       2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
parm:           inline_thold:threshold for using inline data (uint)
parm:           udp_rss:Enable RSS for incoming UDP traffic (uint)
parm:           num_lro:Dummy module parameter to prevent loading issues (uint)
parm:           pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
parm:           pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)

[root@compute-2-1 ~]# modinfo mlx4_core
filename:       /lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version:        1.1
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA low-level driver
author:         Roland Dreier
srcversion:     9A90DAE92A2E75BF5F67A24
alias:          pci:v000015B3d00001010sv*sd*bc*sc*i*
...
alias:          pci:v000015B3d00006340sv*sd*bc*sc*i*
depends:        compat
vermagic:       2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
parm:           set_4k_mtu:(Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
parm:           debug_level:Enable debug tracing if > 0 (int)
parm:           msi_x:0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
parm:           enable_sys_tune:Tune the cpu's for better performance (default 0) (int)
parm:           block_loopback:Block multicast loopback packets if > 0 (default: 1) (int)
parm:           num_vfs:Either single value (e.g. '5') or triplet (e.g. '10,11,12') to define uniform num_vfs value for all devices functions.
                 If a single value is given, this value will be used in order to define <num_vfs> dual ports virtual functions.
                 If a triplet <a,b,c> is given, <a> single port virtual functions are defined on port1, <b> single port
                 virtual functions are defined on port2 and <c> dual port virtual functions are defined.
                 Alternatively, a string to map device function numbers to their num_vfs values
                  (e.g. '0000:04:00.0-5,002b:1c:0b.a-15;2;4') could be given.
                 Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal or triplet for num_vfs value
                 (e.g. 15 or 1;2;3). (string)
parm:           probe_vf:Either single value (e.g. '3') or triplet (e.g '1,2,3') to define uniform number of VFs to probe by the pf
                 driver for all devices functions.
                 If a single value is given, this value will be used in order to define <probe_vf> probed dual ports virtual
                 functions. If a triplet <a,b,c> is given, <a> single port virtual functions are probed on port1, <b> single port
                 virtual functions are probed on port2 and <c> dual port virtual functions are probed.
                 Alternatively, a string to map device function numbers to their probe_vf values
                 (e.g. '0000:04:00.0-3,002b:1c:0b.a-13;12;11') could be given.
                 Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for probe_vf value (e.g. 13 or 1;2;3). (string)
parm:           log_num_mgm_entry_size:log mgm size, that defines the num of qp per mcg, for example: 10 gives 248.range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1 (int)
parm:           high_rate_steer:Enable steering mode for higher packet rate (default off) (int)
parm:           fast_drop:Enable fast packet drop when no recieve WQEs are posted (int)
parm:           enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the the FW supports this if non-zero (default: 1) (int)
parm:           log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
parm:           log_num_vlan:(Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)
parm:           log_mtts_per_seg:Log2 number of MTT entries per segment (0-7) (default: 0) (int)
parm:           port_type_array:Either pair of values (e.g. '1,2') to define uniform port1/port2 types configuration for all devices functions
                 or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1').
                 Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A
                 In case that only one port is available use the N/A port type for port2 (e.g '1,4'). (string)
parm:           log_num_qp:log maximum number of QPs per HCA (default: 19) (int)
parm:           log_num_srq:log maximum number of SRQs per HCA (default: 16) (int)
parm:           log_rdmarc_per_qp:log number of RDMARC buffers per QP (default: 4) (int)
parm:           log_num_cq:log maximum number of CQs per HCA (default: 16) (int)
parm:           log_num_mcg:log maximum number of multicast groups per HCA (default: 13) (int)
parm:           log_num_mpt:log maximum number of memory protection table entries per HCA (default: 19) (int)
parm:           log_num_mtt:log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
parm:           enable_qos:Enable Quality of Service support in the HCA (default: off) (bool)
parm:           internal_err_reset:Reset device on internal errors if non-zero (default 0) (int)





So it seems that in mellanox-ofed-1.5.3 all four parameters (pfctx,
pfcrx, log_num_mtt and log_mtts_per_seg) belonged to mlx4_core, whereas
in mellanox-ofed-2.2 log_num_mtt and log_mtts_per_seg stay in mlx4_core
while pfctx and pfcrx have moved to mlx4_en.

Yes, we were told to add those 2 parameters (log_num_mtt and 
log_mtts_per_seg) to allow GPFS to use up to 6GB of RAM as cache. The 
other 2 (pfctx and pfcrx) were set by default in the modprobe.d file. It 
seems that log_num_mtt still exists in mellanox-2.2.
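
For reference, a rough sketch of the arithmetic behind those two
parameters. It assumes the commonly cited rule of thumb that mlx4 can
register at most 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE bytes, and
a 4 KiB page size. Neither assumption comes from the modinfo output above,
and the function name max_reg_bytes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate the maximum registerable memory for mlx4,
# ASSUMING max = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
max_reg_bytes() {
    log_num_mtt=$1
    log_mtts_per_seg=$2
    page_size=${3:-4096}   # assumed 4 KiB pages (x86_64 default)
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size ))
}

# Values from the options line discussed in this thread.
bytes=$(max_reg_bytes 20 4)
echo "log_num_mtt=20 log_mtts_per_seg=4 -> $(( bytes >> 30 )) GiB"
```

Under that formula the 1.5.3 settings would allow 64 GiB to be registered,
well above the 6GB GPFS cache mentioned above.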


Should we add them to our modprobe.d file like this?

[root@compute-2-1 ~]# cat /etc/modprobe.d/mlnx.conf
# Module parameters for MLNX_OFED kernel modules
blacklist mlx4_core
blacklist mlx4_en
blacklist mlx5_core
blacklist mlx5_ib
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
options mlx4_en pfctx=0 pfcrx=0
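
One way to sanity-check such options lines before rebooting is to compare
each name=value pair against the "parm:" lines that modinfo prints for the
module. The sketch below does that on modinfo text passed in as a string;
the helper names modinfo_parms and unsupported_opts are hypothetical, and
the sample text is abridged from the 2.2 mlx4_core output above.

```shell
#!/bin/sh
# Print the parameter names found in modinfo output read from stdin.
# Only lines that start with "parm:" are considered.
modinfo_parms() {
    sed -n 's/^parm:[[:space:]]*\([A-Za-z0-9_]*\):.*/\1/p'
}

# Hypothetical helper: print every option name the module does not
# advertise. Usage: unsupported_opts "<modinfo text>" name=value...
unsupported_opts() {
    parms=$(printf '%s\n' "$1" | modinfo_parms)
    shift
    for opt in "$@"; do
        name=${opt%%=*}
        printf '%s\n' "$parms" | grep -qxF "$name" || echo "$name"
    done
}

# Two parm lines abridged from the 2.2 mlx4_core output.
sample='parm:           log_num_mtt:log maximum number of memory translation
parm:           log_mtts_per_seg:Log2 number of MTT entries per segment'
unsupported_opts "$sample" log_num_mtt=20 log_mtts_per_seg=4 pfctx=0
```

On a live node the first argument could be "$(modinfo mlx4_core)" or
"$(modinfo mlx4_en)". According to modprobe.d(5), a blacklist line only
makes modprobe ignore the module's built-in aliases, so options lines
should still apply when the module is loaded by name; whether the OFED
init script loads it that way is an assumption here.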






Besides that, there are also the following differences between
mellanox-1.5.3 and 2.2 (we never modified these files manually):

1.5.3:
[root@compute-1-11 ~]# cat /etc/modprobe.conf
alias ib0 ib_ipoib

[root@compute-1-11 ~]# cat /etc/modprobe.d/ib_ipoib.conf
# install ib_ipoib modprobe --ignore-install ib_ipoib && /sbin/ib_ipoib_sysctl load
# remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r --ignore-remove ib_ipoib
options ib_ipoib lro=1
alias ib0 ib_ipoib
alias ib1 ib_ipoib


2.2:
[root@compute-2-1 ~]# cat /etc/modprobe.d/ib_ipoib.conf
# install ib_ipoib modprobe --ignore-install ib_ipoib && /sbin/sysctl_perf_tuning load
# remove ib_ipoib /sbin/sysctl_perf_tuning unload ; modprobe -r --ignore-remove ib_ipoib

alias netdev-ib0 ib_ipoib
alias netdev-ib1 ib_ipoib
alias netdev-ib2 ib_ipoib
alias netdev-ib3 ib_ipoib
alias netdev-ib4 ib_ipoib
alias netdev-ib5 ib_ipoib



Could this explain the MAC address issue?
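
To see exactly which part of the address changed, the two addresses can be
split along the layout Sébastien describes above (1 byte of flags, 3 bytes
of QP number, 16 bytes of port GID). This is only a sketch; the function
name decode_ipoib_addr is made up for illustration.

```shell
#!/bin/sh
# Split a 20-byte IPoIB hardware address (colon-separated hex) into
# flags (byte 1), QP number (bytes 2-4) and port GID (bytes 5-20).
decode_ipoib_addr() {
    addr=$1
    echo "flags=$(echo "$addr" | cut -d: -f1)"
    echo "qpn=$(echo "$addr" | cut -d: -f2-4 | tr -d ':')"
    echo "gid=$(echo "$addr" | cut -d: -f5-20)"
}

# The two addresses reported earlier in this thread.
before=80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9
after=a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9
decode_ipoib_addr "$before"
decode_ipoib_addr "$after"
```

Only the flags (80 vs a0) and the QP number (000048 vs 000100) differ; the
port GID is identical in both addresses.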



There are also differences in two files under /etc/infiniband:

# diff /etc/infiniband/info-1.5.3 /etc/infiniband/info-2.2
4c4
< echo Kernel=2.6.32-220.13.1.el6.x86_64
---
 > echo Kernel=2.6.32-431.11.2.el6.x86_64
6c6
< echo "Configure options: --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-mlx4_en-mod --with-mlx4_ib-mod --with-mlx4_vnic-mod --with-nes-mod --with-qib-mod --with-ipoib-mod --with-iser-mod --with-sdp-mod --with-srp-mod --with-rds-mod"
---
 > echo "Configure options: --with-core-mod --with-user_mad-mod --with-user_access-mod --with-addr_trans-mod --with-mthca-mod --with-mlx4-mod --with-mlx5-mod --with-mlx4_en-mod --with-mlx4_vnic-mod --with-cxgb3-mod --with-cxgb4-mod --with-nes-mod --with-qib-mod --with-ipoib-mod --with-iser-mod --with-e_ipoib-mod --with-srp-mod --with-rds-mod --with-nfsrdma-mod"


# diff /etc/infiniband/openib.conf-1.5.3 /etc/infiniband/openib.conf-2.2
8c8,20
< NODE_DESC_TIME_BEFORE_UPDATE=10
---
 > NODE_DESC_TIME_BEFORE_UPDATE=20
 >
 > # Set rx_channels/tx_channels to 1 to disable IPoIB RSS/TSS
 > SET_IPOIB_CHANNELS=no
 >
 > # Run /usr/sbin/mlnx_affinity
 > RUN_AFFINITY_TUNER=no
 >
 > # Load UMAD module
 > UMAD_LOAD=yes
 >
 > # Load UVERBS module
 > UVERBS_LOAD=yes
11c23
< UCM_LOAD=no
---
 > UCM_LOAD=yes
26c38
< MTHCA_LOAD=yes
---
 > MTHCA_LOAD=no
33a46,48
 > # Load MLX5 modules
 > MLX5_LOAD=yes
 >
39a55,60
 > # Load CXGB3 modules
 > CXGB3_LOAD=no
 >
 > # Load CXGB4 modules
 > CXGB4_LOAD=no
 >
41c62
< NES_LOAD=yes
---
 > NES_LOAD=no
47c68,71
< SET_IPOIB_CM=yes
---
 > SET_IPOIB_CM=auto
 >
 > # Load E_IPoIB
 > E_IPOIB_LOAD=no
49,50d72
< # Load SDP module
< SDP_LOAD=no
55,57d76
< # Load ISER module
< ISER_LOAD=no
<


Could any of this matter? Maybe the SET_IPOIB_CM=yes/auto? Or the 
MLX5_LOAD=yes?


Thanks,

Txema





