[Users] Compatibility problems between OFED 1.5.3 and OFED 2.2 ?
Txema Heredia
txema.llistes at gmail.com
Fri Jun 27 07:08:07 PDT 2014
On 27/06/14 11:16, Sébastien Dugué wrote:
> Hi Txema,
>
> On Thu, 26 Jun 2014 16:01:59 +0200
> Txema Heredia <txema.llistes at gmail.com> wrote:
>
>> Thanks Sébastien!
>>
>> I was worried because, when some of my colleagues tried to add to the
>> GPFS cluster some nodes using ofed 2.2, something went wrong and the
>> whole infiniband network collapsed. That's why I was wary of the change.
>> Right now I am adding a couple of 2.2 nodes to the GPFS to check if the
>> problem was due to ofed or some other misconfiguration. I'll report back
>> if I detect any problem.
>>
>>
>> As for ofed, I have a couple of questions:
>>
>> - Is it safe/transparent to update from 1.5.3 to 2.2? Should I update my
>> gpfs servers? Should I wait? Should I keep them on 1.5.3? Would that
>> cause problems in the future?
> I can only speak concerning the community OFED. In fact we're migrating from
> 1.5.4.1 to 3.12 which are very roughly equivalent to mlnx 1.5.3 and mlnx 2.2 and
> so far, it's only a matter of un-installing the old one and installing the
> new one with a few kernel module parameter changes.
>
>> - Why is the mellanox ofed installer changing my ib0 mac address?? When
>> kickstarting the node, the ib0 mac address is
>> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9, but after
>> installing the mellanox drivers it changes to
>> a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9. This
>> prevents the node from bringing up the ib0 interface, so the GPFS service
>> starts over ethernet until you manually "ifup ib0" the node.
> It's not the installer that changes the HW address.
>
> An IPoIB address is constructed as follows:
>
> 80 flags (bit 7 = Connected Mode)
> 00:00:48 QP number (which may change if the ipoib module is reloaded)
> fe:80... Port GID
>
> Concerning the change of the flags from 80 to a0, I've no idea what flags bit 2
> means (this is defined in the standard ipoib driver).
>
> As to the QP number, you must be prepared for it to possibly change if you
> reload the ipoib driver after the system has booted and some QP have been created.
>
>> - Why do the modules change from ofed 1.5.3 to 2.2? My 1.5.3
>> installation generates the following file:
>>
>> # cat /etc/modprobe.d/mlx4_en.conf
>> install mlx4_core modprobe --ignore-install $((modprobe -c | grep -wq
>> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
>> mlx4_core && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
>> "^MLX4_EN_LOAD=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
>> modprobe mlx4_en; fi; else modprobe mlx4_en; fi
>> install mlx4_en modprobe --ignore-install $((modprobe -c | grep -wq
>> "^allow_unsupported_modules") && echo '--allow-unsupported-modules')
>> mlx4_en && if [ -e /etc/infiniband/openib.conf ]; then if ( grep -q
>> "^RUN_SYSCTL=yes" /etc/infiniband/openib.conf > /dev/null 2>&1); then
>> /sbin/sysctl_perf_tuning load; fi; fi
>> remove mlx4_en /sbin/sysctl_perf_tuning unload ; modprobe -r
>> --ignore-remove mlx4_en
>> # Configure Flow Control
>> # pfctx:Priority based Flow Control policy on TX[7:0]. Per priority bit
>> mask (uint)
>> # pfcrx:Priority based Flow Control policy on RX[7:0]. Per priority bit
>> mask (uint)
>> options mlx4_core pfctx=0 pfcrx=0
>>
>> (whose last line we later modify to "options mlx4_core pfctx=0 pfcrx=0
>> log_num_mtt=20 log_mtts_per_seg=4" for gpfs memory considerations)
> The pfctx and pfcrx are MLNX OFED specific and I have no idea what they do.
> On the other hand, the log_xxx parameters make sense to allow registering lots of
> memory. However, with the newest OFED, log_num_mtt no longer exists
> as the tuning is automatically done in the driver to allow registering twice
> the size of the physical memory (if I remember correctly).
>
>>
>> But ofed-2.2 leaves the file like this:
>>
>> # cat /etc/modprobe.d/mlnx.conf
>> # Module parameters for MLNX_OFED kernel modules
>> blacklist mlx4_core
>> blacklist mlx4_en
>> blacklist mlx5_core
>> blacklist mlx5_ib
>>
>> Should I add here the "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20
>> log_mtts_per_seg=4" line? Or should I add it to mlx5_core? Aren't they
>> blacklisted?
> First mlx4 is for ConnectX[1-3] devices and mlx5 for Connect-IB device and from
> your description I suppose you have ConnectX devices so you can forget about mlx5.
>
> Do not add log_num_mtt (it will prevent the driver from loading), you can keep
> log_mtts_per_seg if it helps. However no idea concerning pfctx=0 pfcrx=0.
>
> What does 'modinfo mlx4_core' give? If pfctx and pfcrx are listed, then
> you can probably keep them.
>
> Hope this helps,
>
> Sébastien.
Hi Sébastien,
It seems lots of things are going on here.
First of all, the modinfos:
On the mellanox ofed 1.5.3 node:
[root at compute-1-11 ~]# modinfo mlx4_en
filename:
/lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_en.ko
version: 1.5.7 (Nov 2011)
license: Dual BSD/GPL
description: Mellanox ConnectX HCA Ethernet driver
author: Liran Liss, Yevgeny Petrilin
srcversion: 52D43E38AA89B6F12BDB95F
alias: pci:v000015B3d0000100Fsv*sd*bc*sc*i*
...
alias: pci:v000015B3d00006340sv*sd*bc*sc*i*
depends: mlx4_core
vermagic: 2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
parm: inline_thold:treshold for using inline data (int)
parm: num_rx_rings:Total number of RX Rings (default 16, range
1-16, power of 2) (uint)
parm: udp_rss:Enable RSS for incomming UDP traffic or disabled
(0) (bool)
parm: num_lro:Number of LRO sessions per ring or disabled (0)
(uint)
parm: use_tx_polling:Use polling for TX processing (default 1)
(bool)
parm: enable_sys_tune:Tune the cpu's for better performance
(default 0) (bool)
[root at compute-1-11 ~]# modinfo mlx4_core
filename:
/lib/modules/2.6.32-220.13.1.el6.x86_64/updates/drivers/net/mlx4/mlx4_core.ko
version: 1.0-mlnx_ofed1.5.3
license: Dual BSD/GPL
description: Mellanox ConnectX HCA low-level driver
author: Roland Dreier
srcversion: B261CBCA522DDF6A81AA2D6
alias: pci:v000015B3d0000100Fsv*sd*bc*sc*i*
...
alias: pci:v000015B3d00006340sv*sd*bc*sc*i*
depends:
vermagic: 2.6.32-220.13.1.el6.x86_64 SMP mod_unload modversions
parm: set_4k_mtu:attempt to set 4K MTU to all ConnectX ports (int)
parm: pfctx:Priority based Flow Control policy on TX[7:0]. Per
priority bit mask (uint)
parm: pfcrx:Priority based Flow Control policy on RX[7:0]. Per
priority bit mask (uint)
parm: debug_level:Enable debug tracing if > 0 (int)
parm: block_loopback:Block multicast loopback packets if > 0 (int)
parm: msi_x:attempt to use MSI-X if nonzero (int)
parm: high_rate_steer:Enable steering mode for higher packet
rate (default off) (int)
parm: sr_iov:enable #sr_iov functions if sr_iov > 0 (int)
parm: probe_vf:number of vfs to probe by pf driver (sr_iov >
0) (int)
parm: log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
parm: use_prio:Enable steering by VLAN priority on ETH ports
(0/1, default 0) (bool)
parm: fast_drop:Enable fast packet drop when no recieve WQEs
are posted (int)
parm: log_num_qp:log maximum number of QPs per HCA (int)
parm: log_num_srq:log maximum number of SRQs per HCA (int)
parm: log_rdmarc_per_qp:log number of RDMARC buffers per QP (int)
parm: log_num_cq:log maximum number of CQs per HCA (int)
parm: log_num_mcg:log maximum number of multicast groups per
HCA (int)
parm: log_num_mpt:log maximum number of memory protection
table entries per HCA (int)
parm: log_num_mtt:log maximum number of memory translation
table segments per HCA (int)
parm: log_mtts_per_seg:Log2 number of MTT entries per segment
(0-7) (int)
parm: enable_qos:Enable Quality of Service support in the HCA
(default: off) (bool)
parm: enable_pre_t11_mode:For FCoXX, enable pre-t11 mode if
non-zero (default: 0) (int)
parm: internal_err_reset:Reset device on internal errors if
non-zero (default 1) (int)
And the mellanox ofed 2.2 node:
[root at compute-2-1 ~]# modinfo mlx4_en
filename:
/lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_en.ko
version: 2.2-1.0.0 (Jun 23 2014)
license: Dual BSD/GPL
description: Mellanox ConnectX HCA Ethernet driver
author: Liran Liss, Yevgeny Petrilin
srcversion: D7067BE4EB268A8A2D19B64
[no alias lines in this output?]
depends: mlx4_core,compat,ptp
vermagic: 2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
parm: inline_thold:threshold for using inline data (uint)
parm: udp_rss:Enable RSS for incoming UDP traffic (uint)
parm: num_lro:Dummy module parameter to prevent loading issues
(uint)
parm: pfctx:Priority based Flow Control policy on TX[7:0]. Per
priority bit mask (uint)
parm: pfcrx:Priority based Flow Control policy on RX[7:0]. Per
priority bit mask (uint)
[root at compute-2-1 ~]# modinfo mlx4_core
filename:
/lib/modules/2.6.32-431.11.2.el6.x86_64/extra/mlnx-ofa_kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
version: 1.1
license: Dual BSD/GPL
description: Mellanox ConnectX HCA low-level driver
author: Roland Dreier
srcversion: 9A90DAE92A2E75BF5F67A24
alias: pci:v000015B3d00001010sv*sd*bc*sc*i*
...
alias: pci:v000015B3d00006340sv*sd*bc*sc*i*
depends: compat
vermagic: 2.6.32-431.11.2.el6.x86_64 SMP mod_unload modversions
parm: set_4k_mtu:(Obsolete) attempt to set 4K MTU to all
ConnectX ports (int)
parm: debug_level:Enable debug tracing if > 0 (int)
parm: msi_x:0 - don't use MSI-X, 1 - use MSI-X, >1 - limit
number of MSI-X irqs to msi_x (non-SRIOV only) (int)
parm: enable_sys_tune:Tune the cpu's for better performance
(default 0) (int)
parm: block_loopback:Block multicast loopback packets if > 0
(default: 1) (int)
parm: num_vfs:Either single value (e.g. '5') or triplet (e.g.
'10,11,12') to define uniform num_vfs value for all devices functions.
If a single value is given, this value will be used in
order to define <num_vfs> dual ports virtual functions.
If a triplet <a,b,c> is given, <a> single port virtual
functions are defined on port1, <b> single port
virtual functions are defined on port2 and <c> dual
port virtual functions are defined.
Alternatively, a string to map device function numbers
to their num_vfs values
(e.g. '0000:04:00.0-5,002b:1c:0b.a-15;2;4') could be
given.
Hexadecimal digits for the device function (e.g.
002b:1c:0b.a) and decimal or triplet for num_vfs value
(e.g. 15 or 1;2;3). (string)
parm: probe_vf:Either single value (e.g. '3') or triplet (e.g
'1,2,3') to define uniform number of VFs to probe by the pf
driver for all devices functions.
If a single value is given, this value will be used in
order to define <probe_vf> probed dual ports virtual
functions. If a triplet <a,b,c> is given, <a> single
port virtual functions are probed on port1, <b> single port
virtual functions are probed on port2 and <c> dual port
virtual functions are probed.
Alternatively, a string to map device function numbers
to their probe_vf values
(e.g. '0000:04:00.0-3,002b:1c:0b.a-13;12;11') could be
given.
Hexadecimal digits for the device function (e.g.
002b:1c:0b.a) and decimal for probe_vf value (e.g. 13 or 1;2;3). (string)
parm: log_num_mgm_entry_size:log mgm size, that defines the
num of qp per mcg, for example: 10 gives 248.range: 7 <=
log_num_mgm_entry_size <= 12. To activate device managed flow steering
when available, set to -1 (int)
parm: high_rate_steer:Enable steering mode for higher packet
rate (default off) (int)
parm: fast_drop:Enable fast packet drop when no recieve WQEs
are posted (int)
parm: enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the the
FW supports this if non-zero (default: 1) (int)
parm: log_num_mac:Log2 max number of MACs per ETH port (1-7) (int)
parm: log_num_vlan:(Obsolete) Log2 max number of VLANs per ETH
port (0-7) (int)
parm: log_mtts_per_seg:Log2 number of MTT entries per segment
(0-7) (default: 0) (int)
parm: port_type_array:Either pair of values (e.g. '1,2') to
define uniform port1/port2 types configuration for all devices functions
or a string to map device function numbers to their
pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1').
Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A
In case that only one port is available use the N/A
port type for port2 (e.g '1,4'). (string)
parm: log_num_qp:log maximum number of QPs per HCA (default:
19) (int)
parm: log_num_srq:log maximum number of SRQs per HCA (default:
16) (int)
parm: log_rdmarc_per_qp:log number of RDMARC buffers per QP
(default: 4) (int)
parm: log_num_cq:log maximum number of CQs per HCA (default:
16) (int)
parm: log_num_mcg:log maximum number of multicast groups per
HCA (default: 13) (int)
parm: log_num_mpt:log maximum number of memory protection
table entries per HCA (default: 19) (int)
parm: log_num_mtt:log maximum number of memory translation
table segments per HCA (default: max(20, 2*MTTs for register all of the
host memory limited to 30)) (int)
parm: enable_qos:Enable Quality of Service support in the HCA
(default: off) (bool)
parm: internal_err_reset:Reset device on internal errors if
non-zero (default 0) (int)
So, it seems that on mellanox-ofed-1.5.3 all four parameters (pfctx,
pfcrx, log_num_mtt and log_mtts_per_seg) were in mlx4_core, but in
mellanox-ofed-2.2 log_num_mtt and log_mtts_per_seg stayed in mlx4_core
while pfctx and pfcrx moved to mlx4_en.
Yes, we were told to add those two parameters (log_num_mtt and
log_mtts_per_seg) to allow GPFS to use up to 6GB of RAM as cache. The
other two (pfctx and pfcrx) were set by default in the modprobe.d file.
It seems that log_num_mtt still exists in mellanox-2.2.
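For reference, a rough sketch of how much memory those two values would allow to be registered. It assumes the commonly cited rule of thumb (for example in the Open MPI FAQ) that the limit is 2^log_num_mtt * 2^log_mtts_per_seg * page size, and a 4 KiB page; neither assumption comes from the modinfo output above, and the driver in 2.2 may compute things differently.

```python
# Assumption: x86_64 with 4 KiB pages.
PAGE_SIZE = 4096


def max_registrable_bytes(log_num_mtt, log_mtts_per_seg, page_size=PAGE_SIZE):
    """Rule-of-thumb upper bound on registrable memory for mlx4_core.

    Assumes: limit = 2^log_num_mtt * 2^log_mtts_per_seg * page_size.
    """
    return (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size


if __name__ == "__main__":
    limit = max_registrable_bytes(20, 4)
    print("log_num_mtt=20 log_mtts_per_seg=4 ->", limit // 2**30, "GiB")
```

If that rule of thumb holds, the values we use would be well above the 6GB we need for the GPFS cache.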
Should we add them to our modprobe.d like this?
[root at compute-2-1 ~]# cat /etc/modprobe.d/mlnx.conf
# Module parameters for MLNX_OFED kernel modules
blacklist mlx4_core
blacklist mlx4_en
blacklist mlx5_core
blacklist mlx5_ib
options mlx4_core log_num_mtt=20 log_mtts_per_seg=4
options mlx4_en pfctx=0 pfcrx=0
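Before loading the driver with those lines, one could check each "options" line against what modinfo reports for that module. The following is only a sketch: the helper names are made up for illustration, and the sample text is abbreviated from the 2.2 modinfo output above rather than read from a live system.

```python
import re


def parm_names(modinfo_text):
    """Collect parameter names from the 'parm:' lines of modinfo output."""
    return set(re.findall(r"^parm:\s+(\w+):", modinfo_text, flags=re.M))


def unknown_options(options_line, modinfo_text):
    """Return names on a modprobe.d 'options' line that modinfo does not list."""
    tokens = options_line.split()
    # tokens[0] is "options", tokens[1] is the module name
    names = [tok.split("=", 1)[0] for tok in tokens[2:]]
    known = parm_names(modinfo_text)
    return [name for name in names if name not in known]


# Abbreviated excerpts of the 2.2 node's modinfo output
MLX4_CORE_22 = """\
parm: log_mtts_per_seg:Log2 number of MTT entries per segment
(0-7) (default: 0) (int)
parm: log_num_mtt:log maximum number of memory translation
table segments per HCA (int)
"""
MLX4_EN_22 = """\
parm: pfctx:Priority based Flow Control policy on TX[7:0]. Per
priority bit mask (uint)
parm: pfcrx:Priority based Flow Control policy on RX[7:0]. Per
priority bit mask (uint)
"""

if __name__ == "__main__":
    old_line = "options mlx4_core pfctx=0 pfcrx=0 log_num_mtt=20 log_mtts_per_seg=4"
    print(unknown_options(old_line, MLX4_CORE_22))
    print(unknown_options("options mlx4_en pfctx=0 pfcrx=0", MLX4_EN_22))
```

With the excerpts above, the old 1.5.3-style line reports pfctx and pfcrx as unknown for mlx4_core, while the split lines report nothing unknown.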
Besides that, there are also the following differences between
mellanox-1.5.3 and 2.2 (we never modified these files manually):
1.5.3:
[root at compute-1-11 ~]# cat /etc/modprobe.conf
alias ib0 ib_ipoib
[root at compute-1-11 ~]# cat /etc/modprobe.d/ib_ipoib.conf
# install ib_ipoib modprobe --ignore-install ib_ipoib &&
/sbin/ib_ipoib_sysctl load
# remove ib_ipoib /sbin/ib_ipoib_sysctl unload ; modprobe -r
--ignore-remove ib_ipoib
options ib_ipoib lro=1
alias ib0 ib_ipoib
alias ib1 ib_ipoib
2.2:
[root at compute-2-1 ~]# cat /etc/modprobe.d/ib_ipoib.conf
# install ib_ipoib modprobe --ignore-install ib_ipoib &&
/sbin/sysctl_perf_tuning load
# remove ib_ipoib /sbin/sysctl_perf_tuning unload ; modprobe -r
--ignore-remove ib_ipoib
alias netdev-ib0 ib_ipoib
alias netdev-ib1 ib_ipoib
alias netdev-ib2 ib_ipoib
alias netdev-ib3 ib_ipoib
alias netdev-ib4 ib_ipoib
alias netdev-ib5 ib_ipoib
Could this explain the MAC address issue?
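To see exactly what differs between the two ib0 addresses, here is a small sketch that splits them using the layout Sébastien describes (1 flags octet, 3 octets of QP number, 16 octets of port GID). The function name is made up for illustration.

```python
def decode_ipoib_hwaddr(addr):
    """Split a 20-octet IPoIB hardware address into flags, QPN and GID."""
    octets = bytes(int(part, 16) for part in addr.split(":"))
    if len(octets) != 20:
        raise ValueError("expected 20 octets, got %d" % len(octets))
    return {
        "flags": octets[0],
        "qpn": int.from_bytes(octets[1:4], "big"),
        "gid": ":".join("%02x" % b for b in octets[4:]),
    }


BEFORE = "80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9"
AFTER = "a0:00:01:00:fe:80:00:00:00:00:00:00:00:02:c9:03:00:55:a4:b9"

if __name__ == "__main__":
    for label, addr in (("kickstart", BEFORE), ("mlnx ofed", AFTER)):
        fields = decode_ipoib_hwaddr(addr)
        print(label, hex(fields["flags"]), hex(fields["qpn"]), fields["gid"])
```

Decoded this way, the port GID is identical in both addresses; only the flags octet (0x80 vs 0xa0) and the QP number (0x48 vs 0x100) change.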
There are also differences in two files under /etc/infiniband:
# diff /etc/infiniband/info-1.5.3 /etc/infiniband/info-2.2
4c4
< echo Kernel=2.6.32-220.13.1.el6.x86_64
---
> echo Kernel=2.6.32-431.11.2.el6.x86_64
6c6
< echo "Configure options: --with-core-mod --with-user_mad-mod
--with-user_access-mod --with-addr_trans-mod --with-mthca-mod
--with-mlx4-mod --with-mlx4_en-mod --with-mlx4_ib-mod
--with-mlx4_vnic-mod --with-nes-mod --with-qib-mod --with-ipoib-mod
--with-iser-mod --with-sdp-mod --with-srp-mod --with-rds-mod"
---
> echo "Configure options: --with-core-mod --with-user_mad-mod
--with-user_access-mod --with-addr_trans-mod --with-mthca-mod
--with-mlx4-mod --with-mlx5-mod --with-mlx4_en-mod --with-mlx4_vnic-mod
--with-cxgb3-mod --with-cxgb4-mod --with-nes-mod --with-qib-mod
--with-ipoib-mod --with-iser-mod --with-e_ipoib-mod --with-srp-mod
--with-rds-mod --with-nfsrdma-mod"
# diff /etc/infiniband/openib.conf-1.5.3 /etc/infiniband/openib.conf-2.2
8c8,20
< NODE_DESC_TIME_BEFORE_UPDATE=10
---
> NODE_DESC_TIME_BEFORE_UPDATE=20
>
> # Set rx_channels/tx_channels to 1 to disable IPoIB RSS/TSS
> SET_IPOIB_CHANNELS=no
>
> # Run /usr/sbin/mlnx_affinity
> RUN_AFFINITY_TUNER=no
>
> # Load UMAD module
> UMAD_LOAD=yes
>
> # Load UVERBS module
> UVERBS_LOAD=yes
11c23
< UCM_LOAD=no
---
> UCM_LOAD=yes
26c38
< MTHCA_LOAD=yes
---
> MTHCA_LOAD=no
33a46,48
> # Load MLX5 modules
> MLX5_LOAD=yes
>
39a55,60
> # Load CXGB3 modules
> CXGB3_LOAD=no
>
> # Load CXGB4 modules
> CXGB4_LOAD=no
>
41c62
< NES_LOAD=yes
---
> NES_LOAD=no
47c68,71
< SET_IPOIB_CM=yes
---
> SET_IPOIB_CM=auto
>
> # Load E_IPoIB
> E_IPOIB_LOAD=no
49,50d72
< # Load SDP module
< SDP_LOAD=no
55,57d76
< # Load ISER module
< ISER_LOAD=no
<
Could any of this matter? Maybe the SET_IPOIB_CM=yes/auto? Or the
MLX5_LOAD=yes?
Thanks,
Txema