[Academic.hsir] Problem with SLES11 + OFED + mlnx
Charles Wright
charles at asc.edu
Thu Oct 15 07:16:12 PDT 2009
Hello,
I just got some new cluster hardware and could not get SLES11 and OFED
to work on it.
With SLES 11 I tried 3 different OFED configurations.
(My new cluster hardware has Nehalem processors, and it'd be nice to make
use of the SLES 11 power management features for Nehalem CPUs.)
Here's my system's info:
Compute nodes:
http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
Which has an integrated Mellanox Technologies MT26418 [ConnectX IB DDR,
PCIe 2.0 5GT/s] (rev a0)
*Configuration 1.* SLES 11 + the OFED that comes with SLES11 (this is
OFED 1.4.0)
I had to set options mlx4_core msi_x=0 in /etc/modprobe.conf.local to
even get the mlx4 module to load.
I found the advice for that here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Note that under 1.4.2 and 1.5-beta the modules load fine without
mlx4_core msi_x=0 being set.)
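For reference, the workaround amounts to a single line in
/etc/modprobe.conf.local. This is only a sketch of that setting as described
above; the comment line is illustrative, and the module has to be reloaded
(or the node rebooted) before the option takes effect.

```
# /etc/modprobe.conf.local
# Disable MSI-X for the Mellanox ConnectX driver so mlx4_core will load
# (only needed with the OFED 1.4.0 that ships with SLES 11)
options mlx4_core msi_x=0
```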
After getting the module loaded, OpenMPI jobs error out with this
runtime error:
error polling HP CQ with -2 errno says Success
I've tried 2 different IB switches and multiple sets of nodes, all on one
switch or the other, to rule out the hardware. (IPoIB pings work and the
IB switches are error free.)
I've tried both OpenMPI v1.3.3 and v1.2.9 and get the same errors.
*Configuration 2.* SLES 11 + OFED 1.4.2
The systems hang, the GigE network stops working, and I have to power
cycle the nodes to log in.
No "error polling HP CQ with -2 errno says Success" errors seen;
sometimes systems will stay up long enough to run an MPI job.
*Configuration 3.* SLES 11 + OFED 1.5-beta
The systems hang, the GigE network stops working, and I have to power
cycle the nodes to log in.
No "error polling HP CQ with -2 errno says Success" errors seen;
sometimes systems will stay up long enough to run an MPI job.
*Configuration 4.* SLES 10 + OFED 1.4.2
System is stable and runs OpenMPI jobs without errors!
This raises some questions.
SLES11 seems stable until I add in OFED. So is OFED to blame?
Since OFED 1.4.2 works and is stable on SLES 10, why does SLES 11 + OFED
1.4.2 result in an unstable system? Is SLES 11 to blame?
Perhaps the power management in SLES 11 plus OFED has something to do
with the problem.
For now I'm just going to stick with SLES10, but I thought I'd post this
in the hope that at some point in the future I can have a stable SLES11 +
OFED combination.