From charles at asc.edu Thu Oct 15 07:16:12 2009
From: charles at asc.edu (Charles Wright)
Date: Thu, 15 Oct 2009 09:16:12 -0500
Subject: [Academic.hsir] Problem with SLES11 + OFED + mlnx
Message-ID: <4AD72EAC.6000109@asc.edu>

Hello,

I just got some new cluster hardware and could not get SLES11 and OFED to work on it. With SLES 11 I tried 3 different OFED configurations. (My new cluster hardware has Nehalem processors and it'd be nice to make use of the SLES 11 power management features for Nehalem CPUs.)

Here's my system info:

Compute nodes: http://www.supermicro.com/products/system/2U/6026/SYS-6026TT-IBXF.cfm
which have an integrated Mellanox Technologies MT26418 [ConnectX IB DDR, PCIe 2.0 5GT/s] (rev a0)

*Configuration 1.*
SLES 11 + the OFED that comes with SLES11 (this is OFED 1.4.0)

I had to set

options mlx4_core msi_x=0

in /etc/modprobe.conf.local to even get the mlx4 module to load. I found the advice for that here:
http://forums11.itrc.hp.com/service/forums/questionanswer.do?admit=109447626+1254161827534+28353475&threadId=1361415
(Note that under 1.4.2 and 1.5-beta the modules load fine without mlx4_core msi_x=0 being set.)

After getting the module loaded, OpenMPI jobs error out with this runtime error:

error polling HP CQ with -2 errno says Success

I've tried 2 different IB switches and multiple sets of nodes, all on one switch or the other, to rule out the hardware. (IPoIB pings work and the IB switches are error free.) I've tried both OpenMPI v1.3.3 and v1.2.9 and get the same errors.

*Configuration 2.*
SLES 11 + OFED 1.4.2

The systems hang and the GigE network stops working and I have to power cycle nodes to log in.
No "error polling HP CQ with -2 errno says Success" errors seen; sometimes the systems will stay up long enough to run an MPI job.

*Configuration 3.*
SLES 11 + OFED 1.5-beta

The systems hang and the GigE network stops working and I have to power cycle nodes to log in.
No "error polling HP CQ with -2 errno says Success" errors seen; sometimes the systems will stay up long enough to run an MPI job.

*Configuration 4.*
SLES 10 + OFED 1.4.2

System is stable and runs OpenMPI jobs without errors!

This raises some questions.

SLES11 seems stable until I add in OFED... So is OFED to blame?

Since OFED 1.4.2 works and is stable on SLES 10, why does SLES 11 + OFED 1.4.2 result in an unstable system? So is SLES 11 to blame?

Perhaps the power management in SLES 11 plus OFED has something to do with the problem.

For now I'm just going to stick with SLES10, but I thought I'd post this in the hope that at some point in the future I can have a stable SLES11 + OFED combination.
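
For anyone comparing nodes, here is a minimal sketch of how one could check whether the Configuration 1 workaround (options mlx4_core msi_x=0) is present in a modprobe config file. It assumes the usual "options <module> key=value" line syntax with "#" comment lines. The helper name module_options is hypothetical and is not part of OFED or SLES; the sample text only mirrors the line quoted above.

```python
def module_options(conf_text, module):
    """Collect key=value options set for `module` in modprobe-style config text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines (those starting with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # Expect: options <module> key=value [key=value ...]
        if len(fields) >= 3 and fields[0] == "options" and fields[1] == module:
            for item in fields[2:]:
                key, sep, value = item.partition("=")
                if sep:  # ignore tokens that are not key=value pairs
                    found[key] = value
    return found


# Sample text mirroring the workaround from Configuration 1 (illustrative only).
sample = """\
# workaround so the mlx4 module loads with OFED 1.4.0
options mlx4_core msi_x=0
"""

print(module_options(sample, "mlx4_core"))
```

On a real node one would pass in the contents of /etc/modprobe.conf.local instead of the sample string, for example open("/etc/modprobe.conf.local").read(). An empty result for mlx4_core would mean the workaround is not set in that file.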