[openib-general] IB + Dual-processor, dual-core Opteron + PCI-E

Charles Taylor taylor at hpc.ufl.edu
Thu Apr 20 07:09:08 PDT 2006


We have a 202-node cluster where each node is configured as follows:

dual-processor, dual-core Opteron 275
Asus K8N-DRE Motherboard
TopSpin/Cisco LionCub HCA in a 16x PCI-E slot
4 GB RAM (DDR 400)

The IB fabric is a two-tiered fat tree with 14 Cisco 7000 switches on the
edge and two Cisco 7008s in the first tier.

We can scale HPL runs reliably up to about 136 nodes/544 CPUs on any set of
nodes. Above that number of nodes/processors, our HPL runs begin to fail
residuals. We can run across all 202 nodes successfully if we use only two
procs/node, but four procs/node will *always* fail residuals. It feels like
a data corruption issue in the IB stack.
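For anyone less familiar with HPL: a run "fails residuals" when the scaled
residual HPL computes after the solve exceeds its pass/fail threshold, i.e.
the computed solution is wrong by far more than rounding error alone can
explain. The sketch below is only illustrative of the general shape of that
check (the exact norm combination and threshold differ between HPL versions,
and the function names here are made up, not HPL's own):

/* Illustrative sketch of an HPL-style scaled residual check.
 * Assumptions: the exact norms and threshold vary by HPL version;
 * norm_inf() and residual_passes() are hypothetical names. */
#include <float.h>
#include <math.h>

/* Infinity norm of a vector of length n. */
static double norm_inf(const double *v, int n)
{
    double m = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(v[i]) > m)
            m = fabs(v[i]);
    return m;
}

/* Returns 1 (pass) when the scaled residual is below the threshold.
 * A "failed residual" corresponds to returning 0: ||Ax - b|| is too
 * large relative to machine epsilon and the problem size, so the
 * answer is numerically wrong. */
int residual_passes(const double *r,  /* r = A*x - b, length n */
                    double norm_A,    /* ||A||_oo */
                    double norm_x,    /* ||x||_oo */
                    double norm_b,    /* ||b||_oo */
                    int n,
                    double threshold) /* on the order of 16 */
{
    double scaled = norm_inf(r, n) /
                    (DBL_EPSILON * (norm_A * norm_x + norm_b) * (double)n);
    return scaled < threshold;
}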

We have tried various combinations of the following software.

Kernel: 2.6.9-22, 2.6.9-34
IB stack: Topspin 3.2.0b82, OpenIB (IBGD 1.8.2)
MPI: MVAPICH 0.9.2/0.9.5 (Topspin), MVAPICH 0.9.6 (OSU), Open MPI 1.0.2
BLAS libs: Goto 1.00, 1.02, ACM 3.0.0

The result is the same in every case. We seem to be able to run HPL reliably
up to about 544-548 processors. It doesn't matter whether we run one MPI task
per processor or one MPI task per node with OMP_NUM_THREADS=4. The result is
always failed HPL residuals when we run across any subset of the cluster
above about 136 nodes using all four procs.
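(To spell out the two run modes above: either four single-threaded MPI ranks
per node, or one rank per node spanning all four cores via OpenMP. A minimal
probe along these lines confirms which layout a given job actually got; it is
just a sketch assuming an MPI + OpenMP toolchain, not part of the HPL runs
described here.)

/* Minimal hybrid MPI + OpenMP probe (sketch only).
 * Launched as four ranks per node it reports 4*N ranks with 1 thread
 * each; launched as one rank per node with OMP_NUM_THREADS=4 it
 * reports N ranks with 4 threads each. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("%d MPI ranks, %d OpenMP threads per rank\n",
               size, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}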

I'm wondering if anyone knows of any other large IB clusters using
dual-processor, dual-core Opterons + PCI-E with more than 136 nodes and, if
so, whether they have been able to successfully scale MPI apps across their
entire cluster.

Charlie Taylor
UF HPC Center



