[openib-general] IB + Dual-processor, dual-core Opteron + PCI-E

Matt Leininger mlleinin at hpcn.ca.sandia.gov
Thu Apr 20 10:05:32 PDT 2006


On Thu, 2006-04-20 at 10:09 -0400, Charles Taylor wrote:
> 	
> We have a 202-node cluster where each node is configured as follows...
> 
> dual-processor, dual-core Opteron 275
> Asus K8N-DRE Motherboard
> TopSpin/Cisco LionCub HCA in a 16x PCI-E slot
> 4 GB RAM (DDR 400)
> 
> The IB fabric is a two-tiered fat tree with 14 Cisco 7000 switches on
> the edge and two Cisco 7008s in the first tier.
> 
> We can reliably scale HPL runs up to about 136 nodes/544 CPUs on any
> set of nodes.  Above that number of nodes/processors, our HPL runs
> begin to fail residuals.  We can run across all 202 nodes successfully
> if we use only two procs/node, but four procs/node will *always* fail
> residuals.  It feels like a data corruption issue in the IB stack.
> 
> We have tried various combinations of the following software.
> 
> Kernel: 2.6.9-22, 2.6.9-34
> IB stack: topspin 3.2.0b82, OpenIB (IBGD 1.8.2)
> MPI: mvapich 092/095 (topspin), mvapich 096 (osu), OpenMPI 1.0.2
> BLAS libs: Goto 1.00, 1.02, ACML 3.0.0
> 
> The result is the same in every case.  We seem to be able to run HPL
> reliably up to about 544-548 processors.  It doesn't matter whether we
> run one MPI task per processor or one MPI task per node with
> OMP_NUM_THREADS=4.  The result is always failed HPL residuals when we
> run across any subset of the cluster above about 136 nodes using all
> four procs.
> 
> I'm wondering if anyone knows of other large IB clusters using
> dual-processor, dual-core Opterons + PCI-E with more than 136 nodes,
> and if so, whether they have been able to scale MPI apps successfully
> across their entire cluster?
> 
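For context, HPL verifies each run by scaling the solve residual and
comparing it against the threshold in HPL.dat, so a run that "fails
residuals" completes but produces numerically wrong answers.  Below is a
rough sketch of that style of check, not HPL itself: the exact formula
differs between HPL versions, and numpy plus the common threshold of
16.0 are assumptions here, not details from this thread.

import numpy as np

def hpl_scaled_residual(A, x, b):
    # ||Ax - b||_inf / (eps * (||A||_inf * ||x||_inf + ||b||_inf) * n),
    # the kind of scaled residual HPL compares against its threshold.
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps
    r = np.linalg.norm(A.dot(x) - b, np.inf)
    return r / (eps * (np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
                       + np.linalg.norm(b, np.inf)) * n)

n = 2000
A = np.random.rand(n, n)
b = np.random.rand(n)
x = np.linalg.solve(A, b)
res = hpl_scaled_residual(A, x, b)
print("scaled residual %.3g -> %s" % (res, "PASSED" if res < 16.0 else "FAILED"))

A healthy solve lands well under the threshold; silent corruption of
data in flight can push the residual up by orders of magnitude while the
job still exits normally, which matches the symptom described above.
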
  Charles,

     If you still see the problem after trying all of the software
combinations you listed above, then it's likely a hardware issue.

     I know of several ~256-node dual-processor, dual-core Opteron IB
clusters that are running linpack.  I've heard there can be issues with
"silent" data corruption on Opteron CPUs if they get too hot.  Are you
monitoring the node/CPU temps?  If CPU temperature is the issue, you
should see the problem whether you run a single linpack across all 202
nodes or several smaller linpacks at once (say, four 50-node runs).
I'll see if I can find the bug report for this problem.
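
One low-effort way to gather the temperature data Matt asks about is to
poll lm_sensors on every node while linpack runs.  The sketch below is
only illustrative and makes assumptions not stated in the thread:
passwordless ssh to each node, the `sensors` command installed, and a
hypothetical nodes.txt listing one hostname per line.

#!/usr/bin/env python3
import re
import subprocess
import time

NODES_FILE = "nodes.txt"   # hypothetical host list, one hostname per line
POLL_SECONDS = 60

# The first temperature on a `sensors` output line is the reading itself;
# the high/crit limits come later on the same line, so only the first
# match per line is taken.
TEMP_RE = re.compile(r"[+-]?(\d+(?:\.\d+)?)\s*(?:\xb0|deg )?C\b")

def max_temp(node):
    # Hottest temperature `sensors` reports on one node, or None.
    try:
        out = subprocess.run(["ssh", node, "sensors"], capture_output=True,
                             text=True, timeout=30).stdout
    except (subprocess.TimeoutExpired, OSError):
        return None
    temps = []
    for line in out.splitlines():
        m = TEMP_RE.search(line)
        if m:
            temps.append(float(m.group(1)))
    return max(temps) if temps else None

def main():
    with open(NODES_FILE) as f:
        nodes = [line.strip() for line in f if line.strip()]
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for node in nodes:
            t = max_temp(node)
            print("%s %s %s" % (stamp, node,
                                "%.1f C" % t if t is not None else "no reading"))
        time.sleep(POLL_SECONDS)

if __name__ == "__main__":
    main()

Comparing such logs against when residuals start failing, and against
simultaneous smaller runs like the four 50-node linpacks suggested
above, should show quickly whether heat correlates with the corruption.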

  - Matt