[ofa-general] OpenIB-cma: DAT_INSUFFICIENT_RESOURCES

Kilian CAVALOTTI kilian at stanford.edu
Tue Apr 3 11:22:52 PDT 2007


Hi all,

I'm not sure this is the right place to ask, but I'm running into a few issues 
trying to run Linpack on an InfiniBand cluster using OFED 1.1.

I'm using Intel MPI 3.0 to compile and run HPL, and on execution I get 
the following error messages:

*******************************************************************************
$ mpiexec -n 16 -env I_MPI_DEVICE rdssm ./xhpl
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   40000
NB     :     112
PMAP   : Row-major process mapping
P      :       4
Q      :       4
PFACT  :    Left
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 256)
L1     : no-transposed form
U      : no-transposed form
EQUIL  : no
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

register failed 196608 [0] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:

register failed 196608 [4] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:

register failed 196608 [12] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:

register failed 196608 [8] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:

rank 4 in job 11  node-9-1_42298   caused collective abort of all ranks
  exit status of rank 4: killed by signal 9
*******************************************************************************
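
For scale, the problem size in the run above can be turned into a rough memory estimate. The sketch below uses only the N, P and Q values printed by HPL and assumes 8-byte doubles and an even split of the matrix across the 16 processes. It ignores the right-hand side, workspace and the block-cyclic layout, so the real per-process figure will differ somewhat. The function names are illustrative, not part of HPL.

```python
# Rough HPL memory estimate from the parameters printed in the run above.
N = 40000            # order of the coefficient matrix A
P, Q = 4, 4          # process grid: 4 x 4 = 16 MPI processes
BYTES_PER_DOUBLE = 8


def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed to store the dense n x n double-precision matrix A."""
    return n * n * BYTES_PER_DOUBLE


def per_process_bytes(n: int, p: int, q: int) -> float:
    """Approximate share of A held by each process, assuming an even split."""
    return hpl_matrix_bytes(n) / (p * q)


total = hpl_matrix_bytes(N)
share = per_process_bytes(N, P, Q)
print(f"total matrix: {total / 1e9:.1f} GB")   # total matrix: 12.8 GB
print(f"per process : {share / 1e6:.0f} MB")   # per process : 800 MB
```

Each rank therefore works on roughly 800 MB of matrix data, far more than a 'Hello world' test ever touches.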

The same "register failed 196608 [0] error(0x30000): OpenIB-cma: DAT_INSUFFICIENT_RESOURCES:" 
error occurs when using rdma as I_MPI_DEVICE. I don't even know whether this is really an 
OpenIB issue or rather an MPI issue, but the basic multi-node MPI tests I ran ('Hello 
world' based) seemed to work fine.
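
One thing that can be checked from user space is the locked-memory limit, since RDMA memory registration pins pages and pinned memory counts against RLIMIT_MEMLOCK. Whether that limit is actually what triggers the failure here is an assumption, not something the output above confirms. The sketch below only reports the limit for the process it runs in, using Python's standard `resource` module; the helper names are illustrative. It also notes that the number after "register failed" (196608) is the same value as error(0x30000), written in decimal.

```python
import resource


def memlock_limits():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def to_bytes(value):
        return None if value == resource.RLIM_INFINITY else value

    return to_bytes(soft), to_bytes(hard)


def describe(limit):
    """Format a limit in MiB, or 'unlimited' when no limit is set."""
    return "unlimited" if limit is None else f"{limit / 2**20:.1f} MiB"


# 196608 in the log line is the decimal form of the 0x30000 error code.
assert 196608 == 0x30000

soft, hard = memlock_limits()
print("locked-memory soft limit:", describe(soft))
print("locked-memory hard limit:", describe(hard))
```

To be meaningful, the check has to run in the same environment as the MPI ranks on the compute nodes, since limits seen in an interactive shell may differ from those inherited by processes started through the MPI launcher.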

I would really appreciate any hint on this issue,

Thanks a lot,
-- 
Kilian


