[openib-general] openmpi on ib over pathscale (2.6.17-rc4+53 patches)
Roger Heflin
rheflin at atipa.com
Thu May 18 08:51:02 PDT 2006
Hello,
I have been doing some testing with hpl over openmpi over ib on a pathscale
card. Previously I had run the same code over no-mem non-pathscale cards with
no apparent issues.
Currently xhpl starts and runs for a while but then gets some odd error
messages (this is an improvement with the 53 patches, as before them the
kernel crashed the second machine every time on startup). There also appear
to be some odd issues upon startup: xhpl won't start initially (it hangs
forever), but if I do an "ifdown/up ib0" on both machines and then restart
it, it will start up.
Here is what the run looks like:
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,1][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,1][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,2][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,2][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
============================================================================
HPLinpack 1.0a -- High-Performance Linpack benchmark -- January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================
An explanation of the input/output parameters follows:
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] T/V
: Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 20480 28672 40960
NB : 64 80 96 112 120 128 136 144
152 160 240 288
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Left Crout
NBMIN : 2 4
NDIV : 2
RFACT : Left Crout
BCAST : 1ring 1ringM 2ring 2ringM Blong BlongM
DEPTH : 0
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L2 20480 64 2 2 208.52 2.747e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0190455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0054105 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0010075 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2L4 20480 64 2 2 208.51 2.747e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0206957 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0058793 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0010948 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C2 20480 64 2 2 210.67 2.719e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0190455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0054105 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0010075 ...... PASSED
============================================================================
T/V N NB P Q Time Gflops
----------------------------------------------------------------------------
WR00L2C4 20480 64 2 2 206.97 2.767e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0191957 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0054531 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0010154 ...... PASSED
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 12 for wr_id 47112348205468 opcode 0
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 5 for wr_id 47112348205468 opcode 0
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 5 for wr_id 47112348271288 opcode 0
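As a sanity check on the four runs that did complete, the Gflops column can be recomputed from N and the wall time. This is only a sketch: it assumes the usual (2/3)N^3 + (3/2)N^2 operation count for an LU factorization plus solve, and the helper name hpl_gflops is made up for illustration.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Gflops for an N x N solve, assuming (2/3)N^3 + (3/2)N^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# (variant, N, wall time in seconds) copied from the completed runs above
runs = [
    ("WR00L2L2", 20480, 208.52),
    ("WR00L2L4", 20480, 208.51),
    ("WR00L2C2", 20480, 210.67),
    ("WR00L2C4", 20480, 206.97),
]

for variant, n, secs in runs:
    # Same exponent format as the HPL results table
    print(f"{variant} {hpl_gflops(n, secs):.3e}")
```

Under that assumption the recomputed rates agree with the reported column (2.747e+01 for the first run, for example), so the numbers up to the hang look self-consistent.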
After the last of these messages the run appears to be hung. If I restart and
re-run it, it will hang at some different point. The previous run hung after
the first step; this run made it 4 steps. All processes are still running (and
using cpu), but no output is being returned any longer. ctrl-c will stop it
and does stop the processes on both nodes. Rebooting both nodes and starting
clean does not seem to change the behavior. The above 3 messages always appear
at the time that the hang appears to happen, so they do appear to be related.
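For reading the status numbers in those messages, here is a small sketch that maps them to names. It assumes openmpi is printing the raw ibv_wc status field from libibverbs, which I have not confirmed in the openmpi source; the numeric values follow the enum ibv_wc_status ordering in verbs.h, and the helper names (status_name, decode_line) are made up for illustration.

```python
import re

# Names in the order of enum ibv_wc_status in libibverbs' verbs.h
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
    14: "IBV_WC_LOC_RDD_VIOL_ERR",
    15: "IBV_WC_REM_INV_RD_REQ_ERR",
    16: "IBV_WC_REM_ABORT_ERR",
    17: "IBV_WC_INV_EECN_ERR",
    18: "IBV_WC_INV_EEC_STATE_ERR",
    19: "IBV_WC_FATAL_ERR",
    20: "IBV_WC_RESP_TIMEOUT_ERR",
    21: "IBV_WC_GENERAL_ERR",
}

# Matches the "error polling ... CQ with status N" text seen in the run
STATUS_RE = re.compile(r"error polling \w+ CQ with status (\d+)")


def status_name(code: int) -> str:
    """Return the enum name for a completion status, or a placeholder."""
    return IBV_WC_STATUS.get(code, f"unknown status {code}")


def decode_line(line: str):
    """Extract and name the status from one log line; None if absent."""
    match = STATUS_RE.search(line)
    return status_name(int(match.group(1))) if match else None


if __name__ == "__main__":
    sample = ("[0,1,0][btl_openib_component.c:722:"
              "mca_btl_openib_component_progress] error polling LP CQ "
              "with status 12 for wr_id 47112348205468 opcode 0")
    print(decode_line(sample))
```

If that assumption holds, status 12 would be a transport retry count exceeded error and the two status 5 entries would be work requests flushed once the queue pair went into the error state, which would fit the first message being the real failure and the other two being fallout.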
IP over openib still appears to be pingable, so IB is still up, and there are
no abnormal messages in dmesg/messages on either of the 2 machines being used.
I appear to be able to duplicate this, and I can collect any information that
would help when the hang happens. From a clean reboot it appears to last
longer, but in the end the messages look much the same.
Roger