[openib-general] openmpi on ib over pathscale (2.6.17-rc4+53 patches)

Roger Heflin rheflin at atipa.com
Thu May 18 08:51:02 PDT 2006


Hello,

I have been doing some testing with hpl over openmpi over ib on a pathscale
card.  Previously I had run the same code over no-mem non-pathscale cards with
no apparent issues.

Currently xhpl starts and runs for a while but gets some odd error messages
(this is an improvement with the 53 patches, as before them it crashed the
kernel on the second machine every time on startup).  There do appear to be
some odd issues upon startup: xhpl won't start initially (it hangs forever),
but if I "ifdown/up ib0" on both machines and then restart, it will start up.

Here is what the run looks like:

[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,1][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,1][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,2][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,2][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
[0,1,3][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20480    28672    40960
NB     :      64       80       96      112      120      128      136      144
              152      160      240      288
PMAP   : Row-major process mapping
P      :       2
Q      :       2
PFACT  :    Left    Crout
NBMIN  :       2        4
NDIV   :       2
RFACT  :    Left    Crout
BCAST  :   1ring   1ringM    2ring   2ringM    Blong   BlongM
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
[0,1,0][btl_openib_endpoint.c:889:mca_btl_openib_endpoint_create_qp] ibv_create_qp: returned 0 byte(s) for max inline data
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L2       20480    64     2     2             208.52          2.747e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0190455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0054105 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0010075 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L4       20480    64     2     2             208.51          2.747e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0206957 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0058793 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0010948 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C2       20480    64     2     2             210.67          2.719e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0190455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0054105 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0010075 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C4       20480    64     2     2             206.97          2.767e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0191957 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0054531 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0010154 ...... PASSED
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 12 for wr_id 47112348205468 opcode 0
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 5 for wr_id 47112348205468 opcode 0
[0,1,0][btl_openib_component.c:722:mca_btl_openib_component_progress] error polling LP CQ with status 5 for wr_id 47112348271288 opcode 0

At this point it appears to be hung.  If I restart and re-run, it will hang
at some different point.  The previous run hung after the first step; this run
made it 4 steps.  All processes are still running (and using cpu) but no
output is being returned any longer.  ctrl-c will stop it and does stop the
processes on both nodes.  Rebooting both nodes and starting clean does not
seem to change the behavior.  The above 3 messages always appear at the time
that the hang appears to happen, so they do appear to be related.
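
For reference, the status numbers in those messages can be decoded if one
assumes that Open MPI is printing the raw libibverbs work-completion status
(enum ibv_wc_status) there; that is an assumption, not something the output
itself confirms.  A minimal Python sketch follows.  The function name
decode_cq_errors and the regular expression are illustrative only, the log
lines are assumed to be unwrapped, and the table is a subset of the enum
from <infiniband/verbs.h>:

```python
import re

# Subset of enum ibv_wc_status from libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Assumption: the message format matches the "error polling ... CQ" lines
# quoted in this mail, with each message on a single (unwrapped) line.
CQ_ERROR = re.compile(
    r"error polling (\w+) CQ with status (\d+) for wr_id (\d+) opcode (\d+)"
)


def decode_cq_errors(log_text):
    """Return a list of (cq, status_name, wr_id) for each CQ error found."""
    decoded = []
    for match in CQ_ERROR.finditer(log_text):
        status = int(match.group(2))
        name = IBV_WC_STATUS.get(status, "unknown status %d" % status)
        decoded.append((match.group(1), name, int(match.group(3))))
    return decoded
```

Under that assumption, status 12 would be IBV_WC_RETRY_EXC_ERR (transport
retry count exceeded) and status 5 would be IBV_WC_WR_FLUSH_ERR, which is
what outstanding work requests normally complete with once a QP has gone
into the error state.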

IP over openib still appears to be pingable, so IB is still up, and there
are no abnormal messages in dmesg/messages on either of the 2 machines being
used.

I appear to be able to duplicate this, and I can collect any information that
would help when the hang happens.  From a clean reboot it appears to last
longer, but in the end the messages look much the same.
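
To compare how far different runs get before hanging, the saved xhpl output
can be scanned for completed result rows.  This is only a sketch: the
function name completed_tests is hypothetical, and it assumes each finished
test prints exactly one row beginning with a "WR" variant code, as in the
output above.

```python
import re

# Assumption: a completed HPL test prints one row of the form
#   WR00L2L2       20480    64     2     2     208.52     2.747e+01
# (variant, N, NB, P, Q, time in seconds, Gflops).
RESULT_ROW = re.compile(
    r"^(WR\w+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]+(\d+)"
    r"[ \t]+([\d.]+)[ \t]+([\d.eE+-]+)[ \t]*$",
    re.MULTILINE,
)


def completed_tests(hpl_output):
    """Return (variant, N, NB, seconds, gflops) for each finished test."""
    rows = []
    for m in RESULT_ROW.finditer(hpl_output):
        rows.append(
            (m.group(1), int(m.group(2)), int(m.group(3)),
             float(m.group(6)), float(m.group(7)))
        )
    return rows
```

The length of the returned list is the number of steps a given run finished
before it stopped producing output.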

                                        Roger


