[libfabric-users] FI_EP_MSG on cray

Howard Pritchard hppritcha at gmail.com
Thu Feb 16 14:30:29 PST 2017


Hi John,

Okay, I figured out the problem.  I do not know if this will be important
for your HPX work.
Basically, the way SLURM is configured at NERSC, and apparently at CSCS, is
that unless you request otherwise, each process launched by srun gets only
1/(total number of cores on the node) of the node's network resources (Aries
FMA descriptors, etc.).  The Cray internal systems apparently aren't
configured this way.  This is what causes the aborts in the GNI unit tests
you were seeing.

A workaround for that is to add the following to the run_gnitest script:

    args="-N1 --exclusive --cpu_bind=none -t00:20:00 --ntasks=1 --cpus-per-task=X"


where X is the number of cores on each node of Piz Daint.
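
As a rough sketch only (this is not taken from the actual run_gnitest script), the args string could be built from a core count passed in as a parameter. The function name build_gnitest_args and the value 24 are made up for illustration; substitute the real core count of your compute nodes.

```shell
#!/bin/sh
# Hypothetical helper: build the srun argument string shown above.
# $1 = number of cores per compute node (you must supply the right value
# for your system; it is not detected automatically here).
build_gnitest_args() {
    cores_per_node="$1"
    printf '%s' "-N1 --exclusive --cpu_bind=none -t00:20:00 --ntasks=1 --cpus-per-task=${cores_per_node}"
}

# Illustrative value only: a node with 24 cores.
args="$(build_gnitest_args 24)"
echo "$args"
```

Requesting all of the node's cores for the single task is what gives that one process the node's full share of network resources.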

The tests that are failing use multiple FMA descriptors per process, because
they test support for scalable endpoints and shared tx contexts.  So, if HPX
is going to use either of these libfabric constructs, you will need to
remember this --cpus-per-task SLURM argument.

I'll update the running criterion tests wiki.

Thanks,

Howard



2017-02-16 14:29 GMT-07:00 Howard Pritchard <hppritcha at gmail.com>:

> Hi John,
>
> I'm seeing this same problem at NERSC/edison.  I'll use that system to
> debug this problem.
>
> Howard
>
>
> 2017-02-15 13:40 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
>
>> Sung
>>
>> just fyi : I checked out the v2.2.0 branch of criterion and recompiled it
>> and libfabric  and got broadly the same results, slightly different number
>> of fails, but the same pattern.
>>
>> daint103:/scratch/snx3000/biddisco/src/libfabric-cray (master *=)$
>> ~/apps/libfabric/bin/run_gnitest
>> [----] Warning! The test `api_cq::msg_send_only` crashed during its setup
>> or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/cm.c:203: Assertion failed: fi_endpoint
>> [FAIL] cm_basic::srv_setup: (0.44s)
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::inject` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::inject_write` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::inject_write_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::inject_writedata` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::inject_writedata_retrans`
>> crashed during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::read` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::read_alignment` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::readmsg` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::readv` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_alignment` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_alignment_retrans`
>> crashed during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_autoreg` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_autoreg_uncached` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_error` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_fence` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_fence_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::write_retrans` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writedata` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writedata_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writemsg` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writemsg_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writev` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `dgram_rma_stx::writev_retrans` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::inject` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::inject_write` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::inject_write_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::inject_writedata` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::inject_writedata_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::read` crashed during its setup or
>> teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::read_alignment` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::read_alignment_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::read_error` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::read_retrans` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::readmsg` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::readmsg_retrans` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::readv` crashed during its setup or
>> teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::readv_retrans` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::trigger` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write` crashed during its setup or
>> teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_alignment` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_alignment_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_autoreg` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_autoreg_uncached` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_error` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_fence` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_fence_retrans` crashed
>> during its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::write_retrans` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writedata` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writedata_retrans` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writemsg` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writemsg_retrans` crashed during
>> its setup or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writev` crashed during its setup
>> or teardown.
>> [----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
>> [----] Warning! The test `rdm_rma_stx::writev_retrans` crashed during its
>> setup or teardown.
>> [----] prov/gni/test/sep.c:2343: Assertion failed: fi_scalable_ep
>> [FAIL] scalable::av_insert: (0.46s)
>> [----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
>> [----] Warning! The test `scalablem::all` crashed during its setup or
>> teardown.
>> [----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
>> [----] Warning! The test `scalablem::misc` crashed during its setup or
>> teardown.
>> [----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
>> [----] Warning! The test `scalablet::all` crashed during its setup or
>> teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_inter_cm`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_inter_cm_pp`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_intra_cm`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_intra_cm_pp`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_self` crashed
>> during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_auto::ep_connect_self_pp` crashed
>> during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_manual::ep_connect_inter_cm_pp`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_manual::ep_connect_intra_cm`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_manual::ep_connect_intra_cm_pp`
>> crashed during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_manual::ep_connect_self` crashed
>> during its setup or teardown.
>> Unidentified node: Error detected by libibgni.so.  Subsequent operation
>> may be unreliable.  IAA did not recognize this as an MPI process
>> [----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
>> [----] Warning! The test `vc_conn_ping_manual::ep_connect_self_pp`
>> crashed during its setup or teardown.
>> [====] Synthesis: Tested: 631 | Passing: 561 | Failing: 70 | Crashing: 68
>>
>>
>