[libfabric-users] FI_EP_MSG on cray

Biddiscombe, John A. biddisco at cscs.ch
Thu Feb 16 22:19:12 PST 2017


Howard,

Thanks for this info. HPX does its own thread->core binding so we never use the slurm settings, but I’ll bear this in mind if I discover problems and I’ll make sure I use the --cpus-per-task option - I should only ever run 1 process per node. I will know within a few days if our stuff runs properly as my first prototype is almost ready …

Yours

JB

From: Howard Pritchard <hppritcha at gmail.com>
Date: Thursday, 16 February 2017 at 23:30
To: John Biddiscombe <biddisco at cscs.ch>
Cc: Sung-Eun Choi <sungeun at cray.com>, "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] FI_EP_MSG on cray

Hi John,

Okay I figured out the problem.  I do not know if this will be important for your HPX work.
Basically, the way SLURM is configured at NERSC, and apparently at CSCS, is that
unless you specify otherwise, each process launched by srun gets only 1/(total number of cores on the node) of the network resources (Aries FMA descriptors, etc.).  The Cray internal systems apparently aren't
configured this way.  This is what causes the aborts in the GNI unit tests you were seeing.

A workaround for that is to add the following to the run_gnitest script:


    args="-N1 --exclusive --cpu_bind=none -t00:20:00 --ntasks=1 --cpus-per-task=X"



where X is the number of cores on the nodes of piz daint.
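As a rough sketch of how the X could be filled in rather than hard-coded: the helper below builds the same argument string from a core count. The function name build_gnitest_args is hypothetical and not part of run_gnitest. The fallback to nproc is also an assumption: it reports the processing units visible on the machine where the script runs, which may count hardware threads rather than cores and may not match the compute nodes, so passing the count explicitly is safer.

```shell
#!/bin/sh
# Hypothetical helper (not part of libfabric's run_gnitest script).
# $1: number of cores per compute node. If omitted, fall back to nproc,
#     which is only a guess: it describes the current machine and may
#     count hardware threads instead of physical cores.
build_gnitest_args() {
    cores="${1:-$(nproc)}"
    # Same options as the workaround above, with X replaced by the count.
    printf '%s\n' "-N1 --exclusive --cpu_bind=none -t00:20:00 --ntasks=1 --cpus-per-task=${cores}"
}
```

For example, on a node type with 36 cores (an illustrative number only), args="$(build_gnitest_args 36)" yields the string above ending in --cpus-per-task=36, so the single task is allotted the whole node's share of network resources.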

The tests that are failing use multiple FMA descriptors per process, as they test support for scalable endpoints and shared tx contexts.  So, if HPX is going to use either of these libfabric constructs, you will need to remember this --cpus-per-task SLURM argument.

I'll update the running criterion tests wiki.

Thanks,

Howard



2017-02-16 14:29 GMT-07:00 Howard Pritchard <hppritcha at gmail.com>:
Hi John,

I'm seeing this same problem at NERSC/edison.  I'll use that system to debug this problem.

Howard


2017-02-15 13:40 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
Sung

Just FYI: I checked out the v2.2.0 branch of criterion, recompiled it and libfabric, and got broadly the same results: a slightly different number of failures, but the same pattern.

daint103:/scratch/snx3000/biddisco/src/libfabric-cray (master *=)$ ~/apps/libfabric/bin/run_gnitest
[----] Warning! The test `api_cq::msg_send_only` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/cm.c:203: Assertion failed: fi_endpoint
[FAIL] cm_basic::srv_setup: (0.44s)
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::inject` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::inject_write` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::inject_write_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::inject_writedata` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::inject_writedata_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::read` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::read_alignment` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::readmsg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::readv` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_alignment` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_alignment_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_autoreg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_autoreg_uncached` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_error` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_fence` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_fence_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::write_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writedata` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writedata_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writemsg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writemsg_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writev` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `dgram_rma_stx::writev_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::inject` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::inject_write` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::inject_write_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::inject_writedata` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::inject_writedata_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::read` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::read_alignment` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::read_alignment_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::read_error` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::read_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::readmsg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::readmsg_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::readv` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::readv_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::trigger` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_alignment` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_alignment_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_autoreg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_autoreg_uncached` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_error` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_fence` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_fence_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::write_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writedata` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writedata_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writemsg` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writemsg_retrans` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writev` crashed during its setup or teardown.
[----] prov/gni/test/rdm_dgram_stx.c:165: Assertion failed: fi_endpoint
[----] Warning! The test `rdm_rma_stx::writev_retrans` crashed during its setup or teardown.
[----] prov/gni/test/sep.c:2343: Assertion failed: fi_scalable_ep
[FAIL] scalable::av_insert: (0.46s)
[----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
[----] Warning! The test `scalablem::all` crashed during its setup or teardown.
[----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
[----] Warning! The test `scalablem::misc` crashed during its setup or teardown.
[----] prov/gni/test/sep.c:177: Assertion failed: fi_scalable_ep
[----] Warning! The test `scalablet::all` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_inter_cm` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_inter_cm_pp` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_intra_cm` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_intra_cm_pp` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_self` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_auto::ep_connect_self_pp` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_manual::ep_connect_inter_cm_pp` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_manual::ep_connect_intra_cm` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_manual::ep_connect_intra_cm_pp` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_manual::ep_connect_self` crashed during its setup or teardown.
Unidentified node: Error detected by libibgni.so.  Subsequent operation may be unreliable.  IAA did not recognize this as an MPI process
[----] prov/gni/test/vc.c:271: Assertion failed: fi_endpoint
[----] Warning! The test `vc_conn_ping_manual::ep_connect_self_pp` crashed during its setup or teardown.
[====] Synthesis: Tested: 631 | Passing: 561 | Failing: 70 | Crashing: 68

