[libfabric-users] Test results on some machines
Biddiscombe, John A.
biddisco at cscs.ch
Mon Feb 13 02:59:10 PST 2017
PS. When I run the fi_pingpong test using ‘srun blah blah’ it works, but if I ssh into the two nodes and run the instances by hand, it does not.
Does this mean certain SLURM or other env vars are needed?
Sorry for posting so many questions etc.
JB
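One way to narrow down the env-var question is to capture the environment in both contexts and compare them. The sketch below is only an illustration: it assumes the output of `env` has been saved once from inside an srun job step and once from an ssh session, and the function names (parse_env, env_diff) are hypothetical helpers, not part of libfabric or SLURM.

```python
def parse_env(text):
    """Parse the output of `env` (one NAME=value per line) into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


def env_diff(srun_env, ssh_env):
    """Return (names set only under srun, names whose values differ)."""
    only_srun = sorted(set(srun_env) - set(ssh_env))
    changed = sorted(
        name
        for name in srun_env.keys() & ssh_env.keys()
        if srun_env[name] != ssh_env[name]
    )
    return only_srun, changed


# Tiny made-up sample; real input would be the two saved `env` dumps.
srun_sample = parse_env("SLURM_JOB_ID=1\nPATH=/bin\nHOME=/h")
ssh_sample = parse_env("PATH=/usr/bin\nHOME=/h")
print(env_diff(srun_sample, ssh_sample))
```

Variables that appear only under srun are the first candidates to export by hand before re-running the test over ssh.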
From: Libfabric-users [mailto:libfabric-users-bounces at lists.openfabrics.org] On Behalf Of Biddiscombe, John A.
Sent: 13 February 2017 11:43
To: libfabric-users at lists.openfabrics.org
Subject: [libfabric-users] Test results on some machines
I’m slightly troubled by the results I’ve got on 3 machines. The verbs provider seems to work, though the number of not-run tests is disturbing. The gni provider seems terrible, but when I run fi_pingpong from the libfabric build (not the one from fabtests), I can get it working on gni, but not with verbs.
Oddly, the mem reg test on gni fails flat out (when run by hand on a compute node):
/users/biddisco/apps/fabtests/bin/fi_mr_test
Testing MR on fabric gni
Running mr_reg [Test fi_mr_reg for various buffer sizes]...FAIL: fi_mr_reg failed: ret=12 (Cannot allocate memory)
Running mr_regv [Test fi_mr_regv]...FAIL: fi_mr_regv failed: ret=12 (Cannot allocate memory)
Running mr_regattr [Test fi_mr_regattr]...FAIL: fi_mr_regattr failed: ret=22 (Invalid argument)
Summary: 3 tests failed
I’ve no idea how the fi_pingpong test manages to work (see end of email for output) in light of that failure (I presume it uses mem reg too).
I was hoping that I’d find at least one test that worked on all 3 machines so that I’d have confidence that I could use it as a template to work from. It seems not to be so easy.
Can anyone shed light on these results and possibly give advice on how to improve the gni behaviour?
NB. If I enable extra output using -vvv, the basic problem with gni seems to be that every failure is due to
fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
or
fi_mr_reg(): util/pingpong.c:1317, ret=-12 (Cannot allocate memory)
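For reference, the numeric return codes in these messages can be mapped to symbolic names. This is a minimal sketch, assuming (as the message strings in the logs suggest) that the values are ordinary Linux errno numbers; describe_ret is a hypothetical helper name. The sign is ignored because the logs show both ret=12 and ret=-12.

```python
import errno
import os


def describe_ret(ret):
    """Return (symbolic errno name, message) for a return code of either sign."""
    code = abs(ret)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# The three codes that appear in the logs of this message.
for ret in (-61, -12, 22):
    name, message = describe_ret(ret)
    print(f"ret={ret}: {name} ({message})")
```

The names and strings depend on the platform's errno table, so this should be run on the Linux machine where the tests were executed.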
NB2. I tried the gni tests with nid00411-style node names instead of IP addresses, just in case, but it did not help.
Thanks
JB
(In each case I’ve allocated a couple of nodes and found the correct IP addresses for the fabric.)
# --------------------------------------------------------------
# Generic cluster with InfiniBand support
# --------------------------------------------------------------
$HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin verbs 192.168.3.36 192.168.3.38
# Test Result
# --------------------------------------------------------------
fi_getinfo_test -p verbs: Pass
fi_av_test -g 192.168.10.1 -n 1 -s 192.168.3.36 -p verbs: Pass
fi_dom_test -n 2 -p verbs: Pass
fi_eq_test -p verbs: Pass
fi_cq_test -p verbs: Pass
fi_mr_test -p verbs: Pass
fi_size_left_test -p verbs: Pass
fi_dgram g00n13s -p verbs: Pass
fi_rdm g00n13s -p verbs: Pass
fi_msg g00n13s -p verbs: Pass
fi_cm_data -p verbs: Pass
fi_cq_data -p verbs: Pass
fi_dgram -p verbs: Notrun
fi_dgram_waitset -p verbs: Notrun
fi_msg -p verbs: Pass
fi_msg_epoll -p verbs: Pass
fi_msg_sockets -p verbs: Pass
fi_poll -t queue -p verbs: Notrun
fi_poll -t counter -p verbs: Notrun
fi_rdm -p verbs: Pass
fi_rdm_rma_simple -p verbs: Notrun
fi_rdm_rma_trigger -p verbs: Notrun
fi_shared_ctx -p verbs: Notrun
fi_shared_ctx --no-tx-shared-ctx -p verbs: Notrun
fi_shared_ctx --no-rx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e msg -p verbs: Notrun
fi_shared_ctx -e msg --no-tx-shared-ctx -p verbs: Pass
fi_shared_ctx -e msg --no-rx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e dgram -p verbs: Notrun
fi_shared_ctx -e dgram --no-tx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e dgram --no-rx-shared-ctx -p verbs: Notrun
fi_rdm_tagged_peek -p verbs: Pass
fi_scalable_ep -p verbs: Notrun
fi_cmatose -p verbs: Pass
fi_rdm_shared_av -p verbs: Notrun
fi_msg_pingpong -I 5 -p verbs: Pass
fi_msg_bw -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o write -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o read -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o writedata -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o write -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o read -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o writedata -I 5 -p verbs: Notrun
fi_msg_rma -o write -I 5 -p verbs: Pass
fi_msg_rma -o read -I 5 -p verbs: Pass
fi_msg_rma -o writedata -I 5 -p verbs: Pass
fi_msg_stream -I 5 -p verbs: Pass
fi_rdm_atomic -I 5 -o all -p verbs: Notrun
fi_rdm_cntr_pingpong -I 5 -p verbs: Notrun
fi_rdm_multi_recv -I 5 -p verbs: Pass
fi_rdm_pingpong -I 5 -p verbs: Pass
fi_rdm_rma -o write -I 5 -p verbs: Notrun
fi_rdm_rma -o read -I 5 -p verbs: Notrun
fi_rdm_rma -o writedata -I 5 -p verbs: Notrun
fi_rdm_tagged_pingpong -I 5 -p verbs: Pass
fi_rdm_tagged_bw -I 5 -p verbs: Pass
fi_dgram_pingpong -I 5 -p verbs: Notrun
fi_rc_pingpong -n 5 -p verbs: Pass
fi_rc_pingpong -n 5 -e -p verbs: Pass
# --------------------------------------------------------------
# Total Pass 30
# Total Notrun 29
# Total Fail 0
# Percentage of Pass 100
# --------------------------------------------------------------
# --------------------------------------------------------------
# Cray XC40 with gni
# --------------------------------------------------------------
$HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin gni 148.187.33.168 148.187.33.172
# Test Result
# --------------------------------------------------------------
fi_getinfo_test -p gni: Pass
fi_av_test -g 192.168.10.1 -n 1 -s 148.187.33.168 -p gni: Pass
fi_dom_test -n 2 -p gni: Pass
fi_eq_test -p gni: Pass
fi_cq_test -p gni: Pass
fi_mr_test -p gni: Fail
fi_size_left_test -p gni: Fail
fi_dgram g00n13s -p gni: Pass
fi_rdm g00n13s -p gni: Pass
fi_msg g00n13s -p gni: Pass
fi_cm_data -p gni: Fail
fi_cq_data -p gni: Fail
fi_dgram -p gni: Fail
fi_dgram_waitset -p gni: Fail
fi_msg -p gni: Fail
fi_msg_epoll -p gni: Fail
fi_msg_sockets -p gni: Fail
fi_poll -t queue -p gni: Fail
fi_poll -t counter -p gni: Fail
fi_rdm -p gni: Fail
fi_rdm_rma_simple -p gni: Notrun
fi_rdm_rma_trigger -p gni: Notrun
fi_shared_ctx -p gni: Notrun
fi_shared_ctx --no-tx-shared-ctx -p gni: Notrun
fi_shared_ctx --no-rx-shared-ctx -p gni: Fail
fi_shared_ctx -e msg -p gni: Notrun
fi_shared_ctx -e msg --no-tx-shared-ctx -p gni: Notrun
fi_shared_ctx -e msg --no-rx-shared-ctx -p gni: Fail
fi_shared_ctx -e dgram -p gni: Notrun
fi_shared_ctx -e dgram --no-tx-shared-ctx -p gni: Notrun
fi_shared_ctx -e dgram --no-rx-shared-ctx -p gni: Fail
fi_rdm_tagged_peek -p gni: Fail
fi_scalable_ep -p gni: Fail
fi_cmatose -p gni: Fail
fi_rdm_shared_av -p gni: Fail
fi_msg_pingpong -I 5 -p gni: Fail
fi_msg_bw -I 5 -p gni: Fail
fi_rma_bw -e msg -o write -I 5 -p gni: Fail
fi_rma_bw -e msg -o read -I 5 -p gni: Fail
fi_rma_bw -e msg -o writedata -I 5 -p gni: Fail
fi_rma_bw -e rdm -o write -I 5 -p gni: Fail
fi_rma_bw -e rdm -o read -I 5 -p gni: Fail
fi_rma_bw -e rdm -o writedata -I 5 -p gni: Fail
fi_msg_rma -o write -I 5 -p gni: Fail
fi_msg_rma -o read -I 5 -p gni: Fail
fi_msg_rma -o writedata -I 5 -p gni: Fail
fi_msg_stream -I 5 -p gni: Fail
fi_rdm_atomic -I 5 -o all -p gni: Fail
fi_rdm_cntr_pingpong -I 5 -p gni: Fail
fi_rdm_multi_recv -I 5 -p gni: Fail
fi_rdm_pingpong -I 5 -p gni: Fail
fi_rdm_rma -o write -I 5 -p gni: Fail
fi_rdm_rma -o read -I 5 -p gni: Fail
fi_rdm_rma -o writedata -I 5 -p gni: Fail
fi_rdm_tagged_pingpong -I 5 -p gni: Fail
fi_rdm_tagged_bw -I 5 -p gni: Fail
fi_dgram_pingpong -I 5 -p gni: Fail
fi_rc_pingpong -n 5 -p gni: Fail
fi_rc_pingpong -n 5 -e -p gni: Fail
# --------------------------------------------------------------
# Total Pass 8
# Total Notrun 8
# Total Fail 43
# Percentage of Pass 15
# --------------------------------------------------------------
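The summary figures can be cross-checked from the per-test lines. The sketch below is a hypothetical helper, not part of fabtests; it assumes each result line ends in ": Pass", ": Notrun" or ": Fail". The percentage formula is inferred from the numbers reported here (8 passes and 43 failures give 15, while 30 passes and 0 failures give 100), which is consistent with not-run tests being excluded and the result being truncated to an integer.

```python
from collections import Counter

RESULTS = ("Pass", "Notrun", "Fail")


def tally(report):
    """Count Pass/Notrun/Fail lines in a pasted runfabtests.sh report."""
    counts = Counter()
    for line in report.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip separators and summary lines
        _, sep, result = line.rpartition(":")
        result = result.strip()
        if sep and result in RESULTS:
            counts[result] += 1
    return counts


def pass_percentage(counts):
    """Integer pass rate over tests that actually ran (inferred formula)."""
    ran = counts["Pass"] + counts["Fail"]
    return 100 * counts["Pass"] // ran if ran else 0


# Totals reported for the gni run above.
print(pass_percentage(Counter(Pass=8, Notrun=8, Fail=43)))
```

Read this way, "Percentage of Pass 100" on the verbs runs says nothing about the 29 tests that were not run.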
# --------------------------------------------------------------
# Cray with Omni-Path and verbs
# --------------------------------------------------------------
$HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin verbs 192.168.18.65 192.168.18.66
# Test Result
# --------------------------------------------------------------
fi_getinfo_test -p verbs: Pass
fi_av_test -g 192.168.10.1 -n 1 -s 192.168.18.65 -p verbs: Pass
fi_dom_test -n 2 -p verbs: Pass
fi_eq_test -p verbs: Pass
fi_cq_test -p verbs: Pass
fi_mr_test -p verbs: Pass
fi_size_left_test -p verbs: Pass
fi_dgram g00n13s -p verbs: Pass
fi_rdm g00n13s -p verbs: Pass
fi_msg g00n13s -p verbs: Pass
fi_cm_data -p verbs: Pass
fi_cq_data -p verbs: Pass
fi_dgram -p verbs: Notrun
fi_dgram_waitset -p verbs: Notrun
fi_msg -p verbs: Pass
fi_msg_epoll -p verbs: Pass
fi_msg_sockets -p verbs: Pass
fi_poll -t queue -p verbs: Notrun
fi_poll -t counter -p verbs: Notrun
fi_rdm -p verbs: Pass
fi_rdm_rma_simple -p verbs: Notrun
fi_rdm_rma_trigger -p verbs: Notrun
fi_shared_ctx -p verbs: Notrun
fi_shared_ctx --no-tx-shared-ctx -p verbs: Notrun
fi_shared_ctx --no-rx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e msg -p verbs: Notrun
fi_shared_ctx -e msg --no-tx-shared-ctx -p verbs: Pass
fi_shared_ctx -e msg --no-rx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e dgram -p verbs: Notrun
fi_shared_ctx -e dgram --no-tx-shared-ctx -p verbs: Notrun
fi_shared_ctx -e dgram --no-rx-shared-ctx -p verbs: Notrun
fi_rdm_tagged_peek -p verbs: Pass
fi_scalable_ep -p verbs: Notrun
fi_cmatose -p verbs: Pass
fi_rdm_shared_av -p verbs: Notrun
fi_msg_pingpong -I 5 -p verbs: Pass
fi_msg_bw -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o write -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o read -I 5 -p verbs: Notrun
fi_rma_bw -e msg -o writedata -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o write -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o read -I 5 -p verbs: Notrun
fi_rma_bw -e rdm -o writedata -I 5 -p verbs: Notrun
fi_msg_rma -o write -I 5 -p verbs: Pass
fi_msg_rma -o read -I 5 -p verbs: Pass
fi_msg_rma -o writedata -I 5 -p verbs: Pass
fi_msg_stream -I 5 -p verbs: Pass
fi_rdm_atomic -I 5 -o all -p verbs: Notrun
fi_rdm_cntr_pingpong -I 5 -p verbs: Notrun
fi_rdm_multi_recv -I 5 -p verbs: Pass
fi_rdm_pingpong -I 5 -p verbs: Pass
fi_rdm_rma -o write -I 5 -p verbs: Notrun
fi_rdm_rma -o read -I 5 -p verbs: Notrun
fi_rdm_rma -o writedata -I 5 -p verbs: Notrun
fi_rdm_tagged_pingpong -I 5 -p verbs: Pass
fi_rdm_tagged_bw -I 5 -p verbs: Pass
fi_dgram_pingpong -I 5 -p verbs: Notrun
fi_rc_pingpong -n 5 -p verbs: Pass
fi_rc_pingpong -n 5 -e -p verbs: Pass
# --------------------------------------------------------------
# Total Pass 30
# Total Notrun 29
# Total Fail 0
# Percentage of Pass 100
# --------------------------------------------------------------
# --------------------------------------------------------------
# ping pong test from libfabric on gni
# --------------------------------------------------------------
./frun.sh /users/biddisco/apps/libfabric/bin/fi_pingpong
running /users/biddisco/apps/libfabric/bin/fi_pingpong on nid00[421,425]
nid00421 is 148.187.33.168
Generated command is srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
0 /users/biddisco/apps/libfabric/bin/fi_pingpong -p gni
1 /users/biddisco/apps/libfabric/bin/fi_pingpong -p gni nid00421
1: bytes #sent #ack total time MB/sec usec/xfer Mxfers/sec
1: 64 10k =10k 1.2m 0.05s 25.52 2.51 0.40
1: 256 10k =10k 4.8m 0.06s 85.54 2.99 0.33
1: 1k 10k =10k 19m 0.05s 454.41 2.25 0.44
1: 4k 10k =10k 78m 0.07s 1254.25 3.27 0.31
1: 64k 1k =1k 125m 0.04s 3304.97 19.83 0.05
1: 1m 100 =100 200m 0.05s 4422.23 237.12 0.00
0: bytes #sent #ack total time MB/sec usec/xfer Mxfers/sec
0: 64 10k =10k 1.2m 0.05s 25.52 2.51 0.40
0: 256 10k =10k 4.8m 0.06s 85.53 2.99 0.33
0: 1k 10k =10k 19m 0.05s 454.35 2.25 0.44
0: 4k 10k =10k 78m 0.07s 1254.17 3.27 0.31
0: 64k 1k =1k 125m 0.04s 3304.56 19.83 0.05
0: 1m 100 =100 200m 0.05s 4421.01 237.18 0.00
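As a sanity check on the table above, the MB/sec column appears to be the message size divided by the usec/xfer column. The sketch below recomputes it from the rank 1 rows, assuming that 1k, 4k, 64k and 1m are 1024-based sizes and that MB means 10^6 bytes; mb_per_sec is a hypothetical helper name.

```python
def mb_per_sec(msg_bytes, usec_per_xfer):
    """Bytes per microsecond, which equals 10^6 bytes per second."""
    return msg_bytes / usec_per_xfer


# (message size in bytes, usec/xfer, reported MB/sec), copied from rank 1.
rows = [
    (64, 2.51, 25.52),
    (256, 2.99, 85.54),
    (1024, 2.25, 454.41),
    (4096, 3.27, 1254.25),
    (65536, 19.83, 3304.97),
    (1048576, 237.12, 4422.23),
]

for size, usec, reported in rows:
    derived = mb_per_sec(size, usec)
    print(f"{size:>8} bytes: derived {derived:8.2f} MB/sec, reported {reported:8.2f}")
```

The small differences are what one would expect from usec/xfer being rounded to two decimals in the printed output.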