[libfabric-users] Test results on some machines
Sung-Eun Choi
sungeun at cray.com
Mon Feb 13 07:06:38 PST 2017
Hi John,
To launch the fabtests with the gni provider, you need to do it either
by hand or via CCM mode. Please see our wiki for directions:
https://github.com/ofi-cray/libfabric-cray/wiki/Running-fabtests-with-the-GNI-provider
Results from head of our master (ofi-cray):
# --------------------------------------------------------------
# Total Pass 52
# Total Notrun 12
# Total Fail 4
# Percentage of Pass 92
# --------------------------------------------------------------
Let us know if there's somewhere else we can put this info to make it
clearer to people who want to run fabtests with the gni provider.
-- Sung
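A note for comparing the summaries in this thread: the "Percentage of Pass" figure looks like passes divided by the tests that actually ran (Pass + Fail), with Notrun excluded and the result truncated to an integer. That is inferred from the three sets of numbers quoted here, not from reading runfabtests.sh, so treat the formula as an assumption. A small Python sketch (the function name is made up for illustration):

```python
def pass_percentage(passed: int, failed: int) -> int:
    """Integer pass rate over tests that ran; Notrun tests are ignored.

    Assumed formula, inferred from the summaries in this thread.
    """
    ran = passed + failed
    if ran == 0:
        return 0  # nothing ran, so report 0 rather than divide by zero
    return passed * 100 // ran


# (pass, notrun, fail, reported percentage) as quoted in this thread
summaries = {
    "gni, ofi-cray master": (52, 12, 4, 92),
    "verbs": (30, 29, 0, 100),
    "gni, xc40": (8, 8, 43, 15),
}

for name, (passed, notrun, failed, reported) in summaries.items():
    computed = pass_percentage(passed, failed)
    print(f"{name}: computed {computed}%, reported {reported}%")
```

Under that reading, a 100% pass rate alongside 29 Notrun tests only says that nothing that ran failed.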
On Mon, Feb 13, 2017 at 10:42:58AM +0000, Biddiscombe, John A. wrote:
> I’m slightly troubled by the results I’ve got on 3 machines. The verbs provider seems to work, though the number of not-run tests is disturbing. The gni provider seems terrible, yet when I run fi_pingpong from the libfabric build (not one of the fabtests), I can get it working on gni, but not with verbs.
>
>
>
> Oddly, the mem reg test on gni fails flat out (when run by hand on a compute node):
>
>
>
> /users/biddisco/apps/fabtests/bin/fi_mr_test
> Testing MR on fabric gni
> Running mr_reg [Test fi_mr_reg for various buffer sizes]...FAIL: fi_mr_reg failed: ret=12 (Cannot allocate memory)
> Running mr_regv [Test fi_mr_regv]...FAIL: fi_mr_regv failed: ret=12 (Cannot allocate memory)
> Running mr_regattr [Test fi_mr_regattr]...FAIL: fi_mr_regattr failed: ret=22 (Invalid argument)
> Summary: 3 tests failed
>
>
>
> I’ve no idea how the fi_pingpong test manages to work (see end of email for output) in light of that failure (I presume it uses memory registration).
>
>
>
> I was hoping that I’d find at least one test that worked on all 3 machines so that I’d have confidence that I could use it as a template to work from. It seems not to be so easy.
>
>
>
> Can anyone shed light on these results and possibly give advice on how to improve the gni behaviour?
>
> NB. If I enable extra output using -vvv, the basic problem with gni seems to be that every failure is due to either
>
> fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
>
> or
>
> fi_mr_reg(): util/pingpong.c:1317, ret=-12 (Cannot allocate memory)
>
>
>
> NB2. I tried the gni tests with node names of the form nid00411 instead of IP addresses just in case, but it did not help.
>
>
>
> Thanks
>
>
>
> JB
>
>
>
> (In each case I’ve allocated a couple of nodes and found the correct IP addresses for the fabric.)
>
>
>
> # --------------------------------------------------------------
> Generic cluster with infiniband support
> # --------------------------------------------------------------
>
> $HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin verbs 192.168.3.36 192.168.3.38
>
> # Test Result
> # --------------------------------------------------------------
> fi_getinfo_test -p verbs: Pass
> fi_av_test -g 192.168.10.1 -n 1 -s 192.168.3.36 -p verbs: Pass
> fi_dom_test -n 2 -p verbs: Pass
> fi_eq_test -p verbs: Pass
> fi_cq_test -p verbs: Pass
> fi_mr_test -p verbs: Pass
> fi_size_left_test -p verbs: Pass
> fi_dgram g00n13s -p verbs: Pass
> fi_rdm g00n13s -p verbs: Pass
> fi_msg g00n13s -p verbs: Pass
> fi_cm_data -p verbs: Pass
> fi_cq_data -p verbs: Pass
> fi_dgram -p verbs: Notrun
> fi_dgram_waitset -p verbs: Notrun
> fi_msg -p verbs: Pass
> fi_msg_epoll -p verbs: Pass
> fi_msg_sockets -p verbs: Pass
> fi_poll -t queue -p verbs: Notrun
> fi_poll -t counter -p verbs: Notrun
> fi_rdm -p verbs: Pass
> fi_rdm_rma_simple -p verbs: Notrun
> fi_rdm_rma_trigger -p verbs: Notrun
> fi_shared_ctx -p verbs: Notrun
> fi_shared_ctx --no-tx-shared-ctx -p verbs: Notrun
> fi_shared_ctx --no-rx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e msg -p verbs: Notrun
> fi_shared_ctx -e msg --no-tx-shared-ctx -p verbs: Pass
> fi_shared_ctx -e msg --no-rx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e dgram -p verbs: Notrun
> fi_shared_ctx -e dgram --no-tx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e dgram --no-rx-shared-ctx -p verbs: Notrun
> fi_rdm_tagged_peek -p verbs: Pass
> fi_scalable_ep -p verbs: Notrun
> fi_cmatose -p verbs: Pass
> fi_rdm_shared_av -p verbs: Notrun
> fi_msg_pingpong -I 5 -p verbs: Pass
> fi_msg_bw -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o write -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o read -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o writedata -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o write -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o read -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o writedata -I 5 -p verbs: Notrun
> fi_msg_rma -o write -I 5 -p verbs: Pass
> fi_msg_rma -o read -I 5 -p verbs: Pass
> fi_msg_rma -o writedata -I 5 -p verbs: Pass
> fi_msg_stream -I 5 -p verbs: Pass
> fi_rdm_atomic -I 5 -o all -p verbs: Notrun
> fi_rdm_cntr_pingpong -I 5 -p verbs: Notrun
> fi_rdm_multi_recv -I 5 -p verbs: Pass
> fi_rdm_pingpong -I 5 -p verbs: Pass
> fi_rdm_rma -o write -I 5 -p verbs: Notrun
> fi_rdm_rma -o read -I 5 -p verbs: Notrun
> fi_rdm_rma -o writedata -I 5 -p verbs: Notrun
> fi_rdm_tagged_pingpong -I 5 -p verbs: Pass
> fi_rdm_tagged_bw -I 5 -p verbs: Pass
> fi_dgram_pingpong -I 5 -p verbs: Notrun
> fi_rc_pingpong -n 5 -p verbs: Pass
> fi_rc_pingpong -n 5 -e -p verbs: Pass
> # --------------------------------------------------------------
> # Total Pass 30
> # Total Notrun 29
> # Total Fail 0
> # Percentage of Pass 100
> # --------------------------------------------------------------
>
>
>
> # --------------------------------------------------------------
> cray xc40 with gni
> # --------------------------------------------------------------
>
> $HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin gni 148.187.33.168 148.187.33.172
>
> # Test Result
> # --------------------------------------------------------------
> fi_getinfo_test -p gni: Pass
> fi_av_test -g 192.168.10.1 -n 1 -s 148.187.33.168 -p gni: Pass
> fi_dom_test -n 2 -p gni: Pass
> fi_eq_test -p gni: Pass
> fi_cq_test -p gni: Pass
> fi_mr_test -p gni: Fail
> fi_size_left_test -p gni: Fail
> fi_dgram g00n13s -p gni: Pass
> fi_rdm g00n13s -p gni: Pass
> fi_msg g00n13s -p gni: Pass
> fi_cm_data -p gni: Fail
> fi_cq_data -p gni: Fail
> fi_dgram -p gni: Fail
> fi_dgram_waitset -p gni: Fail
> fi_msg -p gni: Fail
> fi_msg_epoll -p gni: Fail
> fi_msg_sockets -p gni: Fail
> fi_poll -t queue -p gni: Fail
> fi_poll -t counter -p gni: Fail
> fi_rdm -p gni: Fail
> fi_rdm_rma_simple -p gni: Notrun
> fi_rdm_rma_trigger -p gni: Notrun
> fi_shared_ctx -p gni: Notrun
> fi_shared_ctx --no-tx-shared-ctx -p gni: Notrun
> fi_shared_ctx --no-rx-shared-ctx -p gni: Fail
> fi_shared_ctx -e msg -p gni: Notrun
> fi_shared_ctx -e msg --no-tx-shared-ctx -p gni: Notrun
> fi_shared_ctx -e msg --no-rx-shared-ctx -p gni: Fail
> fi_shared_ctx -e dgram -p gni: Notrun
> fi_shared_ctx -e dgram --no-tx-shared-ctx -p gni: Notrun
> fi_shared_ctx -e dgram --no-rx-shared-ctx -p gni: Fail
> fi_rdm_tagged_peek -p gni: Fail
> fi_scalable_ep -p gni: Fail
> fi_cmatose -p gni: Fail
> fi_rdm_shared_av -p gni: Fail
> fi_msg_pingpong -I 5 -p gni: Fail
> fi_msg_bw -I 5 -p gni: Fail
> fi_rma_bw -e msg -o write -I 5 -p gni: Fail
> fi_rma_bw -e msg -o read -I 5 -p gni: Fail
> fi_rma_bw -e msg -o writedata -I 5 -p gni: Fail
> fi_rma_bw -e rdm -o write -I 5 -p gni: Fail
> fi_rma_bw -e rdm -o read -I 5 -p gni: Fail
> fi_rma_bw -e rdm -o writedata -I 5 -p gni: Fail
> fi_msg_rma -o write -I 5 -p gni: Fail
> fi_msg_rma -o read -I 5 -p gni: Fail
> fi_msg_rma -o writedata -I 5 -p gni: Fail
> fi_msg_stream -I 5 -p gni: Fail
> fi_rdm_atomic -I 5 -o all -p gni: Fail
> fi_rdm_cntr_pingpong -I 5 -p gni: Fail
> fi_rdm_multi_recv -I 5 -p gni: Fail
> fi_rdm_pingpong -I 5 -p gni: Fail
> fi_rdm_rma -o write -I 5 -p gni: Fail
> fi_rdm_rma -o read -I 5 -p gni: Fail
> fi_rdm_rma -o writedata -I 5 -p gni: Fail
> fi_rdm_tagged_pingpong -I 5 -p gni: Fail
> fi_rdm_tagged_bw -I 5 -p gni: Fail
> fi_dgram_pingpong -I 5 -p gni: Fail
> fi_rc_pingpong -n 5 -p gni: Fail
> fi_rc_pingpong -n 5 -e -p gni: Fail
> # --------------------------------------------------------------
> # Total Pass 8
> # Total Notrun 8
> # Total Fail 43
> # Percentage of Pass 15
> # --------------------------------------------------------------
>
>
>
> # --------------------------------------------------------------
> cray with omnipath and verbs
> # --------------------------------------------------------------
>
> $HOME/apps/fabtests/bin/runfabtests.sh -p $HOME/apps/fabtests/bin verbs 192.168.18.65 192.168.18.66
>
> # Test Result
> # --------------------------------------------------------------
> fi_getinfo_test -p verbs: Pass
> fi_av_test -g 192.168.10.1 -n 1 -s 192.168.18.65 -p verbs: Pass
> fi_dom_test -n 2 -p verbs: Pass
> fi_eq_test -p verbs: Pass
> fi_cq_test -p verbs: Pass
> fi_mr_test -p verbs: Pass
> fi_size_left_test -p verbs: Pass
> fi_dgram g00n13s -p verbs: Pass
> fi_rdm g00n13s -p verbs: Pass
> fi_msg g00n13s -p verbs: Pass
> fi_cm_data -p verbs: Pass
> fi_cq_data -p verbs: Pass
> fi_dgram -p verbs: Notrun
> fi_dgram_waitset -p verbs: Notrun
> fi_msg -p verbs: Pass
> fi_msg_epoll -p verbs: Pass
> fi_msg_sockets -p verbs: Pass
> fi_poll -t queue -p verbs: Notrun
> fi_poll -t counter -p verbs: Notrun
> fi_rdm -p verbs: Pass
> fi_rdm_rma_simple -p verbs: Notrun
> fi_rdm_rma_trigger -p verbs: Notrun
> fi_shared_ctx -p verbs: Notrun
> fi_shared_ctx --no-tx-shared-ctx -p verbs: Notrun
> fi_shared_ctx --no-rx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e msg -p verbs: Notrun
> fi_shared_ctx -e msg --no-tx-shared-ctx -p verbs: Pass
> fi_shared_ctx -e msg --no-rx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e dgram -p verbs: Notrun
> fi_shared_ctx -e dgram --no-tx-shared-ctx -p verbs: Notrun
> fi_shared_ctx -e dgram --no-rx-shared-ctx -p verbs: Notrun
> fi_rdm_tagged_peek -p verbs: Pass
> fi_scalable_ep -p verbs: Notrun
> fi_cmatose -p verbs: Pass
> fi_rdm_shared_av -p verbs: Notrun
> fi_msg_pingpong -I 5 -p verbs: Pass
> fi_msg_bw -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o write -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o read -I 5 -p verbs: Notrun
> fi_rma_bw -e msg -o writedata -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o write -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o read -I 5 -p verbs: Notrun
> fi_rma_bw -e rdm -o writedata -I 5 -p verbs: Notrun
> fi_msg_rma -o write -I 5 -p verbs: Pass
> fi_msg_rma -o read -I 5 -p verbs: Pass
> fi_msg_rma -o writedata -I 5 -p verbs: Pass
> fi_msg_stream -I 5 -p verbs: Pass
> fi_rdm_atomic -I 5 -o all -p verbs: Notrun
> fi_rdm_cntr_pingpong -I 5 -p verbs: Notrun
> fi_rdm_multi_recv -I 5 -p verbs: Pass
> fi_rdm_pingpong -I 5 -p verbs: Pass
> fi_rdm_rma -o write -I 5 -p verbs: Notrun
> fi_rdm_rma -o read -I 5 -p verbs: Notrun
> fi_rdm_rma -o writedata -I 5 -p verbs: Notrun
> fi_rdm_tagged_pingpong -I 5 -p verbs: Pass
> fi_rdm_tagged_bw -I 5 -p verbs: Pass
> fi_dgram_pingpong -I 5 -p verbs: Notrun
> fi_rc_pingpong -n 5 -p verbs: Pass
> fi_rc_pingpong -n 5 -e -p verbs: Pass
> # --------------------------------------------------------------
> # Total Pass 30
> # Total Notrun 29
> # Total Fail 0
> # Percentage of Pass 100
> # --------------------------------------------------------------
>
>
>
> # --------------------------------------------------------------
> # ping pong test from libfabric on gni
> # --------------------------------------------------------------
>
> ./frun.sh /users/biddisco/apps/libfabric/bin/fi_pingpong
> running /users/biddisco/apps/libfabric/bin/fi_pingpong on nid00[421,425]
> nid00421 is 148.187.33.168
> Generated command is srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/libfabric/bin/fi_pingpong -p gni
> 1 /users/biddisco/apps/libfabric/bin/fi_pingpong -p gni nid00421
>
> 1: bytes  #sent  #ack   total  time   MB/sec   usec/xfer  Mxfers/sec
> 1: 64     10k    =10k   1.2m   0.05s  25.52    2.51       0.40
> 1: 256    10k    =10k   4.8m   0.06s  85.54    2.99       0.33
> 1: 1k     10k    =10k   19m    0.05s  454.41   2.25       0.44
> 1: 4k     10k    =10k   78m    0.07s  1254.25  3.27       0.31
> 1: 64k    1k     =1k    125m   0.04s  3304.97  19.83      0.05
> 1: 1m     100    =100   200m   0.05s  4422.23  237.12     0.00
> 0: bytes  #sent  #ack   total  time   MB/sec   usec/xfer  Mxfers/sec
> 0: 64     10k    =10k   1.2m   0.05s  25.52    2.51       0.40
> 0: 256    10k    =10k   4.8m   0.06s  85.53    2.99       0.33
> 0: 1k     10k    =10k   19m    0.05s  454.35   2.25       0.44
> 0: 4k     10k    =10k   78m    0.07s  1254.17  3.27       0.31
> 0: 64k    1k     =1k    125m   0.04s  3304.56  19.83      0.05
> 0: 1m     100    =100   200m   0.05s  4421.01  237.18     0.00
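
The return codes in the quoted report (ret=12, ret=22, ret=-61) line up with standard Linux errno values, and the messages match the usual strerror text. libfabric calls generally return the negated code, which would explain why the -vvv output shows -12 and -61 while fi_mr_test prints positive numbers. A short Python sketch for decoding them, assuming a Linux host (errno numbering differs on other systems); the helper name is hypothetical:

```python
import errno
import os


def decode_ret(ret: int) -> tuple:
    """Map a return code (either sign) to its errno name and message."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# Return codes quoted in the failing gni runs
for ret in (12, 22, -61, -12):
    name, message = decode_ret(ret)
    print(f"ret={ret}: {name} ({message})")
```

On Linux these should come out as ENOMEM, EINVAL and ENODATA, i.e. the same three conditions the test output already spells out in words.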