[libfabric-users] Fabtest questions
Stefan Oesterreich
soesterreich at iol.unh.edu
Thu Mar 29 10:20:27 PDT 2018
Hello,
My name is Stefan Oesterreich and I am the Systems Administrator of the
UNH-IOL OFA cluster. The OFIWG would like to include running fabtests as
part of our OFED and vendor device/firmware validation testing. I have very
limited knowledge of fabtests, so I am looking for some guidance on a
comprehensive test command. We test InfiniBand, iWARP, and RoCE, and we are
looking to test the verbs provider. The command I have thus far is as
follows:
runfabtests.sh -t all -g $server_transport_ip_addr -s $server_transport_hostname -c $client_transport_hostname verbs $server_mgmt_hostname $client_mgmt_hostname
Here is a filled in example:
runfabtests.sh -t all -g 10.1.0.3 -s titan-ib.ofa -c phoebe-ib.ofa verbs titan.ofa phoebe.ofa
When I run the above command on one of my InfiniBand nodes, I get the
following output:
# Test Result
# --------------------------------------------------------------
fi_getinfo_test -p "verbs": Pass
fi_av_test -g 10.1.0.3 -n 1 -s titan-ib.ofa -p "verbs": Pass
fi_dom_test -n 2 -p "verbs": Pass
fi_eq_test -p "verbs": Pass
fi_cq_test -p "verbs": Pass
fi_mr_test -p "verbs": Pass
fi_cntr_test -p "verbs": Pass
fi_dgram g00n13s -p "verbs": Pass
fi_rdm g00n13s -p "verbs": Pass
fi_msg g00n13s -p "verbs": Pass
fi_cm_data -p "verbs": Pass
fi_cq_data -p "verbs": Fail
fi_dgram -p "verbs": Notrun
fi_dgram_waitset -p "verbs": Notrun
fi_msg -p "verbs": Pass
fi_msg_epoll -p "verbs": Pass
fi_msg_sockets -p "verbs": Pass
fi_poll -t queue -p "verbs": Notrun
fi_poll -t counter -p "verbs": Notrun
fi_rdm -p "verbs": Pass
fi_rdm_rma_simple -p "verbs": Notrun
fi_rdm_rma_trigger -p "verbs": Notrun
fi_shared_ctx -p "verbs": Notrun
fi_shared_ctx --no-tx-shared-ctx -p "verbs": Notrun
fi_shared_ctx --no-rx-shared-ctx -p "verbs": Notrun
fi_shared_ctx -e msg -p "verbs": Notrun
fi_shared_ctx -e msg --no-tx-shared-ctx -p "verbs": Pass
fi_shared_ctx -e msg --no-rx-shared-ctx -p "verbs": Notrun
fi_shared_ctx -e dgram -p "verbs": Notrun
fi_shared_ctx -e dgram --no-tx-shared-ctx -p "verbs": Notrun
fi_shared_ctx -e dgram --no-rx-shared-ctx -p "verbs": Notrun
fi_rdm_tagged_peek -p "verbs": Pass
fi_scalable_ep -p "verbs": Notrun
fi_cmatose -p "verbs": Pass
fi_rdm_shared_av -p "verbs": Notrun
fi_multi_mr -e msg -V -p "verbs": Notrun
fi_multi_mr -e rdm -V -p "verbs": Notrun
fi_recv_cancel -e rdm -V -p "verbs": Notrun
fi_unexpected_msg -e msg -i 10 -p "verbs": Notrun
fi_unexpected_msg -e rdm -i 10 -p "verbs": Notrun
fi_unexpected_msg -e dgram -i 10 -p "verbs": Notrun
fi_unexpected_msg -e msg -S -i 10 -p "verbs": Notrun
fi_unexpected_msg -e rdm -S -i 10 -p "verbs": Notrun
fi_unexpected_msg -e dgram -S -i 10 -p "verbs": Notrun
fi_msg_pingpong -p "verbs": Pass
fi_msg_pingpong -v -p "verbs": Pass
fi_msg_pingpong -k -p "verbs": Notrun
fi_msg_pingpong -k -v -p "verbs": Notrun
fi_msg_bw -p "verbs": Pass
fi_msg_bw -v -p "verbs": Pass
fi_rma_bw -e msg -o write -p "verbs": Pass
fi_rma_bw -e msg -o read -p "verbs": Pass
fi_rma_bw -e msg -o writedata -p "verbs": Pass
fi_rma_bw -e rdm -o write -p "verbs": Pass
fi_rma_bw -e rdm -o read -p "verbs": Pass
fi_rma_bw -e rdm -o writedata -p "verbs": Fail
fi_msg_rma -o write -p "verbs": Pass
fi_msg_rma -o read -p "verbs": Pass
fi_msg_rma -o writedata -p "verbs": Pass
fi_msg_stream -p "verbs": Pass
fi_rdm_atomic -o all -I 1000 -p "verbs": Notrun
fi_rdm_cntr_pingpong -p "verbs": Notrun
fi_rdm_multi_recv -p "verbs": Fail
fi_rdm_pingpong -p "verbs": Pass
fi_rdm_pingpong -v -p "verbs": Pass
fi_rdm_pingpong -k -p "verbs": Notrun
fi_rdm_pingpong -k -v -p "verbs": Notrun
fi_rdm_rma -o write -p "verbs": Fail
fi_rdm_rma -o read -p "verbs": Fail
fi_rdm_rma -o writedata -p "verbs": Fail
fi_rdm_tagged_pingpong -p "verbs": Pass
fi_rdm_tagged_pingpong -v -p "verbs": Pass
fi_rdm_tagged_bw -p "verbs": Pass
fi_rdm_tagged_bw -v -p "verbs": Pass
fi_dgram_pingpong -p "verbs": Notrun
fi_dgram_pingpong -k -p "verbs": Notrun
fi_rc_pingpong -p "verbs": Pass
fi_ubertest: Server returns 124,
client returns 124
fi_ubertest: Fail [/]
# --------------------------------------------------------------
# Total Pass 38
# Total Notrun 33
# Total Fail 7
# Percentage of Pass 84
# --------------------------------------------------------------
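As a cross-check on the summary above, here is a minimal sketch for tallying
results from saved output. It assumes every result line ends in ": Pass",
": Notrun", or ": Fail" (optionally followed by extra text such as " [/]").
The percentage is computed as Pass / (Pass + Fail), which is only an inference
from the totals above (38 of the 45 tests that ran gives 84), not something
confirmed from the script itself. The function name tally_results and the file
name results.txt are hypothetical.

```shell
#!/bin/sh
# Tally Pass/Notrun/Fail lines from runfabtests.sh-style output on stdin.
# Assumption: the status follows ": " and is either the last word on the
# line or is followed by a space (as in "fi_ubertest: Fail [/]").
tally_results() {
    awk '
        /: Pass( |$)/   { pass++ }
        /: Notrun( |$)/ { notrun++ }
        /: Fail( |$)/   { fail++ }
        END {
            ran = pass + fail
            # Integer percentage of the tests that actually ran and passed.
            pct = (ran > 0) ? int(pass * 100 / ran) : 0
            printf "pass=%d notrun=%d fail=%d pct=%d\n", pass, notrun, fail, pct
        }'
}

# Usage (results.txt is a hypothetical capture of the output above):
#   tally_results < results.txt
```

The "# Total ..." summary lines are not counted, because they do not contain
": Pass", ": Notrun", or ": Fail".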
My questions are:
- Is the above command comprehensive enough for all 3 transports (IB,
IW, RoCE)?
- What test mode should I be using
(all, quick, unit, simple, standard, short, complex)? This is my first time
running through this testing, so I don't know whether "all" is appropriate
here. Time is also a consideration: one server-client pair takes about 13
minutes to complete, and we have 6 nodes, so there are quite a few
permutations.
- What makes a test result "Notrun" vs "Fail"? When I use -vv to see
output, I am seeing a lot of "fi_getinfo(): common/shared.c:540, ret=-61
(No data available)" and "fi_poll_open(): simple/poll.c:55, ret=-38
(Function not implemented)". Is this normal?
- I am also seeing a lot of "Killed by signal 15", which I believe means
that the timeout was hit and the run was killed. Should I be increasing my
timeout? I would expect the default timeout to be good enough, but I am
unsure.
- As you can see from the output above, there are a few failures. Does this
indicate a bug in fabtests or in the OFED/vendor drivers, or simply that I
am not running the correct fabtests command?
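To put the timing concern from the second question into numbers, here is a
rough dry-run sketch. It assumes every node acts as server for every other
node (so ordered pairs), that pairs run one after another, and that each pair
takes the 13 minutes observed above. Node names other than titan and phoebe
are hypothetical, as is the SERVER_IP placeholder; the command only echoes
what would be run.

```shell
#!/bin/sh
# Dry run: print one runfabtests.sh command per ordered server/client pair
# and estimate the total wall-clock time for a sequential sweep.
# Assumption: node3..node6 are hypothetical names; the "-ib.ofa" and ".ofa"
# suffixes follow the filled-in example earlier in this message.
NODES="titan phoebe node3 node4 node5 node6"
MINUTES_PER_PAIR=13   # observed for one pair with -t all

plan_runs() {
    pairs=0
    for server in $NODES; do
        for client in $NODES; do
            # A node is never paired with itself.
            [ "$server" = "$client" ] && continue
            # SERVER_IP is a placeholder for the server's transport address.
            echo "runfabtests.sh -t all -g SERVER_IP -s ${server}-ib.ofa -c ${client}-ib.ofa verbs ${server}.ofa ${client}.ofa"
            pairs=$((pairs + 1))
        done
    done
    echo "pairs=$pairs estimated_minutes=$((pairs * MINUTES_PER_PAIR))"
}

plan_runs
```

With 6 nodes this gives 6 x 5 = 30 ordered pairs, or about 390 minutes per
transport under these assumptions; testing each pair in only one direction
would halve that.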
Thanks in advance, I really appreciate any assistance that you guys can
provide.
--
Cheers,
Stefan Oesterreich
High Performance Computing
UNH InterOperability Laboratory