[ofiwg] [libfabric-users] Fabtest questions

Ilango, Arun arun.ilango at intel.com
Thu Mar 29 12:20:01 PDT 2018


Hi Stefan,

What version of libfabric and fabtests are you using? Can you try the test with libfabric v1.6 and fabtests v1.6 or upstream?

> *	What test mode should I be using
> (all,quick,unit,simple,standard,short,complex)? This is the first time 
> running through this testing, so I don't know if "all" is appropriate 
> here. Time is also a consideration here; it seems to take about 13 
> minutes to complete one server-client pair, and we have 6 nodes, so 
> there are quite a few permutations.

>Using 'all' versus 'quick' adds in fi_ubertest.  This test is fairly comprehensive.  It is capable of testing thousands of permutations
> and can take a really long time to run.  If time is a concern, I would use the quick option, which is the default.

The quick option also reduces the number of iterations in the pingpong, bandwidth, and streaming tests to 5.

>For verbs, we focus testing on the 'msg' endpoints.  I would not expect to see any failures there.  The 'dgram' endpoint support 
>is limited in its implementation.  'Rdm' endpoints are being removed in favor of using the 'rxm' provider over verbs

From libfabric 1.6 onwards, running RDM endpoint tests on verbs should pick the rxm code path automatically and there shouldn't be any failures.

Thanks,
Arun.

-----Original Message-----
From: Hefty, Sean 
Sent: Thursday, March 29, 2018 10:47 AM
To: Stefan Oesterreich <soesterreich at iol.unh.edu>; libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
Cc: Ilango, Arun <arun.ilango at intel.com>
Subject: RE: [libfabric-users] Fabtest questions

copying ofiwg and the verbs maintainer.

> My name is Stefan Oesterreich and I am the Systems Administrator of 
> the UNH-IOL OFA cluster. The OFIWG would like to include running 
> fabtest as part of our OFED and vendor device/firmware validation 
> testing. I have very limited knowledge of fabtest, so I am looking for 
> some guidance on a comprehensive test command. We test InfiniBand, 
> iWARP, and RoCE, and we are looking to test the verbs provider. The 
> command I have thus far is as follows:
> 
> runfabtests.sh -t all -g $server_transport_ip_addr -s 
> $server_transport_hostname -c $client_transport_hostname verbs 
> $server_mgmt_hostname $client_mgmt_hostname
> 
> 
> Here is a filled in example:
> runfabtests.sh -t all -g 10.1.0.3 -s titan-ib.ofa -c phoebe-ib.ofa 
> verbs titan.ofa phoebe.ofa
> 
> 
> When I run the above command on one of my InfiniBand nodes I get the 
> following output:
> 
> # Test                                                  Result
> # --------------------------------------------------------------
> fi_getinfo_test -p "verbs":                             Pass
> fi_av_test -g 10.1.0.3 -n 1 -s titan-ib.ofa -p "verbs":      Pass
> fi_dom_test -n 2 -p "verbs":                            Pass
> fi_eq_test -p "verbs":                                  Pass
> fi_cq_test -p "verbs":                                  Pass
> fi_mr_test -p "verbs":                                  Pass
> fi_cntr_test -p "verbs":                                Pass
> fi_dgram g00n13s -p "verbs":                            Pass
> fi_rdm g00n13s -p "verbs":                              Pass
> fi_msg g00n13s -p "verbs":                              Pass
> fi_cm_data -p "verbs":                                  Pass
> fi_cq_data -p "verbs":                                  Fail
> fi_dgram -p "verbs":                                  Notrun
> fi_dgram_waitset -p "verbs":                          Notrun
> fi_msg -p "verbs":                                      Pass
> fi_msg_epoll -p "verbs":                                Pass
> fi_msg_sockets -p "verbs":                              Pass
> fi_poll -t queue -p "verbs":                          Notrun
> fi_poll -t counter -p "verbs":                        Notrun
> fi_rdm -p "verbs":                                      Pass
> fi_rdm_rma_simple -p "verbs":                         Notrun
> fi_rdm_rma_trigger -p "verbs":                        Notrun
> fi_shared_ctx -p "verbs":                             Notrun
> fi_shared_ctx --no-tx-shared-ctx -p "verbs":          Notrun
> fi_shared_ctx --no-rx-shared-ctx -p "verbs":          Notrun
> fi_shared_ctx -e msg -p "verbs":                      Notrun
> fi_shared_ctx -e msg --no-tx-shared-ctx -p "verbs":      Pass
> fi_shared_ctx -e msg --no-rx-shared-ctx -p "verbs":    Notrun
> fi_shared_ctx -e dgram -p "verbs":                    Notrun
> fi_shared_ctx -e dgram --no-tx-shared-ctx -p "verbs":    Notrun
> fi_shared_ctx -e dgram --no-rx-shared-ctx -p "verbs":    Notrun
> fi_rdm_tagged_peek -p "verbs":                          Pass
> fi_scalable_ep -p "verbs":                            Notrun
> fi_cmatose -p "verbs":                                  Pass
> fi_rdm_shared_av -p "verbs":                          Notrun
> fi_multi_mr -e msg -V -p "verbs":                     Notrun
> fi_multi_mr -e rdm -V -p "verbs":                     Notrun
> fi_recv_cancel -e rdm -V -p "verbs":                  Notrun
> fi_unexpected_msg -e msg -i 10 -p "verbs":            Notrun
> fi_unexpected_msg -e rdm -i 10 -p "verbs":            Notrun
> fi_unexpected_msg -e dgram -i 10 -p "verbs":          Notrun
> fi_unexpected_msg -e msg -S -i 10 -p "verbs":         Notrun
> fi_unexpected_msg -e rdm -S -i 10 -p "verbs":         Notrun
> fi_unexpected_msg -e dgram -S -i 10 -p "verbs":       Notrun
> fi_msg_pingpong -p "verbs":                             Pass
> fi_msg_pingpong -v -p "verbs":                          Pass
> fi_msg_pingpong -k -p "verbs":                        Notrun
> fi_msg_pingpong -k -v -p "verbs":                     Notrun
> fi_msg_bw -p "verbs":                                   Pass
> fi_msg_bw -v -p "verbs":                                Pass
> fi_rma_bw -e msg -o write -p "verbs":                   Pass
> fi_rma_bw -e msg -o read -p "verbs":                    Pass
> fi_rma_bw -e msg -o writedata -p "verbs":               Pass
> fi_rma_bw -e rdm -o write -p "verbs":                   Pass
> fi_rma_bw -e rdm -o read -p "verbs":                    Pass
> fi_rma_bw -e rdm -o writedata -p "verbs":               Fail
> fi_msg_rma -o write -p "verbs":                         Pass
> fi_msg_rma -o read -p "verbs":                          Pass
> fi_msg_rma -o writedata -p "verbs":                     Pass
> fi_msg_stream -p "verbs":                               Pass
> fi_rdm_atomic -o all -I 1000 -p "verbs":              Notrun
> fi_rdm_cntr_pingpong -p "verbs":                      Notrun
> fi_rdm_multi_recv -p "verbs":                           Fail
> fi_rdm_pingpong -p "verbs":                             Pass
> fi_rdm_pingpong -v -p "verbs":                          Pass
> fi_rdm_pingpong -k -p "verbs":                        Notrun
> fi_rdm_pingpong -k -v -p "verbs":                     Notrun
> fi_rdm_rma -o write -p "verbs":                         Fail
> fi_rdm_rma -o read -p "verbs":                          Fail
> fi_rdm_rma -o writedata -p "verbs":                     Fail
> fi_rdm_tagged_pingpong -p "verbs":                      Pass
> fi_rdm_tagged_pingpong -v -p "verbs":                   Pass
> fi_rdm_tagged_bw -p "verbs":                            Pass
> fi_rdm_tagged_bw -v -p "verbs":                         Pass
> fi_dgram_pingpong -p "verbs":                         Notrun
> fi_dgram_pingpong -k -p "verbs":                      Notrun
> fi_rc_pingpong -p "verbs":                              Pass
> fi_ubertest:                                      Server returns 124, client returns 124
> fi_ubertest:                                        Fail [/]
> # --------------------------------------------------------------
> # Total Pass                                                38
> # Total Notrun                                              33
> # Total Fail                                                 7
> # Percentage of Pass                                        84
> # --------------------------------------------------------------
> 
> 
> 
> My questions are:
> 
> 
> *	Is the above command comprehensive enough for all 3 transports
> (IB, IW, RoCE)?

All transports should be testable using the same configuration.

> *	What test mode should I be using
> (all,quick,unit,simple,standard,short,complex)? This is the first time 
> running through this testing, so I don't know if "all" is appropriate 
> here. Time is also a consideration here; it seems to take about 13 
> minutes to complete one server-client pair, and we have 6 nodes, so 
> there are quite a few permutations.

Using 'all' versus 'quick' adds in fi_ubertest.  This test is fairly comprehensive.  It is capable of testing thousands of permutations and can take a really long time to run.  If time is a concern, I would use the quick option, which is the default.

You can also speed up testing by providing an 'exclude' file.  This will allow skipping the Notrun tests, which add a couple of seconds per test.  See e.g. test_configs/verbs/verbs.exclude. 
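
To put the timing concern in numbers, here is a rough sketch that counts server/client pairs for six nodes and multiplies by the 13 minutes per pair quoted in the question. The node names and the list_pairs helper are placeholders, and it assumes every node is tested as both server and client against every other node.

```shell
#!/bin/sh
# Sketch only: count ordered server/client pairs and estimate total run time.
# Node names are placeholders; 13 minutes per pair is the figure from the question.
NODES="node1 node2 node3 node4 node5 node6"
MINUTES_PER_PAIR=13

# list_pairs is a hypothetical helper: prints one "server client" pair per line,
# skipping the case where a node would be paired with itself.
list_pairs() {
    for server in $1; do
        for client in $1; do
            if [ "$server" != "$client" ]; then
                echo "$server $client"
            fi
        done
    done
}

# The outer $(( )) strips any padding that wc -l may print.
pairs=$(( $(list_pairs "$NODES" | wc -l) ))
echo "ordered pairs: $pairs"
echo "estimated minutes: $((pairs * MINUTES_PER_PAIR))"
```

With six nodes this gives 30 ordered pairs and 390 minutes. If each pair only needs to be run in one direction, the count halves to 15 pairs.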

> *	What makes a test result "Notrun" vs "Fail"? When I use -vv to
> see output, I am seeing a lot of "fi_getinfo(): common/shared.c:540,
> ret=-61 (No data available)" and "fi_poll_open(): simple/poll.c:55,
> ret=-38 (Function not implemented)", is this normal?

Notrun indicates that the selected provider does not support the options required of the test.  For example, the verbs provider does not implement counters, so any test that uses counters will not work.  There are specific failure codes that the script checks for in these cases and reports 'notrun' when it detects them.
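
As an illustration of that kind of check, here is a minimal sketch, not the actual runfabtests.sh logic. It assumes a test exits with the positive errno value, so the ret=-61 and ret=-38 messages in the question would show up as exit statuses 61 (ENODATA) and 38 (ENOSYS) on Linux. The classify_result helper is hypothetical.

```shell
#!/bin/sh
# Sketch only: map a test's exit status to a summary label.
# Assumption: tests exit with the positive errno value (Linux numbering).
ENODATA=61   # "No data available"
ENOSYS=38    # "Function not implemented"

# classify_result is a hypothetical helper, not a function from runfabtests.sh.
classify_result() {
    case "$1" in
        0)                    echo "Pass" ;;
        "$ENODATA"|"$ENOSYS") echo "Notrun" ;;
        *)                    echo "Fail" ;;
    esac
}

classify_result 0     # Pass
classify_result 61    # Notrun
classify_result 38    # Notrun
classify_result 124   # Fail (124 is what GNU timeout reports for a timed-out command)
```

Under this scheme, any status other than success or the two "unsupported" codes counts as a failure, which matches the fi_ubertest line in the output above (returns 124, reported as Fail).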

Fail indicates that the test detected some other sort of error from the provider that was not expected.  From the output above:

	fi_cq_data -p "verbs":                                  Fail

This is a test I would expect to pass.  With failures, it's usually best to re-run the test that failed directly and see if we can get more information as to why a failure occurred.
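
One way to collect the candidates for a direct re-run is to pull the Fail lines out of a saved copy of the summary. The sketch below assumes the summary format shown in the output above; list_failed is a hypothetical helper, and the sample lines are copied from that output.

```shell
#!/bin/sh
# Sketch only: list the test commands marked "Fail" in a saved summary.
# list_failed is a hypothetical helper; it reads summary text on stdin.
list_failed() {
    # First expression strips a leading "> " quote marker if present.
    # Second keeps lines whose result column is "Fail" (optionally followed
    # by a bracketed marker) and prints only the command part before the colon.
    sed -n -E \
        -e 's/^> *//' \
        -e 's/^(fi_[^:]*):[[:space:]]+Fail([[:space:]]*\[.*\])?[[:space:]]*$/\1/p'
}

list_failed <<'EOF'
fi_cm_data -p "verbs":                                  Pass
fi_cq_data -p "verbs":                                  Fail
fi_dgram -p "verbs":                                  Notrun
fi_rdm_rma -o write -p "verbs":                         Fail
fi_ubertest:                                        Fail [/]
EOF
```

For the sample input this prints the fi_cq_data, fi_rdm_rma -o write, and fi_ubertest entries, each of which can then be run by hand with verbose output to see where it fails.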

> *	I am also seeing a lot of "Killed by signal 15", which I
> believe means that the timeout was hit and the run was killed.
> Should I be increasing my timeout? I would expect the default timeout 
> to be good enough, but I am unsure.

The default timeout should be sufficient in most cases.

> *	As you can see from the output above, there are a few fails.
> Does this indicate a bug in fabtests or OFED/vendors drivers or simply 
> that I am not running the correct fabtest command?

Yes.  :)  Any failures need to be investigated.  For verbs, we focus testing on the 'msg' endpoints.  I would not expect to see any failures there.  The 'dgram' endpoint support is limited in its implementation.  'Rdm' endpoints are being removed in favor of using the 'rxm' provider over verbs.  Using the exclude file may remove these from testing.

- Sean

