[ofiwg] [libfabric-users] Fabtest questions

Stefan Oesterreich soesterreich at iol.unh.edu
Fri Mar 30 14:06:27 PDT 2018


Hi Sean, Arun,

Thanks for that info, it helps a lot! I was able to install libfabric and
fabtests from the git repo into /opt, so I am running the latest. I am
still seeing a couple of failures, but I want to start with one in particular,
because this test makes it impossible to run to completion. Once we get
this sorted, I'll be able to send you a nice clean log of the failures.
The command I am running is:

/opt/fabtests/bin/runfabtests.sh -v -p /opt/fabtests/bin -t quick \
    -g 10.2.0.94 -s enceladus-iw.ofa -c erriapus-iw.ofa \
    -e /opt/fabtests/share/fabtests/test_configs/verbs/verbs.exclude \
    verbs enceladus.ofa erriapus.ofa

which runs through some tests (all but one pass), and then gets to the test
seen below. It continues to repeat the server_stdout lines forever. Do you
have any thoughts on what is happening?

- name:   fi_poll -t counter -p "verbs"
  result: Fail
  time:   91
  server_cmd: /opt/fabtests/bin/fi_poll -t counter -p "verbs" -s enceladus-iw.ofa
  server_stdout: |
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
    fi_cntr_wait(): common/shared.c:1976, ret=-38 (Function not implemented)
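
When a test repeats the same error like this, collapsing the log into counts makes it easier to report. The following is a minimal sketch, not part of fabtests: the function name count_ret_errors is hypothetical, and it assumes the log lines follow the "func(): file:line, ret=-N" shape shown above.

```shell
# Count how often each distinct "func(): file:line, ret=-N" message
# appears in the given log files (or on stdin when no file is given).
count_ret_errors() {
    grep -oE '[A-Za-z_]+\(\): [^ ,]+, ret=-[0-9]+' "$@" \
        | sort | uniq -c | sort -rn
}

# Example (hypothetical log file name):
#   count_ret_errors fi_poll_server.log
```

Each output line is a count followed by the message, so a single message with a very large count points at a loop like the one above.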


Cheers,
Stefan

On Thu, Mar 29, 2018 at 3:20 PM, Ilango, Arun <arun.ilango at intel.com> wrote:

> Hi Stefan,
>
> What version of libfabric and fabtests are you using? Can you try the test
> with libfabric v1.6 and fabtests v1.6 or upstream?
>
> > *     What test mode should I be using
> > (all,quick,unit,simple,standard,short,complex)? This is the first time
> > running through this testing, so I don't know if "all" is appropriate
> > here. Time is also a consideration: it seems to take about 13
> > minutes to complete one server-client pair, and we have 6 nodes, so
> > there are quite a few permutations.
>
> > Using 'all' versus 'quick' adds in fi_ubertest.  This test is fairly
> > comprehensive.  It is capable of testing thousands of permutations
> > and can take a really long time to run.  If time is a concern, I would
> > use the quick option, which is the default.
>
> 'quick' also reduces the number of iterations in the pingpong, bandwidth,
> and streaming tests to 5.
>
> > For verbs, we focus testing on the 'msg' endpoints.  I would not expect
> > to see any failures there.  The 'dgram' endpoint support
> > is limited in its implementation.  'Rdm' endpoints are being removed in
> > favor of using the 'rxm' provider over verbs.
>
> From libfabric 1.6 onwards, running RDM endpoint tests on verbs should
> pick the rxm code path automatically and there shouldn't be any failures.
>
> Thanks,
> Arun.
>
> -----Original Message-----
> From: Hefty, Sean
> Sent: Thursday, March 29, 2018 10:47 AM
> To: Stefan Oesterreich <soesterreich at iol.unh.edu>;
> libfabric-users at lists.openfabrics.org; ofiwg at lists.openfabrics.org
> Cc: Ilango, Arun <arun.ilango at intel.com>
> Subject: RE: [libfabric-users] Fabtest questions
>
> copying ofiwg and the verbs maintainer.
>
> > My name is Stefan Oesterreich and I am the Systems Administrator of
> > the UNH-IOL OFA cluster. The OFIWG would like to include running
> > fabtest as part of our OFED and vendor device/firmware validation
> > testing. I have very limited knowledge of fabtest, so I am looking for
> > some guidance on a comprehensive test command. We test Infiniband,
> > iWARP, and RoCE, and we are looking to test the verbs provider. The
> > command I have thus far is as follows:
> >
> > runfabtests.sh -t all -g $server_transport_ip_addr -s
> > $server_transport_hostname -c $client_transport_hostname verbs
> > $server_mgmt_hostname $client_mgmt_hostname
> >
> >
> > Here is a filled in example:
> > runfabtests.sh -t all -g 10.1.0.3 -s titan-ib.ofa -c phoebe-ib.ofa
> > verbs titan.ofa phoebe.ofa
> >
> >
> > When I run the above command on one of my Infiniband nodes I get the
> > following output:
> >
> > # Test                                                  Result
> > # --------------------------------------------------------------
> > fi_getinfo_test -p "verbs":                             Pass
> > fi_av_test -g 10.1.0.3 -n 1 -s titan-ib.ofa -p "verbs":      Pass
> > fi_dom_test -n 2 -p "verbs":                            Pass
> > fi_eq_test -p "verbs":                                  Pass
> > fi_cq_test -p "verbs":                                  Pass
> > fi_mr_test -p "verbs":                                  Pass
> > fi_cntr_test -p "verbs":                                Pass
> > fi_dgram g00n13s -p "verbs":                            Pass
> > fi_rdm g00n13s -p "verbs":                              Pass
> > fi_msg g00n13s -p "verbs":                              Pass
> > fi_cm_data -p "verbs":                                  Pass
> > fi_cq_data -p "verbs":                                  Fail
> > fi_dgram -p "verbs":                                  Notrun
> > fi_dgram_waitset -p "verbs":                          Notrun
> > fi_msg -p "verbs":                                      Pass
> > fi_msg_epoll -p "verbs":                                Pass
> > fi_msg_sockets -p "verbs":                              Pass
> > fi_poll -t queue -p "verbs":                          Notrun
> > fi_poll -t counter -p "verbs":                        Notrun
> > fi_rdm -p "verbs":                                      Pass
> > fi_rdm_rma_simple -p "verbs":                         Notrun
> > fi_rdm_rma_trigger -p "verbs":                        Notrun
> > fi_shared_ctx -p "verbs":                             Notrun
> > fi_shared_ctx --no-tx-shared-ctx -p "verbs":          Notrun
> > fi_shared_ctx --no-rx-shared-ctx -p "verbs":          Notrun
> > fi_shared_ctx -e msg -p "verbs":                      Notrun
> > fi_shared_ctx -e msg --no-tx-shared-ctx -p "verbs":      Pass
> > fi_shared_ctx -e msg --no-rx-shared-ctx -p "verbs":    Notrun
> > fi_shared_ctx -e dgram -p "verbs":                    Notrun
> > fi_shared_ctx -e dgram --no-tx-shared-ctx -p "verbs":    Notrun
> > fi_shared_ctx -e dgram --no-rx-shared-ctx -p "verbs":    Notrun
> > fi_rdm_tagged_peek -p "verbs":                          Pass
> > fi_scalable_ep -p "verbs":                            Notrun
> > fi_cmatose -p "verbs":                                  Pass
> > fi_rdm_shared_av -p "verbs":                          Notrun
> > fi_multi_mr -e msg -V -p "verbs":                     Notrun
> > fi_multi_mr -e rdm -V -p "verbs":                     Notrun
> > fi_recv_cancel -e rdm -V -p "verbs":                  Notrun
> > fi_unexpected_msg -e msg -i 10 -p "verbs":            Notrun
> > fi_unexpected_msg -e rdm -i 10 -p "verbs":            Notrun
> > fi_unexpected_msg -e dgram -i 10 -p "verbs":          Notrun
> > fi_unexpected_msg -e msg -S -i 10 -p "verbs":         Notrun
> > fi_unexpected_msg -e rdm -S -i 10 -p "verbs":         Notrun
> > fi_unexpected_msg -e dgram -S -i 10 -p "verbs":       Notrun
> > fi_msg_pingpong -p "verbs":                             Pass
> > fi_msg_pingpong -v -p "verbs":                          Pass
> > fi_msg_pingpong -k -p "verbs":                        Notrun
> > fi_msg_pingpong -k -v -p "verbs":                     Notrun
> > fi_msg_bw -p "verbs":                                   Pass
> > fi_msg_bw -v -p "verbs":                                Pass
> > fi_rma_bw -e msg -o write -p "verbs":                   Pass
> > fi_rma_bw -e msg -o read -p "verbs":                    Pass
> > fi_rma_bw -e msg -o writedata -p "verbs":               Pass
> > fi_rma_bw -e rdm -o write -p "verbs":                   Pass
> > fi_rma_bw -e rdm -o read -p "verbs":                    Pass
> > fi_rma_bw -e rdm -o writedata -p "verbs":               Fail
> > fi_msg_rma -o write -p "verbs":                         Pass
> > fi_msg_rma -o read -p "verbs":                          Pass
> > fi_msg_rma -o writedata -p "verbs":                     Pass
> > fi_msg_stream -p "verbs":                               Pass
> > fi_rdm_atomic -o all -I 1000 -p "verbs":              Notrun
> > fi_rdm_cntr_pingpong -p "verbs":                      Notrun
> > fi_rdm_multi_recv -p "verbs":                           Fail
> > fi_rdm_pingpong -p "verbs":                             Pass
> > fi_rdm_pingpong -v -p "verbs":                          Pass
> > fi_rdm_pingpong -k -p "verbs":                        Notrun
> > fi_rdm_pingpong -k -v -p "verbs":                     Notrun
> > fi_rdm_rma -o write -p "verbs":                         Fail
> > fi_rdm_rma -o read -p "verbs":                          Fail
> > fi_rdm_rma -o writedata -p "verbs":                     Fail
> > fi_rdm_tagged_pingpong -p "verbs":                      Pass
> > fi_rdm_tagged_pingpong -v -p "verbs":                   Pass
> > fi_rdm_tagged_bw -p "verbs":                            Pass
> > fi_rdm_tagged_bw -v -p "verbs":                         Pass
> > fi_dgram_pingpong -p "verbs":                         Notrun
> > fi_dgram_pingpong -k -p "verbs":                      Notrun
> > fi_rc_pingpong -p "verbs":                              Pass
> > fi_ubertest:               Server returns 124, client returns 124
> > fi_ubertest:                                        Fail [/]
> > # --------------------------------------------------------------
> > # Total Pass                                                38
> > # Total Notrun                                              33
> > # Total Fail                                                 7
> > # Percentage of Pass                                        84
> > # --------------------------------------------------------------
> >
> >
> >
> > My questions are:
> >
> >
> > *     Is the above command comprehensive enough for all 3 transports
> > (IB, IW, RoCE)?
>
> All transports should be testable using the same configuration.
>
> > *     What test mode should I be using
> > (all,quick,unit,simple,standard,short,complex)? This is the first time
> > running through this testing, so I don't know if "all" is appropriate
> > here. Time is also a consideration: it seems to take about 13
> > minutes to complete one server-client pair, and we have 6 nodes, so
> > there are quite a few permutations.
>
> Using 'all' versus 'quick' adds in fi_ubertest.  This test is fairly
> comprehensive.  It is capable of testing thousands of permutations and can
> take a really long time to run.  If time is a concern, I would use the
> quick option, which is the default.
>
> You can also speed up testing by providing an 'exclude' file.  This will
> allow skipping the Notrun tests, which add a couple of seconds per test.
> See e.g. test_configs/verbs/verbs.exclude.
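As a starting point for deciding what to exclude, the Notrun entries can be pulled out of a saved summary. This is only a sketch: the function name list_notrun is hypothetical, and it assumes the summary lines look like the table quoted earlier in this thread (test command, a colon, then the result). It does not write an exclude file; the output is a list to review by hand.

```shell
# Print the test command of every summary line whose result is "Notrun".
# Reads the named files, or stdin when no file is given.
list_notrun() {
    sed -n 's/^\(fi_.*\):[[:space:]]*Notrun[[:space:]]*$/\1/p' "$@"
}

# Example (hypothetical file holding a saved summary):
#   list_notrun summary.txt
```

Some tests are Notrun only for certain options (for example fi_shared_ctx above), so the full command is printed rather than just the binary name.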
>
> > *     What makes a test result "Notrun" vs "Fail"? When I use -vv to
> > see output, I am seeing a lot of "fi_getinfo(): common/shared.c:540,
> > ret=-61 (No data available)" and "fi_poll_open(): simple/poll.c:55,
> > ret=-38 (Function not implemented)", is this normal?
>
> Notrun indicates that the selected provider does not support the options
> required of the test.  For example, the verbs provider does not implement
> counters, so any test that uses counters will not work.  There are specific
> failure codes that the script checks for in these cases and reports
> 'notrun' when it detects them.
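The distinction described here can be sketched as a small classifier. This is an illustration, not the code in runfabtests.sh: the function name classify_result is hypothetical, and it assumes a test's exit status carries the positive form of the return values quoted in this thread (61 for "No data available", 38 for "Function not implemented"). The script's actual checks may differ.

```shell
# Map a test's exit status to a result label.
# Assumption: 38 and 61 mean "the provider does not support this",
# matching the ret=-38 and ret=-61 messages quoted in this thread.
classify_result() {
    case "$1" in
        0)     echo "Pass" ;;
        38|61) echo "Notrun" ;;
        *)     echo "Fail" ;;
    esac
}
```

Under this sketch, any other non-zero status, including a timeout, is reported as Fail.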
>
> Fail indicates that the test detected some other sort of error from the
> provider that was not expected.  From the output above:
>
>         fi_cq_data -p "verbs":                                  Fail
>
> This is a test I would expect to pass.  With failures, it's usually best
> to re-run the test that failed directly and see if we can get more
> information as to why a failure occurred.
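To re-run a single test by hand, it helps to print the server and client command lines first. The helper below is a hypothetical sketch: the server form follows the server_cmd that runfabtests.sh reports with -v, while the client form (client address after -s, then the server address as the final argument) is an assumption that should be checked against the script's verbose output.

```shell
# Print the command lines for running one fabtests binary directly.
# Usage: print_direct_cmds BIN_DIR SERVER_ADDR CLIENT_ADDR TEST [ARGS...]
print_direct_cmds() {
    bin_dir="$1"; server_addr="$2"; client_addr="$3"
    shift 3
    # Server form as shown in the server_cmd field of the -v output.
    echo "server: $bin_dir/$* -s $server_addr"
    # Client form is an assumption; verify before relying on it.
    echo "client: $bin_dir/$* -s $client_addr $server_addr"
}

print_direct_cmds /opt/fabtests/bin titan-ib.ofa phoebe-ib.ofa fi_cq_data -p verbs
```

The server command is started first on the server node, then the client command on the client node.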
>
> > *     I am also seeing a lot of "Killed by signal 15", which I
> > believe means that the timeout was hit and the run was killed.
> > Should I be increasing my timeout? I would expect the default timeout
> > to be good enough, but I am unsure.
>
> The default timeout should be sufficient in most cases.
>
> > *     As you can see from the output above, there are a few fails.
> > Does this indicate a bug in fabtests or OFED/vendor drivers or simply
> > that I am not running the correct fabtest command?
>
> Yes.  :)  Any failures need to be investigated.  For verbs, we focus
> testing on the 'msg' endpoints.  I would not expect to see any failures
> there.  The 'dgram' endpoint support is limited in its implementation.
> 'Rdm' endpoints are being removed in favor of using the 'rxm' provider over
> verbs.  Using the exclude file may remove these from testing.
>
> - Sean
>



-- 
Cheers,
Stefan Oesterreich
High Performance Computing
UNH InterOperability Laboratory