[libfabric-users] Test results on some machines

Sung-Eun Choi sungeun at cray.com
Mon Feb 13 07:34:28 PST 2017


What version of libfabric are you running?

Do you have all the correct arguments to the various fabtests?  If you
run the provided script with -vv (or maybe -vvv) the verbose output
will include the precise command line for each test.

I also recall having some issues with using nid numbers, so we
usually use explicit IP addresses.  This is probably system dependent.
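
To make the explicit-address suggestion concrete, here is a rough
sketch (not something from the fabtests tree) of generating the two
lines of an srun --multi-prog file so that the client gets an IP
address rather than a nid name.  The helper names to_ip and
multiprog_conf are made up for illustration, the layout just mirrors
the scalable.conf lines quoted below, and which interface address is
the right one to pass will depend on your system.

```python
import ipaddress
import socket


def to_ip(host: str) -> str:
    """Return host unchanged if it is already an IP address, else resolve it."""
    try:
        return str(ipaddress.ip_address(host))
    except ValueError:
        # Not a literal address: fall back to a name lookup (e.g. "nid00091").
        return socket.gethostbyname(host)


def multiprog_conf(test_path: str, server: str, provider: str = "gni") -> str:
    """Build a two-task multi-prog file: task 0 is the server, task 1 the client."""
    addr = to_ip(server)
    return (
        f"0 {test_path} -p {provider}\n"
        f"1 {test_path} -p {provider} {addr}\n"
    )


if __name__ == "__main__":
    # Path and address are taken from the quoted log; adjust for your system.
    print(multiprog_conf(
        "/users/biddisco/apps/fabtests/bin/fi_msg_pingpong",
        "148.187.32.92",
    ), end="")
```

Writing the returned text to ./scalable.conf and launching it with the
same srun line as in the quoted output should then start the client
against the address instead of the nid name.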

Also, I forgot to answer your question about FI_EP_MSG.  The initial
version is at the head of master and will be released with 1.5.

-- Sung

On Mon, Feb 13, 2017 at 03:23:31PM +0000, Biddiscombe, John A. wrote:
> > In order to launch the fabtests with the gni provider, you either need
> > to do it by hand or via CCM mode.  Please see our wiki for directions:
> 
> Sorry, when I collected those outputs, I forgot about the fabtests instructions.
> I have already tried the manual method outlined on the page and it does not give any better results. I’m using the same script to run fi_pingpong (which works) as for each of the fabtests examples, and none of them appears to run properly.
> 
> Any other ideas?
> 
> JB
> 
> For example  
> 
> ./frun.sh ~/apps/fabtests/bin/fi_msg_pingpong
> running /users/biddisco/apps/fabtests/bin/fi_msg_pingpong   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_msg_pingpong -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_msg_pingpong -p gni   nid00091
> 
> 1: fi_connect(): common/shared.c:587, ret=-5 (Input/output error)
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> srun: error: nid00091: task 0: Exited with exit code 61
> srun: Terminating job step 786161.55
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00092: task 1: Exited with exit code 5
> 
> 
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_rdm_rma_simple
> running /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple -p gni   nid00091
> 
> 1: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> srun: error: nid00092: task 1: Exited with exit code 61
> srun: Terminating job step 786161.57
> 0: slurmstepd: error: *** STEP 786161.57 ON nid00091 CANCELLED AT 2017-02-13T16:20:18 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00091: task 0: Killed
> 
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_msg_bw
> running /users/biddisco/apps/fabtests/bin/fi_msg_bw   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_msg_bw -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_msg_bw -p gni   nid00091
> 
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> 1: fi_connect(): common/shared.c:587, ret=-5 (Input/output error)
> srun: error: nid00091: task 0: Exited with exit code 61
> srun: Terminating job step 786161.59
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00092: task 1: Exited with exit code 5
> 
> 