[libfabric-users] Test results on some machines

Ilango, Arun arun.ilango at intel.com
Mon Feb 13 09:46:46 PST 2017


> libfabric:ofi-rxm:core:ofi_check_ep_attr():397<info> Unsupported endpoint type
> libfabric:ofi-rxm:core:ofi_check_ep_attr():398<info> Supported: FI_EP_RDM
> libfabric:ofi-rxm:core:ofi_check_ep_attr():398<info> Requested: FI_EP_MSG

I want to add that these logs come from the rxm provider, not gni. The gni provider's logs would be prefixed with "libfabric:ofi-gni".
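
To make the prefix check concrete, here is a minimal Python sketch that pulls the provider name out of a log line. It assumes the layout seen in the three lines quoted above (libfabric:<provider>:<subsystem>:<function>():<line><level> message); other libfabric versions may format their logs differently, and the helper name parse_log_line is made up for this illustration.

```python
import re

# Assumed layout, taken only from the log lines quoted above:
#   libfabric:<provider>:<subsystem>:<function>():<line><level> <message>
LOG_RE = re.compile(
    r"^libfabric:(?P<provider>[^:]+):(?P<subsystem>[^:]+):"
    r"(?P<function>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<message>.*)$"
)


def parse_log_line(text):
    """Return the prefix fields as a dict, or None if the line does not match."""
    match = LOG_RE.match(text.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = ("libfabric:ofi-rxm:core:ofi_check_ep_attr():397<info> "
              "Unsupported endpoint type")
    fields = parse_log_line(sample)
    # The provider field shows which provider emitted the message.
    print(fields["provider"])
```

Filtering a log on the provider field is a quick way to see which provider actually rejected the request.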

> the verbs provider seems to work, though the number of not-run tests is disturbing
The verbs provider supports only a minimal set of libfabric features. Please take a look at the fi_verbs man page to see which features are supported.

> When I run fi_pingpong from the libfabric build (not the one from fabtests), I can get it working on gni, but not with verbs.
This shouldn't happen. Let me look into this.

Thanks,
Arun.

-----Original Message-----
From: Libfabric-users [mailto:libfabric-users-bounces at lists.openfabrics.org] On Behalf Of Sung-Eun Choi
Sent: Monday, February 13, 2017 7:34 AM
To: Biddiscombe, John A. <biddisco at cscs.ch>
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Test results on some machines

What version of libfabric are you running?

Do you have all the correct arguments to the various fabtests?  If you run the provided script with -vv (or maybe -vvv) the verbose output will include the precise command line for each test.

I also recall having some issues with using nid numbers, so we usually use explicit IP addresses.  This is probably system dependent.
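
One way to avoid passing a nid name is to resolve it to an address before building the client command line. The Python sketch below is only an illustration: the "-p gni" flag and the positional server argument are copied from the command lines quoted further down, the helper name client_command is hypothetical, and whether a nid name resolves through the normal resolver is system dependent.

```python
import socket


def client_command(test_binary, server_host, provider="gni",
                   resolve=socket.gethostbyname):
    """Build the client argv, passing the server as an explicit IPv4 address.

    `resolve` is injectable so the lookup can be replaced on systems where
    node names are not known to the standard resolver (an assumption here).
    """
    server_ip = resolve(server_host)
    return [test_binary, "-p", provider, server_ip]


if __name__ == "__main__":
    # "localhost" stands in for a real node name in this example.
    print(" ".join(client_command("fi_msg_pingpong", "localhost")))
```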

Also, I forgot to answer your question about FI_EP_MSG.  The initial implementation is at the head of master and will be released with 1.5.

-- Sung

On Mon, Feb 13, 2017 at 03:23:31PM +0000, Biddiscombe, John A. wrote:
> > In order to launch the fabtests with the gni provider, you either need
> > to do it by hand or via CCM mode.  Please see our wiki for directions:
> 
> Sorry, when I collected those outputs, I forgot about the fabtests instructions.
> I have already tried the manual method outlined on the page and it
> does not give any better results. I’m using the same script to run
> fi_pingpong (which works) as for each of the fabtests examples, and none
> of them appear to run properly.
> 
> Any other ideas?
> 
> JB
> 
> For example
> 
> ./frun.sh ~/apps/fabtests/bin/fi_msg_pingpong
> running /users/biddisco/apps/fabtests/bin/fi_msg_pingpong   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_msg_pingpong -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_msg_pingpong -p gni   nid00091
> 
> 1: fi_connect(): common/shared.c:587, ret=-5 (Input/output error)
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> srun: error: nid00091: task 0: Exited with exit code 61
> srun: Terminating job step 786161.55
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00092: task 1: Exited with exit code 5
> 
> 
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_rdm_rma_simple
> running /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_rdm_rma_simple -p gni   nid00091
> 
> 1: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> srun: error: nid00092: task 1: Exited with exit code 61
> srun: Terminating job step 786161.57
> 0: slurmstepd: error: *** STEP 786161.57 ON nid00091 CANCELLED AT 2017-02-13T16:20:18 ***
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00091: task 0: Killed
> 
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_msg_bw
> running /users/biddisco/apps/fabtests/bin/fi_msg_bw   on nid000[91-92]
> nid00091 is 148.187.32.92
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_msg_bw -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_msg_bw -p gni   nid00091
> 
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> 1: fi_connect(): common/shared.c:587, ret=-5 (Input/output error)
> srun: error: nid00091: task 0: Exited with exit code 61
> srun: Terminating job step 786161.59
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00092: task 1: Exited with exit code 5
> 
> 
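
For anyone comparing runs like the ones quoted above, a small Python sketch such as the following can pull the failing call, source location, and return code out of each fabtests error line. It assumes the exact line shape seen in this output (a task label from srun -l, then "call(): file:line, ret=N (text)"); the helper name parse_failure is hypothetical. In the output above the srun exit code of each task equals the negated ret value (61 and 5), and on Linux 61 is ENODATA, which fi_getinfo returns when nothing matches the request.

```python
import re

# Assumed shape, taken from the quoted output (optional "> " quote marker,
# then the task label added by "srun -l"):
#   1: fi_connect(): common/shared.c:587, ret=-5 (Input/output error)
ERR_RE = re.compile(
    r"^(?:>\s*)?(?P<task>\d+):\s+(?P<call>\w+)\(\):\s+"
    r"(?P<file>[^:]+):(?P<line>\d+),\s+ret=(?P<ret>-?\d+)\s+\((?P<text>[^)]*)\)"
)


def parse_failure(line):
    """Return the fields of a fabtests error line, or None for any other line."""
    match = ERR_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("task", "line", "ret"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    sample = "> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)"
    failure = parse_failure(sample)
    print(failure["task"], failure["call"], failure["ret"])
```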