[libfabric-users] FI_EP_MSG on cray

Biddiscombe, John A. biddisco at cscs.ch
Tue Feb 14 11:10:05 PST 2017


The nodes I requested (in exclusive mode) are non-GPU nodes.

XC40 Compute Nodes 
Intel® Xeon® E5-2695 v4 @ 2.10GHz (2 x 18 cores, 64/128 GB RAM) (daint-mc)

My SLURM allocation uses

salloc -N 2 -C mc --time=02:00:00 --exclusive

I can try on GPU nodes to see if anything is different.

I also have access to another Cray with KNL + Omni-Path that I’ll test on, just out of curiosity.

JB

From: Howard Pritchard <hppritcha at gmail.com>
Date: Tuesday, 14 February 2017 at 19:47
To: John Biddiscombe <biddisco at cscs.ch>
Cc: "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] FI_EP_MSG on cray

Hi John,

These messages look like the type you get if you don't have exclusive access to the node. Does your system use ALPS or SLURM? There's another factor as well: do these nodes have GPUs? This may impact your job's Aries hw resource limits. We don't typically test libfabric on Cray XC nodes with GPUs.

Howard


2017-02-14 7:26 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
Sorry Howard, this went into my spam folder and I missed it.
 
I have run the gni test and it creates rather a lot of output when debug is enabled.
I’ve put the output (800MB) here ftp://ftp.cscs.ch/out/biddisco/cray/gnitestout.txt
 
The synopsis is 
[====] Synthesis: Tested: 631 | Passing: 573 | Failing: 58 | Crashing: 57
 
with the majority of errors being of the form
[   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
with occasional
[   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned error - No space left on device
 
(but from what I read the gnitest only runs on one node, so it may not be much use).
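
For a log of that size, tallying the failures by ioctl and error string is easier than reading it. A minimal Python sketch, assuming every failure line has the same shape as the two examples above; the name tally_ioctl_errors is illustrative and is not part of libfabric or uGNI:

```python
import re
from collections import Counter

# Matches e.g. "ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument"
# and also lines with extra text (such as "NIC[0]") before "returned error".
_ERR = re.compile(r"ioctl\((GNI_IOC_\w+)\).*?returned error - (.*\S)")


def tally_ioctl_errors(lines):
    """Count failing ioctls, keyed by (ioctl name, error message)."""
    counts = Counter()
    for line in lines:
        m = _ERR.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts
```

Passing an open file object (for example the downloaded gnitestout.txt) as `lines` streams the log line by line, so the whole 800MB never has to be held in memory; `tally_ioctl_errors(fh).most_common()` then lists the most frequent failures first.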
 
Thanks for taking the time to investigate.
 
PS. I forgot to ask: if FI_EP_MSG for gni is due in 1.5.0, what sort of timescale would one expect for that?
 
JB
 
From: Howard Pritchard [mailto:hppritcha at gmail.com] 
Sent: 14 February 2017 00:08
To: Biddiscombe, John A.
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] FI_EP_MSG on cray
 
Hi John,
 
Could you try the run_gnitest script with this UGNI debug level set?   I'd like to understand why that's failing for you.
 
I cannot get fi_pingpong to work with FI_EP_MSG for the GNI provider. It should work, though. I filed an issue on the GNI downstream provider repo.
 
Howard
 
 
 
 
 
2017-02-13 13:21 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
Howard, here’s some output …
 
The machine is the Cray Piz Daint at CSCS.
 
Allocation as follows
 
salloc -N 2 -C mc --time=02:00:00 --exclusive
 
daint102:/scratch/snx3000/biddisco/build$ export UGNI_DEBUG=10
daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_msg
running /users/biddisco/apps/fabtests/bin/fi_msg   on nid00[722,724]
nid00722 is 148.187.34.215
Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
0 /users/biddisco/apps/fabtests/bin/fi_msg -p gni
1 /users/biddisco/apps/fabtests/bin/fi_msg -p gni   148.187.34.215
 
0: [    44] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0 GNII_debug_mask: 0x0 GNII_debug_inst_id: 44
0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
0: [    44]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit: 509
0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
1: [    36] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0 GNII_debug_mask: 0x0 GNII_debug_inst_id: 36
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit: 509
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   FMA: GNI_CdmAttach: FMA window size: 32768
1: [    36]   FMA: GNI_CdmAttach: NOPRIV_ERR masked
1: [    36]   JOB: GNI_CdmAttach: ptag = 36 inst_id = 13864961 fma_window = 0x0000000000000000 fma_ctrl = 0x0000000000000000
1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 1, mode = 2, rd_index_ptr = 0x2aaaaaad7ba0, queue = 0x2aaaaaad5000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 394, mode = 20, rd_index_ptr = 0x2aaaaaadc000, queue = 0x2aaaaaad8000, intr_mask = (nil)
1: [    36] FLBTE: GNII_FlbteInit: FLBTE: tx_counter 0x2aaaaaace008, chan 2, max_len -1, total 511
1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 395, mode = 4, rd_index_ptr = 0x2aaaaaae6000, queue = 0x2aaaaaade000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 396, mode = 5, rd_index_ptr = 0x2aaaaaaef000, queue = 0x2aaaaaae7000, intr_mask = 0x2aaaaaacf000
1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x2aaaaaaf0000
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 397, mode = 0, rd_index_ptr = 0x2aaaaab11000, queue = 0x2aaaaaaf0000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x2aaaaab23000
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 398, mode = 1, rd_index_ptr = 0x2aaaaab12000, queue = 0x2aaaaab23000, intr_mask = 0x2aaaaaacf004
1: [    36]    MR: GNI_MemRegister: Mem reg of 136314880 length at addr 0x2aaaae400000
srun: error: nid00722: task 0: Exited with exit code 61
srun: Terminating job step 789872.11
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid00724: task 1: Killed
daint102:/scratch/snx3000/biddisco/build$
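
The per-rank resource limits that GNI_GetJobResInfo reports in output like the above can be pulled out with a short script, which makes it easier to compare allocations (for example exclusive versus shared nodes). A sketch in Python, assuming srun -l style rank prefixes ("0: ", "1: ") and the line format shown in this log; job_resource_limits is a hypothetical helper name:

```python
import re

# Matches e.g.
# "0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123"
_RES = re.compile(
    r"^(\d+): .*GNI_GetJobResInfo: job resource: (\w+) \(\d+\) "
    r"used: (\d+) limit: (\d+)"
)


def job_resource_limits(lines):
    """Return {(rank, resource): (used, limit)} from rank-prefixed debug output."""
    limits = {}
    for line in lines:
        m = _RES.match(line)
        if m:
            rank, name, used, limit = m.groups()
            limits[(int(rank), name)] = (int(used), int(limit))
    return limits
```

On the log above this would report the same FMA and CQ limits for both ranks, which fits the failure on task 0 coming from fi_getinfo() rather than from a resource limit.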

_______________________________________________
Libfabric-users mailing list
Libfabric-users at lists.openfabrics.org
http://lists.openfabrics.org/mailman/listinfo/libfabric-users
 




