[libfabric-users] FI_EP_MSG on cray

Biddiscombe, John A. biddisco at cscs.ch
Tue Feb 14 06:26:37 PST 2017


Sorry Howard, this went into my spam folder and I missed it.

I have run the gni test and it creates rather a lot of output when debug is enabled.
I’ve put the output (800MB) here ftp://ftp.cscs.ch/out/biddisco/cray/gnitestout.txt

The synopsis is
[====] Synthesis: Tested: 631 | Passing: 573 | Failing: 58 | Crashing: 57

with the majority of errors being of the form
[   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
with occasional
[   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned error - No space left on device

(but from what I read the gnitest only runs on one node, so it may not be much use).
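
With a log of this size, one way to get an overview is to stream it and tally the error lines by function and message. The following is a minimal Python sketch and not something that was run for this report: the regular expression is an assumption based only on the two error lines quoted above, and the names ERROR_RE, SAMPLE and tally_errors are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape, taken from the two samples quoted above, e.g.
#   [   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
ERROR_RE = re.compile(
    r"\[\s*\d+\]\s+(?P<subsys>\w+):\s+(?P<func>\w+):\s+"
    r"ioctl\((?P<ioctl>\w+)\).*?returned error - (?P<msg>.+?)\s*$"
)


def tally_errors(lines):
    """Count (function, ioctl, message) triples over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match["func"], match["ioctl"], match["msg"])] += 1
    return counts


# Illustrative input only: the two quoted error lines plus one non-error line.
SAMPLE = [
    "[   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument\n",
    "[   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned error - No space left on device\n",
    "[   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)\n",
]

counts = tally_errors(SAMPLE)
for key, n in counts.most_common():
    print(n, *key)
```

To run it over the real file, pass an open file object instead of SAMPLE. Iterating a file object reads one line at a time, so the whole log never has to be held in memory.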

Thanks for taking the time to investigate.

PS. I forgot to ask: if FI_EP_MSG support for gni is due in 1.5.0, what sort of timescale should one expect for that release?

JB

From: Howard Pritchard [mailto:hppritcha at gmail.com]
Sent: 14 February 2017 00:08
To: Biddiscombe, John A.
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] FI_EP_MSG on cray

Hi John,

Could you try the run_gnitest script with this UGNI debug level set?   I'd like to understand why that's failing for you.

I cannot get fi_pingpong to work with FI_EP_MSG for the GNI provider. It should work, though. I filed an issue on the GNI downstream provider repo.

Howard





2017-02-13 13:21 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
Howard, here’s some output …

The machine is the Cray Piz Daint at CSCS.

The allocation was as follows:

salloc -N 2 -C mc --time=02:00:00 --exclusive

daint102:/scratch/snx3000/biddisco/build$ export UGNI_DEBUG=10
daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_msg
running /users/biddisco/apps/fabtests/bin/fi_msg   on nid00[722,724]
nid00722 is 148.187.34.215
Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
0 /users/biddisco/apps/fabtests/bin/fi_msg -p gni
1 /users/biddisco/apps/fabtests/bin/fi_msg -p gni   148.187.34.215

0: [    44] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0 GNII_debug_mask: 0x0 GNII_debug_inst_id: 44
0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
0: [    44]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit: 509
0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
1: [    36] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0 GNII_debug_mask: 0x0 GNII_debug_inst_id: 36
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit: 509
1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
1: [    36]   FMA: GNI_CdmAttach: FMA window size: 32768
1: [    36]   FMA: GNI_CdmAttach: NOPRIV_ERR masked
1: [    36]   JOB: GNI_CdmAttach: ptag = 36 inst_id = 13864961 fma_window = 0x0000000000000000 fma_ctrl = 0x0000000000000000
1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 1, mode = 2, rd_index_ptr = 0x2aaaaaad7ba0, queue = 0x2aaaaaad5000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted entries: 1395 alloc_count: 1396
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 394, mode = 20, rd_index_ptr = 0x2aaaaaadc000, queue = 0x2aaaaaad8000, intr_mask = (nil)
1: [    36] FLBTE: GNII_FlbteInit: FLBTE: tx_counter 0x2aaaaaace008, chan 2, max_len -1, total 511
1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 395, mode = 4, rd_index_ptr = 0x2aaaaaae6000, queue = 0x2aaaaaade000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted entries: 2559 alloc_count: 2560
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 396, mode = 5, rd_index_ptr = 0x2aaaaaaef000, queue = 0x2aaaaaae7000, intr_mask = 0x2aaaaaacf000
1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x2aaaaaaf0000
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 397, mode = 0, rd_index_ptr = 0x2aaaaab11000, queue = 0x2aaaaaaf0000, intr_mask = (nil)
1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted entries: 16895 alloc_count: 16896
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed trying without PHYS_MEM
1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr 0x2aaaaab23000
1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 398, mode = 1, rd_index_ptr = 0x2aaaaab12000, queue = 0x2aaaaab23000, intr_mask = 0x2aaaaaacf004
1: [    36]    MR: GNI_MemRegister: Mem reg of 136314880 length at addr 0x2aaaae400000
srun: error: nid00722: task 0: Exited with exit code 61
srun: Terminating job step 789872.11
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: nid00724: task 1: Killed
daint102:/scratch/snx3000/biddisco/build$
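
The ret=-61 printed by fi_getinfo() on task 0 and the exit code 61 that srun reports for the same task appear to be the same number. The following is a small Python sketch, added for illustration only, that pulls the code out of such a line and looks it up in the local errno table. The helper name parse_ret and the pattern RET_RE are hypothetical and based solely on the one fi_getinfo() line in the log above. Errno numbering is platform-specific, so the symbolic name it prints depends on where it is run.

```python
import errno
import re

# Assumed shape of the fabtests error line: "..., ret=-61 (No data available)"
RET_RE = re.compile(r"ret=(-?\d+) \((.*?)\)")


def parse_ret(line):
    """Return (positive error code, message text), or None if the line does not match."""
    match = RET_RE.search(line)
    if match is None:
        return None
    return abs(int(match.group(1))), match.group(2)


line = "0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)"
code, text = parse_ret(line)

# errno.errorcode maps numeric codes to symbolic names on the local platform.
name = errno.errorcode.get(code, "unknown")
print(code, name, text)
```

On a Linux system the lookup would be expected to give ENODATA, which matches the "No data available" text in the log.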

_______________________________________________
Libfabric-users mailing list
Libfabric-users at lists.openfabrics.org
http://lists.openfabrics.org/mailman/listinfo/libfabric-users
