[libfabric-users] FI_EP_MSG on cray

Howard Pritchard hppritcha at gmail.com
Tue Feb 14 10:47:03 PST 2017


Hi John,

These messages look like the kind you get when you don't have exclusive
access to the node.  Does your system use ALPS or SLURM?  There's another
factor as well: do these nodes have GPUs?  This may affect your job's
Aries hardware resource limits.  We don't typically test libfabric on
Cray XC nodes with GPUs.
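
For triaging a UGNI debug log of this size, a small script can tally the
failing ioctl calls and pull out the per-job resource limits that
GNI_GetJobResInfo reports.  The following is a minimal Python sketch, not
part of libfabric or fabtests.  The function name summarize and both
regular expressions are assumptions based only on the log lines quoted in
this thread, and it assumes each log message sits on a single line in the
raw file (the quoted copies below were wrapped by the mail client).

```python
import re
from collections import Counter

# Assumed format, taken from the quoted log lines, e.g.
#   [   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
ERROR_RE = re.compile(r"ioctl\((\w+)\).*?returned error - (.+?)\s*$")

# Assumed format, taken from the quoted log lines, e.g.
#   [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123
LIMIT_RE = re.compile(r"job resource: (\w+) \(\d+\) used: (\d+) limit: (\d+)")


def summarize(lines):
    """Count ioctl errors and collect job resource limits from log lines.

    Returns (errors, limits) where errors maps (ioctl name, error text)
    to a count and limits maps a resource name to (used, limit).
    """
    errors = Counter()
    limits = {}
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            errors[(m.group(1), m.group(2))] += 1
            continue
        m = LIMIT_RE.search(line)
        if m:
            limits[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return errors, limits


# Sample lines copied from this thread (rejoined onto single lines).
sample = [
    "0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit: 123",
    "0: [    44]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit: 509",
    "1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)",
    "1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument",
    "[   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument",
    "[   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned error - No space left on device",
]

if __name__ == "__main__":
    errors, limits = summarize(sample)
    for (ioctl, text), count in errors.most_common():
        print(f"{count:6d}  {ioctl}: {text}")
    for name, (used, limit) in limits.items():
        print(f"{name}: used {used} of {limit}")
```

Because summarize only iterates over its argument, a large file can be
streamed instead of loaded into memory, for example by passing the handle
from open("gnitestout.txt", errors="replace") directly.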

Howard


2017-02-14 7:26 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:

> Sorry Howard, this went into my spam folder and I missed it.
>
>
>
> I have run the gni test and it creates rather a lot of output when debug
> is enabled.
>
> I’ve put the output (800MB) here:
> ftp://ftp.cscs.ch/out/biddisco/cray/gnitestout.txt
>
>
>
> The synopsis is
>
> [====] Synthesis: Tested: 631 | Passing: 573 | Failing: 58 | Crashing: 57
>
>
>
> with the majority of errors being of the form
>
> [   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error -
> Invalid argument
>
> with occasional
>
> [   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned
> error - No space left on device
>
>
>
> (but from what I read the gnitest only runs on one node, so it may not be
> much use).
>
>
>
> Thanks for taking the time to investigate.
>
>
>
> PS. I forgot to ask - if FI_EP_MSG for gni is due in 1.5.0, then what
> sort of timescale would one expect for that?
>
>
>
> JB
>
>
>
> *From:* Howard Pritchard [mailto:hppritcha at gmail.com]
> *Sent:* 14 February 2017 00:08
> *To:* Biddiscombe, John A.
> *Cc:* libfabric-users at lists.openfabrics.org
> *Subject:* Re: [libfabric-users] FI_EP_MSG on cray
>
>
>
> Hi John,
>
>
>
> Could you try the run_gnitest script with this UGNI debug level set?
> I'd like to understand why that's failing for you.
>
>
>
> I cannot get fi_pingpong to work with FI_EP_MSG for the GNI provider.  It
> should work though.  I filed an issue on the GNI downstream provider repo.
>
>
>
> Howard
>
> 2017-02-13 13:21 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
>
> Howard, here’s some output …
>
>
>
> The machine is the Cray Piz Daint at CSCS.
>
>
>
> Allocation as follows
>
>
>
> salloc -N 2 -C mc --time=02:00:00 --exclusive
>
>
>
> daint102:/scratch/snx3000/biddisco/build$ export UGNI_DEBUG=10
>
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh
> ~/apps/fabtests/bin/fi_msg
>
> running /users/biddisco/apps/fabtests/bin/fi_msg   on nid00[722,724]
>
> nid00722 is 148.187.34.215
>
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog
> ./scalable.conf
>
> 0 /users/biddisco/apps/fabtests/bin/fi_msg -p gni
>
> 1 /users/biddisco/apps/fabtests/bin/fi_msg -p gni   148.187.34.215
>
>
>
> 0: [    44] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0
> GNII_debug_mask: 0x0 GNII_debug_inst_id: 44
>
> 0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
>
> 0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit:
> 123
>
> 0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
>
> 0: [    44]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit:
> 509
>
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
>
> 1: [    36] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0
> GNII_debug_mask: 0x0 GNII_debug_inst_id: 36
>
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
>
> 1: [    36]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit:
> 123
>
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
>
> 1: [    36]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit:
> 509
>
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
>
> 1: [    36]   FMA: GNI_CdmAttach: FMA window size: 32768
>
> 1: [    36]   FMA: GNI_CdmAttach: NOPRIV_ERR masked
>
> 1: [    36]   JOB: GNI_CdmAttach: ptag = 36 inst_id = 13864961 fma_window
> = 0x0000000000000000 fma_ctrl = 0x0000000000000000
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted
> entries: 1395 alloc_count: 1396
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 1, mode = 2,
> rd_index_ptr = 0x2aaaaaad7ba0, queue = 0x2aaaaaad5000, intr_mask = (nil)
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted
> entries: 1395 alloc_count: 1396
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 394, mode =
> 20, rd_index_ptr = 0x2aaaaaadc000, queue = 0x2aaaaaad8000, intr_mask = (nil)
>
> 1: [    36] FLBTE: GNII_FlbteInit: FLBTE: tx_counter 0x2aaaaaace008, chan
> 2, max_len -1, total 511
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted
> entries: 2559 alloc_count: 2560
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 395, mode =
> 4, rd_index_ptr = 0x2aaaaaae6000, queue = 0x2aaaaaade000, intr_mask = (nil)
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted
> entries: 2559 alloc_count: 2560
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 396, mode =
> 5, rd_index_ptr = 0x2aaaaaaef000, queue = 0x2aaaaaae7000, intr_mask =
> 0x2aaaaaacf000
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted
> entries: 16895 alloc_count: 16896
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error -
> Invalid argument
>
> 1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed
> trying without PHYS_MEM
>
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr
> 0x2aaaaaaf0000
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 397, mode =
> 0, rd_index_ptr = 0x2aaaaab11000, queue = 0x2aaaaaaf0000, intr_mask = (nil)
>
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted
> entries: 16895 alloc_count: 16896
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error -
> Invalid argument
>
> 1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed
> trying without PHYS_MEM
>
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr
> 0x2aaaaab23000
>
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
>
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 398, mode =
> 1, rd_index_ptr = 0x2aaaaab12000, queue = 0x2aaaaab23000, intr_mask =
> 0x2aaaaaacf004
>
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 136314880 length at addr
> 0x2aaaae400000
>
> srun: error: nid00722: task 0: Exited with exit code 61
>
> srun: Terminating job step 789872.11
>
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> srun: error: nid00724: task 1: Killed
>
> daint102:/scratch/snx3000/biddisco/build$
>
>