[libfabric-users] FI_EP_MSG on cray

Howard Pritchard hppritcha at gmail.com
Tue Feb 14 14:45:05 PST 2017


Hi John,

This is odd.  Could you check what resource limits you're getting when you
salloc a node:

salloc -N 1 --exclusive

srun -n 1 cat  /sys/class/gni/kgni0/resources


and post the output?  You should be seeing something like:

--- PTag: 104 PKey: 0x1734 JobId: 0x900000d62 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               1806            0
CQ         0               495
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               134217728
PCI-IOMMU  0               -1
CE         0               1
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1

--- PTag: 105 PKey: 0x1735 JobId: 0x900000d62 RefCount: 1 Suspend: Idle ---
Name       Used            Limit           HWM
MDD        0               1806            0
CQ         0               495
FMA        0               123
SFMA       0               123
RDMA       0               -1
DIRECT     0               -1
IOMMU      0               134217728
PCI-IOMMU  0               -1
CE         0               1
DLA        0               15360
non-VMDH   0               -1
SMDD Hold  0               -1
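
If it helps when comparing against the listing above, here is a minimal shell sketch that pulls just the CQ and FMA rows out of that file for each protection tag. It assumes the file layout matches the sample output shown here; gni_limits is a made-up helper name, not part of any Cray tool.

```shell
# gni_limits: read a kgni "resources" listing on stdin and print one line
# per protection tag for the CQ and FMA rows, as: <ptag> <name> <used> <limit>
# Assumption: the layout matches the sample listing (a "--- PTag: N ..."
# header followed by "Name Used Limit" rows).
gni_limits() {
    awk '
        # Section header: remember which protection tag the rows belong to.
        $1 == "---" && $2 == "PTag:" { ptag = $3; next }
        # Exact match on the first field, so SFMA is not reported as FMA.
        $1 == "CQ" || $1 == "FMA"    { print ptag, $1, $2, $3 }
    '
}
```

It could be used as: srun -n 1 cat /sys/class/gni/kgni0/resources | gni_limits. The uGNI debug log quoted further down reports a CQ limit of 509 and an FMA limit of 123 through GNI_GetJobResInfo, so those are the two rows worth comparing.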

Howard

2017-02-14 14:36 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:

> Yes, the version I’m using is just as is found here
>
>
>
> https://github.com/ofi-cray/libfabric-cray/blob/master/prov/gni/test/run_gnitest#L42
>
>
>
> so it should be set
>
>
>
> JB
>
>
>
> From: Howard Pritchard <hppritcha at gmail.com>
> Date: Tuesday, 14 February 2017 at 21:05
> To: John Biddiscombe <biddisco at cscs.ch>
> Cc: "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
> Subject: Re: [libfabric-users] FI_EP_MSG on cray
>
>
>
> Hi John,
>
>
>
> On the Cray with KNL+omnipath you'll end up using the PSM2 provider.
>
> Could you double check that, in your copy of run_gnitest,
>
> export UGNI_FMA_SHARED=0
>
> is being set?
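
One quick way to check this is a grep for an uncommented export line in the script. The sketch below is only an illustration; has_export is a hypothetical helper name, and it only looks at the script text, not at what the environment contains when the tests actually run.

```shell
# has_export VAR FILE: print (with line numbers) every uncommented
# "export VAR=..." line in FILE; the exit status is non-zero if there is none.
has_export() {
    grep -n -E "^[[:space:]]*export[[:space:]]+$1=" "$2"
}
```

For example, from the top of a libfabric-cray checkout: has_export UGNI_FMA_SHARED prov/gni/test/run_gnitest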
>
> Howard
>
>
>
>
>
> 2017-02-14 12:10 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
>
> The nodes I requested (in exclusive mode) are non-GPU nodes.
>
> XC40 Compute Nodes
> Intel® Xeon® E5-2695 v4 @ 2.10GHz (2 x 18 cores, 64/128 GB RAM) (daint-mc)
>
> my slurm allocation uses
>
> salloc -N 2 -C mc --time=02:00:00 --exclusive
>
> I can try on GPU nodes to see if anything is different.
>
> I have another cray with KNL + omnipath I’ll test on just out of curiosity.
>
> JB
>
> From: Howard Pritchard <hppritcha at gmail.com>
> Date: Tuesday, 14 February 2017 at 19:47
> To: John Biddiscombe <biddisco at cscs.ch>
> Cc: "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
>
> Subject: Re: [libfabric-users] FI_EP_MSG on cray
>
> Hi John,
>
> These messages look like the type you get if you don't have exclusive
> access to the node.  Does your system use
> ALPS or SLURM?  There's another factor as well: do these nodes have
> GPUs?  This may impact your job's Aries hardware resource limits.  We don't
> typically test libfabric on Cray XC nodes with GPUs.
>
> Howard
>
>
> 2017-02-14 7:26 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
> Sorry Howard, this went into my spam folder and I missed it.
>
> I have run the gni test and it creates rather a lot of output when debug
> is enabled.
> I’ve put the output (800MB) here: ftp://ftp.cscs.ch/out/biddisco/cray/gnitestout.txt
>
> The synopsis is
> [====] Synthesis: Tested: 631 | Passing: 573 | Failing: 58 | Crashing: 57
>
> with the majority of errors being of the form
> [   240]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error - Invalid argument
> with occasional
> [   240]   JOB: GNI_CdmAttach: ioctl(GNI_IOC_NIC_SETATTR) NIC[0] returned error - No space left on device
>
> (but from what I read the gnitest only runs on one node, so it may not be
> much use).
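
With an 800MB log, a per-error tally may be easier to read than the raw output. The following is a rough sketch that counts each distinct ioctl/error pair, assuming each message sits on a single line in the log file itself and follows the "ioctl(NAME) ... returned error - MESSAGE" shape of the two examples above; tally_ioctl_errors is a hypothetical helper name.

```shell
# tally_ioctl_errors: read uGNI debug output on stdin and print
# "<count> <IOCTL_NAME>: <error message>" lines, most frequent first.
tally_ioctl_errors() {
    # Keep only lines reporting a failed ioctl, reduced to "NAME: message".
    sed -n -E 's/.*ioctl\(([A-Z_]+)\).*returned error - (.*)$/\1: \2/p' |
        sort | uniq -c | sort -rn |
        # Squeeze the padding that uniq -c puts in front of the count.
        awk '{ $1 = $1; print }'
}
```

For example: tally_ioctl_errors < gnitestout.txt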
>
> Thanks for taking the time to investigate.
>
> PS. I forgot to ask - if FI_EP_MSG for gni is due in 1.5.0, then what
> sort of timescale would one expect for that?
>
> JB
>
> From: Howard Pritchard [mailto:hppritcha at gmail.com]
> Sent: 14 February 2017 00:08
> To: Biddiscombe, John A.
> Cc: libfabric-users at lists.openfabrics.org
> Subject: Re: [libfabric-users] FI_EP_MSG on cray
>
> Hi John,
>
> Could you try the run_gnitest script with this UGNI debug level set?   I'd
> like to understand why that's failing for you.
>
> I cannot get fi_pingpong to work with FI_EP_MSG for the GNI provider.  It
> should work though.  I filed an issue on the GNI downstream provider repo.
>
> Howard
>
>
>
>
>
> 2017-02-13 13:21 GMT-07:00 Biddiscombe, John A. <biddisco at cscs.ch>:
> Howard, here’s some output …
>
> The machine is the Cray Piz Daint at CSCS.
>
> Allocation as follows
>
> salloc -N 2 -C mc --time=02:00:00 --exclusive
>
> daint102:/scratch/snx3000/biddisco/build$ export UGNI_DEBUG=10
> daint102:/scratch/snx3000/biddisco/build$ ./frun.sh ~/apps/fabtests/bin/fi_msg
> running /users/biddisco/apps/fabtests/bin/fi_msg   on nid00[722,724]
> nid00722 is 148.187.34.215
> Generated command is  srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf
> 0 /users/biddisco/apps/fabtests/bin/fi_msg -p gni
> 1 /users/biddisco/apps/fabtests/bin/fi_msg -p gni   148.187.34.215
>
> 0: [    44] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0
> GNII_debug_mask: 0x0 GNII_debug_inst_id: 44
> 0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
> 0: [    44]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit:
> 123
> 0: [    44]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
> 0: [    44]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit:
> 509
> 0: fi_getinfo(): common/shared.c:454, ret=-61 (No data available)
> 1: [    36] GNII_DebugInit: GNII_debug_level: 10 GNII_subsys_debug: 0
> GNII_debug_mask: 0x0 GNII_debug_inst_id: 36
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
> 1: [    36]   JOB: GNI_GetJobResInfo: job resource: FMA (6) used: 0 limit:
> 123
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
> 1: [    36]   JOB: GNI_GetJobResInfo: job resource: CQ (5) used: 0 limit:
> 509
> 1: [    36]   JOB: GNII_GetKernelVersion: kgni version major = 0x0 minor
> 0x45 code 0xb9 built with major = 0x0 minor = 0x45 code 0x4e24
> 1: [    36]   FMA: GNI_CdmAttach: FMA window size: 32768
> 1: [    36]   FMA: GNI_CdmAttach: NOPRIV_ERR masked
> 1: [    36]   JOB: GNI_CdmAttach: ptag = 36 inst_id = 13864961 fma_window
> = 0x0000000000000000 fma_ctrl = 0x0000000000000000
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted
> entries: 1395 alloc_count: 1396
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 1, mode = 2,
> rd_index_ptr = 0x2aaaaaad7ba0, queue = 0x2aaaaaad5000, intr_mask = (nil)
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 1361 reqs: 1361 adjusted
> entries: 1395 alloc_count: 1396
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 394, mode =
> 20, rd_index_ptr = 0x2aaaaaadc000, queue = 0x2aaaaaad8000, intr_mask = (nil)
> 1: [    36] FLBTE: GNII_FlbteInit: FLBTE: tx_counter 0x2aaaaaace008, chan
> 2, max_len -1, total 511
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted
> entries: 2559 alloc_count: 2560
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 395, mode =
> 4, rd_index_ptr = 0x2aaaaaae6000, queue = 0x2aaaaaade000, intr_mask = (nil)
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 2048 reqs: 2048 adjusted
> entries: 2559 alloc_count: 2560
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 396, mode =
> 5, rd_index_ptr = 0x2aaaaaaef000, queue = 0x2aaaaaae7000, intr_mask =
> 0x2aaaaaacf000
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted
> entries: 16895 alloc_count: 16896
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error -
> Invalid argument
> 1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed
> trying without PHYS_MEM
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr
> 0x2aaaaaaf0000
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 397, mode =
> 0, rd_index_ptr = 0x2aaaaab11000, queue = 0x2aaaaaaf0000, intr_mask = (nil)
> 1: [    36]    CQ: GNI_CqCreate: entry_count: 16384 reqs: 16384 adjusted
> entries: 16895 alloc_count: 16896
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)  returned error -
> Invalid argument
> 1: [    36]    CQ: GNI_CqCreate: GNI_IOC_CQ_CREATE with PHYS_MEM failed
> trying without PHYS_MEM
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 135168 length at addr
> 0x2aaaaab23000
> 1: [    36]    CQ: cq_create: ioctl(GNI_IOC_CQ_CREATE)
> 1: [    36]    CQ: cq_create: #1 cq created, kern_cq_descr = 398, mode =
> 1, rd_index_ptr = 0x2aaaaab12000, queue = 0x2aaaaab23000, intr_mask =
> 0x2aaaaaacf004
> 1: [    36]    MR: GNI_MemRegister: Mem reg of 136314880 length at addr
> 0x2aaaae400000
> srun: error: nid00722: task 0: Exited with exit code 61
> srun: Terminating job step 789872.11
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: nid00724: task 1: Killed
> daint102:/scratch/snx3000/biddisco/build$
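
The frun.sh script itself is not shown in this thread, so the following is only a guess at how a two-task --multi-prog file like the one in the transcript could be produced: task 0 runs the test as the server, task 1 runs it as the client pointed at the server's address. write_multiprog_conf is a hypothetical name.

```shell
# write_multiprog_conf BIN PROVIDER SERVER_ADDR
# Print a two-line srun --multi-prog configuration on stdout:
#   task 0 runs BIN as the server, task 1 runs BIN as the client
#   and is given SERVER_ADDR as its final argument.
write_multiprog_conf() {
    printf '0 %s -p %s\n' "$1" "$2"
    printf '1 %s -p %s %s\n' "$1" "$2" "$3"
}
```

For example: write_multiprog_conf ~/apps/fabtests/bin/fi_msg gni 148.187.34.215 > scalable.conf, followed by the srun -n 2 --ntasks-per-node=1 -l --multi-prog ./scalable.conf command from the transcript.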
>