[libfabric-users] Error allocating domain
Biddiscombe, John A.
biddisco at cscs.ch
Wed Jun 17 15:00:30 PDT 2020
Phil - thanks for this info. I will experiment with it. I'm actually away for a few days from tomorrow, so it'll be next week before I get a chance, but I'll report back if I have success (or more problems).
JB
________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: 17 June 2020 19:52:42
To: Biddiscombe, John A.; Howard Pritchard
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain
Hi John,
I know your question is aimed at Howard, but I can offer another data point and an example of a software stack working around this. I've never gotten kdreg to work in executables that are also using Cray's MPI; they conflict. If you want to use udreg as an alternative, then you'll need to do two things:
a) disable kdreg support in libfabric at build time (as in this spack package here: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/libfabric/package.py#L94)
b) explicitly enable and configure udreg outside of libfabric (as in the Mercury libfabric plugin here: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L1778)
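For concreteness, step (b) can be sketched roughly as follows, based on the GNI provider's extension header (rdma/fi_ext_gni.h). This is an illustration of the approach the Mercury plugin takes, not a drop-in snippet - check the exact symbol names against the libfabric headers you actually build against, and note it assumes libfabric was configured with --with-kdreg=no per step (a):

```c
/* Sketch: switch an already-opened GNI domain's memory-registration
 * cache from kdreg to udreg, via the provider's extension ops.
 * Assumes <rdma/fi_ext_gni.h> defines FI_GNI_DOMAIN_OPS_1 and
 * GNI_MR_CACHE, as in libfabric's GNI provider. */
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_ext_gni.h>

static int enable_udreg(struct fid_domain *domain)
{
    struct fi_gni_ops_domain *gni_ops;
    char *cache_type = "udreg";
    int ret;

    /* Obtain the GNI provider's domain extension operations. */
    ret = fi_open_ops(&domain->fid, FI_GNI_DOMAIN_OPS_1, 0,
                      (void **) &gni_ops, NULL);
    if (ret != FI_SUCCESS)
        return ret;

    /* Select udreg instead of the default kdreg-backed cache. */
    return gni_ops->set_val(&domain->fid, GNI_MR_CACHE, &cache_type);
}
```

The domain handle must come from fi_domain() on the gni provider; on any other provider fi_open_ops() will fail and the call can simply be skipped.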
This configuration is stable for us and works fine whether Cray MPI is present or not. I'll defer to Howard about the technical implications, though 🙂
thanks,
-Phil
________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: Wednesday, June 17, 2020 1:32 PM
To: Howard Pritchard <hppritcha at gmail.com>
Cc: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] Error allocating domain
Howard
Given the phrasing "You are hitting a limitation with the ancient kdreg device driver. It may be best to not use it for your libfabric app." - is there anything I can do about it? I can see that there is a udreg directory in /opt/cray - is there anything I can replace the kdreg stuff with?
Thanks
JB
________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: 17 June 2020 17:26:29
To: Howard Pritchard
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain
My config line has always been this (apart from the debug). It has worked for several years, until a recent system maintenance, a change of some kind, or something similar. (Nobody here claims to have changed anything significant.)
./configure --disable-verbs --disable-sockets --disable-usnic --disable-udp --disable-rxm --disable-rxd --disable-shm --disable-mrail --disable-tcp --disable-perf --disable-rstream --enable-gni --prefix=/apps/daint/UES/biddisco/gcc/8.3.0/libfabric CC=/opt/cray/pe/craype/default/bin/cc CFLAGS=-fPIC LDFLAGS=-ldl --no-recursion --enable-debug
JB
________________________________
From: Howard Pritchard <hppritcha at gmail.com>
Sent: 17 June 2020 17:20:21
To: Biddiscombe, John A.
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain
Hi John,
You are hitting a limitation of the ancient kdreg device driver. It may be best not to use it for your libfabric app. What configure options are you using to build libfabric?
Howard
On Tue, 16 June 2020 at 10:34, Biddiscombe, John A. <biddisco at cscs.ch> wrote:
I've got this log when I dump out my own messages and also enable debugging in libfabric - can anyone tell what's wrong from the messages? Code that used to work has stopped working. I upgraded to the libfabric 1.10.1 tag and rebuilt, but that didn't change anything.
The only thing that springs to mind is that the application also uses MPI on the Cray at the same time, so by the time this code is called, MPI_Init has already run, and perhaps the NIC is somehow inaccessible - hence the error. I'm sure it used to work. With ranks = 1 it runs - so perhaps MPI detects a single rank and skips network initialization - but with N > 1 ranks it dies. Any suggestions welcome. Thanks
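For reference, the failing path can be reduced to a minimal probe of the gni provider, independent of the rest of the application. This is a hypothetical standalone repro (the provider name, API version, and error handling here are assumptions, not the author's actual code); it mirrors the log, where fi_fabric() succeeds and fi_domain() is the call that fails:

```c
/* Minimal probe: open the gni fabric and then a domain.
 * In the failing case below, fi_domain() is expected to return
 * an error ("Device or resource busy") when the kdreg device
 * cannot be opened. Hypothetical repro sketch, Cray-specific. */
#include <stdio.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

int main(void)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    struct fid_fabric *fabric = NULL;
    struct fid_domain *domain = NULL;
    int ret;

    /* Restrict discovery to the GNI provider. */
    hints->fabric_attr->prov_name = strdup("gni");

    ret = fi_getinfo(FI_VERSION(1, 10), NULL, NULL, 0, hints, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    ret = fi_fabric(info->fabric_attr, &fabric, NULL);
    if (ret) {
        fprintf(stderr, "fi_fabric: %s\n", fi_strerror(-ret));
        return 1;
    }

    /* The call that fails in the log when kdreg is busy. */
    ret = fi_domain(fabric, info, &domain, NULL);
    if (ret)
        fprintf(stderr, "fi_domain: %s\n", fi_strerror(-ret));
    else
        fi_close(&domain->fid);

    fi_close(&fabric->fid);
    fi_freeinfo(info);
    fi_freeinfo(hints);
    return ret ? 1 : 0;
}
```

Running this with and without MPI_Init having been called first (and with N > 1 ranks) would narrow down whether Cray MPI's use of kdreg is indeed what makes fi_domain() fail.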
JB
<DEB> 0000056511 0x2aaaaab2dec0 cpu 000 nid00219(0) CONTROL Allocating domain
libfabric:69061:gni:core:_gnix_ref_init():254<debug> [69061:1] 0x8579d8 refs 1
libfabric:69061:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:69061:gni:domain:gnix_domain_open():579<trace> [69061:1]
libfabric:69061:gni:fabric:gnix_domain_open():591<info> [69061:1] failed to find authorization key, creating new authorization key
libfabric:69061:gni:domain:_gnix_auth_key_enable():347<info> [69061:1] pkey=dd920000 ptag=14 key_partition_size=0 key_offset=0 enabled
libfabric:69061:gni:domain:gnix_domain_open():597<info> [69061:1] authorization key=0x857a10 ptag 14 cookie 0xdd920000
libfabric:69061:gni:mr:_gnix_notifier_open():88<warn> [69061:1] kdreg device open failed: Device or resource busy
<ERR> 0000056576 0x2aaaaab2dec0 cpu 000 nid00219(0) ERROR__ fi_domain : Device or resource busy
_______________________________________________
Libfabric-users mailing list
Libfabric-users at lists.openfabrics.org
https://lists.openfabrics.org/mailman/listinfo/libfabric-users