[libfabric-users] Error allocating domain

Carns, Philip H. carns at mcs.anl.gov
Tue Jun 23 12:22:55 PDT 2020


Hi John,

You are quickly outpacing what little knowledge I have here, but in our experience you do have to set up either protection domains or credentials to allow GNI RDMA between two processes if they are launched manually.  aprun and srun do this step automatically, so it's not something you usually have to think about for communication between the processes of a single MPI job.

I have an example of how to do this with static protection domains and aprun, but this might not be what you need for your system:

https://xgitlab.cels.anl.gov/sds/sds-tests/blob/master/perf-regression/theta/separate-ssg.qsub

A more recent alternative would be Cray's DRC (Dynamic RDMA Credentials) library.  I do not have an example for DRC, though.  I think that to use DRC you need to make some explicit run-time calls outside of libfabric to set it up, whereas the older static protection domain system was mostly configured on the command line and then passed to the executable's environment via aprun arguments.
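(For reference, the static protection domain workflow looks roughly like the following. This is a sketch from memory, not taken from the qsub script above: the domain name `mypdomain` and the executable names are placeholders, and the exact apmgr options can differ between CLE versions, so check your site's documentation.)

```shell
# Create a user static protection domain (name is a placeholder)
apmgr pdomain -c -u mypdomain

# Launch both executables inside the same protection domain so their
# GNI endpoints share RDMA credentials and can talk to each other
aprun -p mypdomain -n 1 ./server &
aprun -p mypdomain -n 1 ./client

# Release the protection domain when finished
apmgr pdomain -r -u mypdomain
```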

thanks,
-Phil
________________________________
From: Biddiscombe, John A. <biddisco at cscs.ch>
Sent: Tuesday, June 23, 2020 1:08 AM
To: Carns, Philip H. <carns at mcs.anl.gov>; Howard Pritchard <hppritcha at gmail.com>
Cc: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] Error allocating domain


I tried rebuilding libfabric with kdreg disabled and unloading the cray-mpich module:


./configure --disable-verbs --disable-sockets --disable-usnic --disable-udp --disable-rxm --disable-rxd --disable-shm --disable-mrail --disable-tcp --disable-perf --disable-rstream --enable-gni --prefix=/apps/daint/UES/biddisco/gcc/8.3.0/libfabric --no-recursion --enable-debug --with-kdreg=n
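(As an aside, not from the original thread: a quick way to confirm that a build like the one above actually contains the gni provider is to query it with the bundled fi_info utility; the install path below is just the prefix from the configure line.)

```shell
# List the providers compiled into this libfabric install
/apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_info -l

# Query fabric/domain attributes for the gni provider specifically
/apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_info -p gni
```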


the binaries look sensible


nid00023:/scratch/snx3000/biddisco/libfabric ((tags/v1.10.1^0))$ ldd /apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_pingpong
        linux-vdso.so.1 (0x00002aaaaaad3000)
        /apps/daint/UES/xalt/xalt2/software/xalt/2.7.24/lib64/libxalt_init.so (0x00002aaaaacd3000)
        libfabric.so.1 => /apps/daint/UES/biddisco/gcc/8.3.0/libfabric/lib/libfabric.so.1 (0x00002aaaaaee8000)
        libxpmem.so.0 => /opt/cray/xpmem/default/lib64/libxpmem.so.0 (0x00002aaaab259000)
        libudreg.so.0 => /opt/cray/udreg/default/lib64/libudreg.so.0 (0x00002aaaab45c000)
        libalpsutil.so.0 => /opt/cray/alps/default/lib64/libalpsutil.so.0 (0x00002aaaab666000)
        libalpslli.so.0 => /opt/cray/alps/default/lib64/libalpslli.so.0 (0x00002aaaab869000)
        libugni.so.0 => /opt/cray/ugni/default/lib64/libugni.so.0 (0x00002aaaaba6f000)
        libatomic.so.1 => /opt/gcc/8.3.0/snos/lib64/libatomic.so.1 (0x00002aaaabcf3000)
        librt.so.1 => /lib64/librt.so.1 (0x00002aaaabefb000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002aaaac103000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002aaaac321000)
        libc.so.6 => /lib64/libc.so.6 (0x00002aaaac525000)
        /lib64/ld-linux-x86-64.so.2 (0x00002aaaaaaab000)
        libxmlrpc-epi.so.0 => /usr/lib64/libxmlrpc-epi.so.0 (0x00002aaaac8df000)
        libexpat.so.1 => /usr/lib64/libexpat.so.1 (0x00002aaaacaf2000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002aaaacd24000)
        libelf.so.1 => /usr/lib64/libelf.so.1 (0x00002aaaacf27000)
        libodbc.so.2 => /usr/lib64/libodbc.so.2 (0x00002aaaad13f000)
        libwlm_detect.so.0 => /opt/cray/wlm_detect/default/lib64/libwlm_detect.so.0 (0x00002aaaad3af000)
        libz.so.1 => /lib64/libz.so.1 (0x00002aaaad5b2000)
        libltdl.so.7 => /usr/lib64/libltdl.so.7 (0x00002aaaad7c9000)

but running pingpong on two compute nodes gives the same memory registration error. I don't understand what has changed on our system, or why what used to work no longer does. Is it OK to launch jobs by hand, or do they have to be part of an srun script? Here I am manually sshing into two compute nodes and executing pingpong on one and pingpong addr on the other. I suspect some strange permission error because of the message

libfabric:69550:gni:mr:__mr_cache_search_inuse():1205<debug> [69550:1] could not find key in inuse, key=2aaaadbd5000:c01000


If anyone has any idea what might be wrong, please let me know. Thanks.
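(For completeness: the debug output below is the kind of verbosity you get from libfabric's standard logging environment variables. A sketch of the usual invocation, assuming the same install prefix as above:)

```shell
# Standard libfabric runtime knobs (see the fi_provider / fabric(7) man pages)
export FI_LOG_LEVEL=debug     # maximum log verbosity
export FI_PROVIDER=gni        # restrict provider selection to gni
/apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_pingpong
```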


nid00023:/scratch/snx3000/biddisco/libfabric ((tags/v1.10.1^0))$ /apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_pingpong
libfabric:69550:core:core:fi_param_define_():231<debug> registered var perf_cntr
libfabric:69550:core:core:fi_param_get_():280<info> variable perf_cntr=<not set>
libfabric:69550:core:core:fi_param_define_():231<debug> registered var hook
libfabric:69550:core:core:fi_param_get_():280<info> variable hook=<not set>
libfabric:69550:core:core:fi_param_define_():231<debug> registered var mr_cache_max_size
libfabric:69550:core:core:fi_param_define_():231<debug> registered var mr_cache_max_count
libfabric:69550:core:core:fi_param_define_():231<debug> registered var mr_cache_monitor
libfabric:69550:core:core:fi_param_get_():280<info> variable mr_cache_max_size=<not set>
libfabric:69550:core:core:fi_param_get_():280<info> variable mr_cache_max_count=<not set>
libfabric:69550:core:core:fi_param_get_():280<info> variable mr_cache_monitor=<not set>
libfabric:69550:core:mr:ofi_default_cache_size():56<info> default cache size=468306659
libfabric:69550:core:core:fi_param_define_():231<debug> registered var provider
libfabric:69550:core:core:fi_param_define_():231<debug> registered var fork_unsafe
libfabric:69550:core:core:fi_param_define_():231<debug> registered var universe_size
libfabric:69550:core:core:fi_param_get_():280<info> variable provider=<not set>
libfabric:69550:core:core:fi_param_define_():231<debug> registered var provider_path
libfabric:69550:core:core:fi_param_get_():280<info> variable provider_path=<not set>
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:gni:fabric:__gnix_ccm_init():171<debug> [69550:1] Reading job info file /tmp/ccm_alps_info
libfabric:69550:gni:fabric:__gnix_alps_init():284<warn> [69550:1] lli get response failed, alps_status=4(No such file or directory)
libfabric:69550:gni:fabric:_gnix_nics_per_rank():672<warn> [69550:1] __gnix_app_init() failed, ret=-5(No such file or directory)
libfabric:69550:gni:fabric:_gnix_nic_init():1414<warn> [69550:1] _gnix_nics_per_rank failed: -5
libfabric:69550:core:core:ofi_register_provider():402<info> registering provider: gni (1.1)
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:69550:core:core:ofi_register_provider():402<info> registering provider: ofi_hook_debug (110.10)
libfabric:69550:core:core:ofi_register_provider():402<info> registering provider: ofi_hook_noop (110.10)
libfabric:69550:gni:fabric:_gnix_ep_getinfo():457<trace> [69550:1]
libfabric:69550:gni:fabric:_gnix_ep_getinfo():457<trace> [69550:1]
libfabric:69550:gni:fabric:_gnix_ep_getinfo():507<debug> [69550:1] Passed EP attributes check
libfabric:69550:gni:fabric:_gnix_ep_getinfo():522<debug> [69550:1] Passed mode check
libfabric:69550:gni:fabric:_gnix_ep_getinfo():532<debug> [69550:1] Passed caps check gnix_info->caps = 0x0f1c000000313f1e
libfabric:69550:gni:fabric:_gnix_ep_getinfo():547<debug> [69550:1] Passed TX attributes check
libfabric:69550:gni:fabric:_gnix_ep_getinfo():565<debug> [69550:1] Passed fabric name check
libfabric:69550:gni:fabric:__gnix_getinfo_resolve_node():417<info> [69550:1] node: (null) service: (null)
libfabric:69550:gni:fabric:__gnix_getinfo_resolve_node():422<info> [69550:1] src_pe: 0x17 src_port: 0x0
libfabric:69550:gni:fabric:_gnix_ep_getinfo():658<debug> [69550:1] Passed the domain attributes check
libfabric:69550:gni:fabric:_gnix_ep_getinfo():677<debug> [69550:1] Returning EP type: FI_EP_DGRAM
libfabric:69550:gni:fabric:_gnix_ep_getinfo():457<trace> [69550:1]
libfabric:69550:core:core:fi_getinfo_():967<debug> fi_getinfo: provider gni returned success
libfabric:69550:gni:core:_gnix_ref_init():254<debug> [69550:1] 0x6111b8 refs 1
libfabric:69550:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:69550:gni:eq:gnix_eq_open():380<trace> [69550:1]
libfabric:69550:gni:eq:gnix_verify_eq_attr():103<trace> [69550:1]
libfabric:69550:gni:core:_gnix_ref_init():254<debug> [69550:1] 0x6165c8 refs 1
libfabric:69550:gni:core:gnix_eq_open():398<debug> [69550:1] 0x6111b8 refs 2
libfabric:69550:gni:eq:gnix_eq_set_wait():76<trace> [69550:1]
libfabric:69550:gni:eq:gnix_wait_open():536<trace> [69550:1]
libfabric:69550:gni:eq:gnix_verify_wait_attr():367<trace> [69550:1]
libfabric:69550:gni:eq:gnix_init_wait_obj():387<trace> [69550:1]
libfabric:69550:gni:core:gnix_wait_open():564<debug> [69550:1] 0x6111b8 refs 3
libfabric:69550:gni:ep_ctrl:__gnix_wait_start_progress():175<trace> [69550:1]
libfabric:69550:gni:ep_ctrl:__gnix_wait_start_progress():179<trace> [69550:1]
libfabric:69550:gni:fabric:gnix_write_proc_job():528<warn> [69550:1] write(disable_affinity_apply) failed, errno=Invalid argument
libfabric:69550:gni:eq:__gnix_wait_start_progress():185<warn> [69550:1] _gnix_job_disable call returned -22
libfabric:69550:gni:ep_ctrl:__gnix_wait_nic_prog_thread_fn():72<trace> [69550:2]
libfabric:69550:gni:domain:gnix_domain_open():579<trace> [69550:1]
libfabric:69550:gni:fabric:gnix_domain_open():591<info> [69550:1] failed to find authorization key, creating new authorization key
libfabric:69550:gni:fabric:__gnix_ccm_init():171<debug> [69550:1] Reading job info file /tmp/ccm_alps_info
libfabric:69550:gni:fabric:__gnix_alps_init():284<warn> [69550:1] lli get response failed, alps_status=4(No such file or directory)
libfabric:69550:gni:fabric:gnixu_get_rdma_credentials():437<warn> [69550:1] __gnix_app_init() failed, ret=-5(No such file or directory)
libfabric:69550:gni:domain:_gnix_auth_key_enable():347<info> [69550:1] pkey=00002aaa ptag=171 key_partition_size=0 key_offset=0 enabled
libfabric:69550:gni:domain:gnix_domain_open():597<info> [69550:1] authorization key=0x619870 ptag 171 cookie 0x2aaa
libfabric:69550:gni:core:gnix_domain_open():652<debug> [69550:1] 0x6111b8 refs 4
libfabric:69550:gni:core:_gnix_ref_init():254<debug> [69550:1] 0x6199b0 refs 1
libfabric:69550:gni:mr:_gnix_auth_key_enable():354<debug> [69550:1] authorization key already enabled, auth_key=0x619870
libfabric:69550:gni:mr:_gnix_mr_reg():222<trace> [69550:1]
libfabric:69550:gni:mr:_gnix_mr_reg():224<info> [69550:1] reg: buf=0x2aaaadbd5000 len=12587008
libfabric:69550:gni:mr:_gnix_mr_cache_init():998<trace> [69550:1]
libfabric:69550:gni:mr:_gnix_mr_cache_init():998<trace> [69550:1]
libfabric:69550:gni:mr:_gnix_mr_cache_register():1541<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_search_inuse():1205<debug> [69550:1] could not find key in inuse, key=2aaaadbd5000:c01000
libfabric:69550:gni:mr:__gnix_register_region():692<debug> [69550:1] addr 0x2aaaadbd5000 len 12587008 flags 0x0
libfabric:69550:gni:ep_ctrl:gnix_nic_alloc():954<trace> [69550:1]
libfabric:69550:gni:ep_ctrl:gnix_nic_alloc():1059<warn> [69550:1] GNI_CdmAttach returned GNI_RC_INVALID_PARAM
libfabric:69550:gni:fabric:_gnix_dump_gni_res():729<warn> [69550:1] Device Resources:
dev res:       MDD, avail: 4089 res: 409 held: 0 total: 4095
dev res:        CQ, avail: 2042 res: 10 held: 0 total: 2047
dev res:       FMA, avail: 126 res: 4 held: 0 total: 127
dev res:        CE, avail: 4 res: 0 held: 0 total: 4
dev res:       DLA, avail: 16384 res: 1024 held: 0 total: 16384
dev res:       TCR, avail: 64984 res: 0 held: 0 total: 16
dev res:       DVA, avail: 4398046511104 res: 1099511627776 held: 0 total: 4398046511104
dev res:      VMDH, avail: 4 res: 0 held: 0 total: 4
libfabric:69550:gni:fabric:_gnix_dump_gni_res():745<warn> [69550:1] Job Resources:
libfabric:69550:gni:mr:__gnix_generic_register():609<info> [69550:1] could not allocate nic to do mr_reg, ret=-22
libfabric:69550:gni:mr:__mr_cache_create_registration():1465<info> [69550:1] failed to register memory with callback
fi_mr_reg(): util/pingpong.c:1329, ret=-12 (Cannot allocate memory)
libfabric:69550:gni:eq:gnix_eq_close():452<trace> [69550:1]
libfabric:69550:gni:core:gnix_eq_close():459<debug> [69550:1] 0x6165c8 refs 0
libfabric:69550:gni:core:__eq_destruct():243<debug> [69550:1] 0x6111b8 refs 3
libfabric:69550:gni:eq:gnix_wait_close():505<trace> [69550:1]
libfabric:69550:gni:core:gnix_wait_close():520<debug> [69550:1] 0x6111b8 refs 2
libfabric:69550:gni:ep_ctrl:__gnix_wait_stop_progress():201<trace> [69550:1]
libfabric:69550:gni:domain:gnix_domain_close():218<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1109<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1111<debug> [69550:1] starting flush on memory registration cache
libfabric:69550:gni:mr:__mr_cache_flush():1109<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1111<debug> [69550:1] starting flush on memory registration cache
libfabric:69550:gni:core:gnix_domain_close():265<debug> [69550:1] 0x6199b0 refs 0
libfabric:69550:gni:domain:__domain_destruct():77<trace> [69550:1]
libfabric:69550:gni:mr:_gnix_mr_cache_destroy():1071<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1109<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1111<debug> [69550:1] starting flush on memory registration cache
libfabric:69550:gni:mr:_gnix_mr_cache_destroy():1071<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1109<trace> [69550:1]
libfabric:69550:gni:mr:__mr_cache_flush():1111<debug> [69550:1] starting flush on memory registration cache
libfabric:69550:gni:core:__domain_destruct():103<debug> [69550:1] 0x6111b8 refs 1
libfabric:69550:gni:domain:gnix_domain_close():274<info> [69550:1] gnix_domain_close invoked returning 0
libfabric:69550:gni:core:gnix_fabric_close():194<debug> [69550:1] 0x6111b8 refs 0

JB

________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: 18 June 2020 00:00:30
To: Carns, Philip H.; Howard Pritchard
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain


Phil - thanks for this info.  I will experiment with it. I'm actually away for a few days from tomorrow, so it'll be next week before I get a chance, but I'll report back if I have success (or more problems).


JB

________________________________
From: Carns, Philip H. <carns at mcs.anl.gov>
Sent: 17 June 2020 19:52:42
To: Biddiscombe, John A.; Howard Pritchard
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain

Hi John,

I know your question is aimed at Howard, but I can offer another data point and an example of a software stack working around this.  I've never gotten kdreg to work in executables that are also using Cray's MPI; they conflict.  If you want to use udreg as an alternative, then you'll need to do two things:

a) disable kdreg support in libfabric at build time (as in this spack package here: https://github.com/spack/spack/blob/develop/var/spack/repos/builtin/packages/libfabric/package.py#L94)

b) explicitly enable and configure udreg outside of libfabric (as in the Mercury libfabric plugin here: https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L1778)

This configuration is stable for us and works fine whether Cray MPI is present or not.  I'll defer to Howard about the technical implications, though 🙂

thanks,
-Phil
________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: Wednesday, June 17, 2020 1:32 PM
To: Howard Pritchard <hppritcha at gmail.com>
Cc: libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] Error allocating domain


Howard


From the phrasing "You are hitting a limitation with the ancient kdreg device driver.  It may be best to not use it for your libfabric app.", is there anything I can do about it? I can see that there is a udreg directory in /opt/cray; is there anything I can replace the kdreg stuff with?


Thanks


JB


________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: 17 June 2020 17:26:29
To: Howard Pritchard
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain


my config line has always been this (apart from the debug). It has worked for several years, until a recent system maintenance, a change, or something of that kind. (Nobody here claims to have changed anything significant.)


./configure --disable-verbs --disable-sockets --disable-usnic --disable-udp --disable-rxm --disable-rxd --disable-shm --disable-mrail --disable-tcp --disable-perf --disable-rstream --enable-gni --prefix=/apps/daint/UES/biddisco/gcc/8.3.0/libfabric CC=/opt/cray/pe/craype/default/bin/cc CFLAGS=-fPIC LDFLAGS=-ldl --no-recursion --enable-debug


JB

________________________________
From: Howard Pritchard <hppritcha at gmail.com>
Sent: 17 June 2020 17:20:21
To: Biddiscombe, John A.
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain

Hi John,

You are hitting a limitation with the ancient kdreg device driver.  It may be best to not use it for your libfabric app.  What are the configure options you're using for building libfabric?

Howard


Am Di., 16. Juni 2020 um 10:34 Uhr schrieb Biddiscombe, John A. <biddisco at cscs.ch<mailto:biddisco at cscs.ch>>:

I've got this log when I dump out my own messages and also enable debugging in libfabric; can anyone tell what's wrong from the messages? Code that used to work seems to have stopped. I upgraded to the libfabric 1.10.1 tag and rebuilt, but it didn't change anything.

The only thing that springs to mind is that the application is also using MPI on the Cray at the same time, so by the time this code is called, MPI_Init will already have been called, and perhaps the NIC is somehow inaccessible, hence the error. I'm sure it used to work. If I use ranks = 1 it runs, so perhaps MPI detects just one rank and does no network initialization, but when I use N > 1 ranks, it dies. Any suggestions welcome. Thanks

JB


<DEB> 0000056511 0x2aaaaab2dec0 cpu 000 nid00219(0)   CONTROL Allocating domain
libfabric:69061:gni:core:_gnix_ref_init():254<debug> [69061:1] 0x8579d8 refs 1
libfabric:69061:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:69061:gni:domain:gnix_domain_open():579<trace> [69061:1]
libfabric:69061:gni:fabric:gnix_domain_open():591<info> [69061:1] failed to find authorization key, creating new authorization key
libfabric:69061:gni:domain:_gnix_auth_key_enable():347<info> [69061:1] pkey=dd920000 ptag=14 key_partition_size=0 key_offset=0 enabled
libfabric:69061:gni:domain:gnix_domain_open():597<info> [69061:1] authorization key=0x857a10 ptag 14 cookie 0xdd920000
libfabric:69061:gni:mr:_gnix_notifier_open():88<warn> [69061:1] kdreg device open failed: Device or resource busy
<ERR> 0000056576 0x2aaaaab2dec0 cpu 000 nid00219(0)   ERROR__ fi_domain : Device or resource busy


_______________________________________________
Libfabric-users mailing list
Libfabric-users at lists.openfabrics.org<mailto:Libfabric-users at lists.openfabrics.org>
https://lists.openfabrics.org/mailman/listinfo/libfabric-users