[libfabric-users] Error allocating domain

Biddiscombe, John A. biddisco at cscs.ch
Tue Jun 16 10:04:30 PDT 2020


Not sure if it's connected, but if I ssh into two compute nodes and run fi_pingpong on them, the client dies with:


/apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_pingpong 148.187.32.221
[error] /apps/daint/UES/biddisco/src/libfabric-cray/util/pingpong.c:521 : ctrl/read: no data or remote connection closed


and the server dies with:

nid00220:~$ /apps/daint/UES/biddisco/gcc/8.3.0/libfabric/bin/fi_pingpong
libfabric:55223:core:core:fi_param_define_():231<debug> registered var perf_cntr
libfabric:55223:core:core:fi_param_get_():280<info> variable perf_cntr=<not set>
libfabric:55223:core:core:fi_param_define_():231<debug> registered var hook
libfabric:55223:core:core:fi_param_get_():280<info> variable hook=<not set>
libfabric:55223:core:core:fi_param_define_():231<debug> registered var mr_cache_max_size
libfabric:55223:core:core:fi_param_define_():231<debug> registered var mr_cache_max_count
libfabric:55223:core:core:fi_param_define_():231<debug> registered var mr_cache_monitor
libfabric:55223:core:core:fi_param_get_():280<info> variable mr_cache_max_size=<not set>
libfabric:55223:core:core:fi_param_get_():280<info> variable mr_cache_max_count=<not set>
libfabric:55223:core:core:fi_param_get_():280<info> variable mr_cache_monitor=<not set>
libfabric:55223:core:mr:ofi_default_cache_size():56<info> default cache size=468306659
libfabric:55223:core:core:fi_param_define_():231<debug> registered var provider
libfabric:55223:core:core:fi_param_define_():231<debug> registered var fork_unsafe
libfabric:55223:core:core:fi_param_define_():231<debug> registered var universe_size
libfabric:55223:core:core:fi_param_get_():280<info> variable provider=<not set>
libfabric:55223:core:core:fi_param_define_():231<debug> registered var provider_path
libfabric:55223:core:core:fi_param_get_():280<info> variable provider_path=<not set>
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:gni:fabric:__gnix_ccm_init():171<debug> [55223:1] Reading job info file /tmp/ccm_alps_info
libfabric:55223:gni:fabric:__gnix_alps_init():284<warn> [55223:1] lli get response failed, alps_status=4(No such file or directory)
libfabric:55223:gni:fabric:_gnix_nics_per_rank():672<warn> [55223:1] __gnix_app_init() failed, ret=-5(No such file or directory)
libfabric:55223:gni:fabric:_gnix_nic_init():1414<warn> [55223:1] _gnix_nics_per_rank failed: -5
libfabric:55223:core:core:ofi_register_provider():402<info> registering provider: gni (1.1)
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():396<debug> no provider structure or name
libfabric:55223:core:core:ofi_register_provider():402<info> registering provider: ofi_hook_debug (110.10)
libfabric:55223:core:core:ofi_register_provider():402<info> registering provider: ofi_hook_noop (110.10)
libfabric:55223:gni:fabric:_gnix_ep_getinfo():457<trace> [55223:1]
libfabric:55223:gni:fabric:_gnix_ep_getinfo():457<trace> [55223:1]
libfabric:55223:gni:fabric:_gnix_ep_getinfo():507<debug> [55223:1] Passed EP attributes check
libfabric:55223:gni:fabric:_gnix_ep_getinfo():522<debug> [55223:1] Passed mode check
libfabric:55223:gni:fabric:_gnix_ep_getinfo():532<debug> [55223:1] Passed caps check gnix_info->caps = 0x0f1c000000313f1e
libfabric:55223:gni:fabric:_gnix_ep_getinfo():547<debug> [55223:1] Passed TX attributes check
libfabric:55223:gni:fabric:_gnix_ep_getinfo():565<debug> [55223:1] Passed fabric name check
libfabric:55223:gni:fabric:__gnix_getinfo_resolve_node():417<info> [55223:1] node: (null) service: (null)
libfabric:55223:gni:fabric:__gnix_getinfo_resolve_node():422<info> [55223:1] src_pe: 0xdc src_port: 0x0
libfabric:55223:gni:fabric:_gnix_ep_getinfo():658<debug> [55223:1] Passed the domain attributes check
libfabric:55223:gni:fabric:_gnix_ep_getinfo():677<debug> [55223:1] Returning EP type: FI_EP_DGRAM
libfabric:55223:gni:fabric:_gnix_ep_getinfo():457<trace> [55223:1]
libfabric:55223:core:core:fi_getinfo_():967<debug> fi_getinfo: provider gni returned success
libfabric:55223:gni:core:_gnix_ref_init():254<debug> [55223:1] 0x616e08 refs 1
libfabric:55223:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:55223:gni:eq:gnix_eq_open():380<trace> [55223:1]
libfabric:55223:gni:eq:gnix_verify_eq_attr():103<trace> [55223:1]
libfabric:55223:gni:core:_gnix_ref_init():254<debug> [55223:1] 0x616eb8 refs 1
libfabric:55223:gni:core:gnix_eq_open():398<debug> [55223:1] 0x616e08 refs 2
libfabric:55223:gni:eq:gnix_eq_set_wait():76<trace> [55223:1]
libfabric:55223:gni:eq:gnix_wait_open():536<trace> [55223:1]
libfabric:55223:gni:eq:gnix_verify_wait_attr():367<trace> [55223:1]
libfabric:55223:gni:eq:gnix_init_wait_obj():387<trace> [55223:1]
libfabric:55223:gni:core:gnix_wait_open():564<debug> [55223:1] 0x616e08 refs 3
libfabric:55223:gni:ep_ctrl:__gnix_wait_start_progress():175<trace> [55223:1]
libfabric:55223:gni:ep_ctrl:__gnix_wait_start_progress():179<trace> [55223:1]
libfabric:55223:gni:fabric:gnix_write_proc_job():528<warn> [55223:1] write(disable_affinity_apply) failed, errno=Invalid argument
libfabric:55223:gni:eq:__gnix_wait_start_progress():185<warn> [55223:1] _gnix_job_disable call returned -22
libfabric:55223:gni:domain:gnix_domain_open():579<trace> [55223:1]
libfabric:55223:gni:fabric:gnix_domain_open():591<info> [55223:1] failed to find authorization key, creating new authorization key
libfabric:55223:gni:ep_ctrl:__gnix_wait_nic_prog_thread_fn():72<trace> [55223:2]
libfabric:55223:gni:fabric:__gnix_ccm_init():171<debug> [55223:1] Reading job info file /tmp/ccm_alps_info
libfabric:55223:gni:fabric:__gnix_alps_init():284<warn> [55223:1] lli get response failed, alps_status=4(No such file or directory)
libfabric:55223:gni:fabric:gnixu_get_rdma_credentials():437<warn> [55223:1] __gnix_app_init() failed, ret=-5(No such file or directory)
libfabric:55223:gni:domain:_gnix_auth_key_enable():347<info> [55223:1] pkey=00002aaa ptag=171 key_partition_size=0 key_offset=0 enabled
libfabric:55223:gni:domain:gnix_domain_open():597<info> [55223:1] authorization key=0x61a1a0 ptag 171 cookie 0x2aaa
libfabric:55223:gni:core:gnix_domain_open():652<debug> [55223:1] 0x616e08 refs 4
libfabric:55223:gni:core:_gnix_ref_init():254<debug> [55223:1] 0x61a2e0 refs 1
libfabric:55223:gni:mr:_gnix_auth_key_enable():354<debug> [55223:1] authorization key already enabled, auth_key=0x61a1a0
libfabric:55223:gni:mr:_gnix_mr_reg():222<trace> [55223:1]
libfabric:55223:gni:mr:_gnix_mr_reg():224<info> [55223:1] reg: buf=0x2aaaadfe7000 len=12587008
libfabric:55223:gni:mr:_gnix_mr_cache_init():998<trace> [55223:1]
libfabric:55223:gni:mr:_gnix_mr_cache_init():998<trace> [55223:1]
libfabric:55223:gni:mr:_gnix_mr_cache_register():1541<trace> [55223:1]
libfabric:55223:gni:mr:_gnix_notifier_get_event():270<debug> [55223:1] nothing to read from kdreg :(
libfabric:55223:gni:fabric:__gnix_smrn_read_events():139<debug> [55223:1] no more events to be read
libfabric:55223:gni:mr:__mr_cache_search_inuse():1205<debug> [55223:1] could not find key in inuse, key=2aaaadfe7000:c01000
libfabric:55223:gni:mr:_gnix_notifier_get_event():270<debug> [55223:1] nothing to read from kdreg :(
libfabric:55223:gni:fabric:__gnix_smrn_read_events():139<debug> [55223:1] no more events to be read
libfabric:55223:gni:mr:_gnix_notifier_get_event():270<debug> [55223:1] nothing to read from kdreg :(
libfabric:55223:gni:fabric:__gnix_smrn_read_events():139<debug> [55223:1] no more events to be read
libfabric:55223:gni:mr:__mr_cache_search_stale():1335<debug> [55223:1] searching for stale entry, key=2aaaadfe7000:c01000
libfabric:55223:gni:mr:__gnix_register_region():692<debug> [55223:1] addr 0x2aaaadfe7000 len 12587008 flags 0x0
libfabric:55223:gni:ep_ctrl:gnix_nic_alloc():954<trace> [55223:1]
libfabric:55223:gni:ep_ctrl:gnix_nic_alloc():1059<warn> [55223:1] GNI_CdmAttach returned GNI_RC_INVALID_PARAM
libfabric:55223:gni:fabric:_gnix_dump_gni_res():729<warn> [55223:1] Device Resources:
dev res:       MDD, avail: 4089 res: 409 held: 0 total: 4095
dev res:        CQ, avail: 2042 res: 10 held: 0 total: 2047
dev res:       FMA, avail: 126 res: 4 held: 0 total: 127
dev res:        CE, avail: 4 res: 0 held: 0 total: 4
dev res:       DLA, avail: 16384 res: 1024 held: 0 total: 16384
dev res:       TCR, avail: 65292 res: 0 held: 0 total: 16
dev res:       DVA, avail: 4398046511104 res: 1099511627776 held: 0 total: 4398046511104
dev res:      VMDH, avail: 4 res: 0 held: 0 total: 4
libfabric:55223:gni:fabric:_gnix_dump_gni_res():745<warn> [55223:1] Job Resources:
libfabric:55223:gni:mr:__gnix_generic_register():609<info> [55223:1] could not allocate nic to do mr_reg, ret=-22
libfabric:55223:gni:mr:__mr_cache_create_registration():1465<info> [55223:1] failed to register memory with callback
fi_mr_reg(): /apps/daint/UES/biddisco/src/libfabric-cray/util/pingpong.c:1329, ret=-12 (Cannot allocate memory)
libfabric:55223:gni:eq:gnix_eq_close():452<trace> [55223:1]
libfabric:55223:gni:core:gnix_eq_close():459<debug> [55223:1] 0x616eb8 refs 0
libfabric:55223:gni:core:__eq_destruct():243<debug> [55223:1] 0x616e08 refs 3
libfabric:55223:gni:eq:gnix_wait_close():505<trace> [55223:1]
libfabric:55223:gni:core:gnix_wait_close():520<debug> [55223:1] 0x616e08 refs 2
libfabric:55223:gni:ep_ctrl:__gnix_wait_stop_progress():201<trace> [55223:1]
libfabric:55223:gni:domain:gnix_domain_close():218<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1109<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1111<debug> [55223:1] starting flush on memory registration cache
libfabric:55223:gni:mr:__mr_cache_flush():1155<debug> [55223:1] flushed 0 of 0 entries from memory registration cache
libfabric:55223:gni:mr:__mr_cache_flush():1109<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1111<debug> [55223:1] starting flush on memory registration cache
libfabric:55223:gni:mr:__mr_cache_flush():1155<debug> [55223:1] flushed 0 of 0 entries from memory registration cache
libfabric:55223:gni:core:gnix_domain_close():265<debug> [55223:1] 0x61a2e0 refs 0
libfabric:55223:gni:domain:__domain_destruct():77<trace> [55223:1]
libfabric:55223:gni:mr:_gnix_mr_cache_destroy():1071<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1109<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1111<debug> [55223:1] starting flush on memory registration cache
libfabric:55223:gni:mr:__mr_cache_flush():1155<debug> [55223:1] flushed 0 of 0 entries from memory registration cache
libfabric:55223:gni:mr:_gnix_mr_cache_destroy():1071<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1109<trace> [55223:1]
libfabric:55223:gni:mr:__mr_cache_flush():1111<debug> [55223:1] starting flush on memory registration cache
libfabric:55223:gni:mr:__mr_cache_flush():1155<debug> [55223:1] flushed 0 of 0 entries from memory registration cache
libfabric:55223:gni:core:__domain_destruct():103<debug> [55223:1] 0x616e08 refs 1
libfabric:55223:gni:domain:gnix_domain_close():274<info> [55223:1] gnix_domain_close invoked returning 0
libfabric:55223:gni:core:gnix_fabric_close():194<debug> [55223:1] 0x616e08 refs 0
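
Judging by the text printed next to them, the numeric codes in the log above (-22 alongside "Invalid argument", -12 alongside "Cannot allocate memory") look like negated errno values. A minimal sketch for decoding them, assuming that convention holds; decode_ret is a hypothetical helper, not part of libfabric:

```python
import errno
import os


def decode_ret(ret):
    """Map an errno-style return value (negative or positive) to (symbol, text).

    Assumption: the value is a plain or negated POSIX errno number.
    """
    code = abs(ret)
    symbol = errno.errorcode.get(code, "UNKNOWN")
    return symbol, os.strerror(code)


# Codes taken from the log: _gnix_job_disable returned -22, fi_mr_reg returned -12.
for ret in (-22, -12):
    symbol, text = decode_ret(ret)
    print(ret, symbol, text)
```

The description strings come from the local C library, so the exact wording may differ between systems.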

JB


________________________________
From: Libfabric-users <libfabric-users-bounces at lists.openfabrics.org> on behalf of Biddiscombe, John A. <biddisco at cscs.ch>
Sent: 16 June 2020 18:33:53
To: libfabric-users at lists.openfabrics.org
Subject: [libfabric-users] Error allocating domain


I get this log when I dump my own messages and also enable debugging in libfabric. Can anyone tell what's wrong from the messages? Code that used to work seems to have stopped working. I upgraded to the libfabric 1.10.1 tag and rebuilt, but nothing changed.

The only thing that springs to mind is that the application is also using MPI on the Cray at the same time. By the time this code is called, MPI_Init has already been called, so perhaps the NIC is somehow inaccessible, hence the error. I'm sure it used to work, and it still runs if I use a single rank, so perhaps MPI detects just one rank and does no initialization. With N>1 ranks, it dies. Any suggestions welcome. Thanks

JB


<DEB> 0000056511 0x2aaaaab2dec0 cpu 000 nid00219(0)   CONTROL Allocating domain
libfabric:69061:gni:core:_gnix_ref_init():254<debug> [69061:1] 0x8579d8 refs 1
libfabric:69061:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:69061:gni:domain:gnix_domain_open():579<trace> [69061:1]
libfabric:69061:gni:fabric:gnix_domain_open():591<info> [69061:1] failed to find authorization key, creating new authorization key
libfabric:69061:gni:domain:_gnix_auth_key_enable():347<info> [69061:1] pkey=dd920000 ptag=14 key_partition_size=0 key_offset=0 enabled
libfabric:69061:gni:domain:gnix_domain_open():597<info> [69061:1] authorization key=0x857a10 ptag 14 cookie 0xdd920000
libfabric:69061:gni:mr:_gnix_notifier_open():88<warn> [69061:1] kdreg device open failed: Device or resource busy
<ERR> 0000056576 0x2aaaaab2dec0 cpu 000 nid00219(0)   ERROR__ fi_domain : Device or resource busy
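
With debug logging enabled, the one line that matters is easy to lose among the trace and debug output. A minimal sketch for pulling out only the warnings, assuming the line layout seen in the log above (libfabric:pid:provider:subsystem:function():line<level> message); LOG_LINE and warnings_only are hypothetical names, and lines in any other layout are skipped:

```python
import re

# Layout assumed from the log lines in this message:
# libfabric:<pid>:<provider>:<subsystem>:<function>():<line><level> message
LOG_LINE = re.compile(
    r"^libfabric:(?P<pid>\d+):(?P<prov>\w+):(?P<subsys>\w+):"
    r"(?P<func>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<msg>.*)$"
)


def warnings_only(lines, levels=("warn",)):
    """Return (function, message) pairs for log lines at the given levels."""
    hits = []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") in levels:
            hits.append((match.group("func"), match.group("msg")))
    return hits


sample = [
    "libfabric:69061:core:core:fi_fabric_():1154<info> Opened fabric: gni",
    "libfabric:69061:gni:mr:_gnix_notifier_open():88<warn> [69061:1] "
    "kdreg device open failed: Device or resource busy",
    "<ERR> 0000056576 0x2aaaaab2dec0 cpu 000 nid00219(0)   ERROR__ fi_domain : "
    "Device or resource busy",
]
for func, msg in warnings_only(sample):
    print(func, "->", msg)
```

Passing levels=("warn", "info") widens the filter when the warnings alone do not give enough context.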
