[libfabric-users] Error allocating domain

Biddiscombe, John A. biddisco at cscs.ch
Wed Jun 17 08:26:29 PDT 2020


My configure line has always been the following (apart from the debug flag). It worked for several years until a recent system maintenance change or something of that kind. (Nobody here claims to have changed anything significant.)


./configure --disable-verbs --disable-sockets --disable-usnic --disable-udp --disable-rxm --disable-rxd --disable-shm --disable-mrail --disable-tcp --disable-perf --disable-rstream --enable-gni --prefix=/apps/daint/UES/biddisco/gcc/8.3.0/libfabric CC=/opt/cray/pe/craype/default/bin/cc CFLAGS=-fPIC LDFLAGS=-ldl --no-recursion --enable-debug


JB

________________________________
From: Howard Pritchard <hppritcha at gmail.com>
Sent: 17 June 2020 17:20:21
To: Biddiscombe, John A.
Cc: libfabric-users at lists.openfabrics.org
Subject: Re: [libfabric-users] Error allocating domain

Hi John,

You are hitting a limitation with the ancient kdreg device driver.  It may be best to not use it for your libfabric app.  What are the configure options you're using for building libfabric?

Howard


On Tue, 16 June 2020 at 10:34, Biddiscombe, John A. <biddisco at cscs.ch> wrote:

I get this log when I dump my own messages and also enable debugging in libfabric. Can anyone tell what's wrong from the message? Code that used to work seems to have stopped. I upgraded to the libfabric 1.10.1 tag and rebuilt, but nothing changed.

The only thing that springs to mind is that the application also uses MPI on the Cray at the same time, so by the time this code is called, mpi_init has already been called, and perhaps the NIC is somehow inaccessible, hence the error. I'm sure it used to work. If I use ranks = 1, it runs, so perhaps MPI detects just one rank and does no initialization, but with N>1 ranks it dies. Any suggestions welcome. Thanks
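
One way to narrow down a "Device or resource busy" failure is to try opening the device node directly and report the errno. The following is only a minimal sketch: the path `/dev/kdreg` and the open flags are assumptions, and `probe_device` is a hypothetical helper, not part of libfabric. Run from a separate process, it only shows whether the node exists and can be opened at all; testing the MPI hypothesis would require calling it inside the application process after MPI initialization.

```python
import errno
import os

# Assumption: the device node path; it does not appear in the log.
KDREG_PATH = "/dev/kdreg"


def probe_device(path=KDREG_PATH):
    """Try to open `path`; return "OK" or the symbolic errno name."""
    try:
        # Assumption: read/write, non-blocking open flags.
        fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    except OSError as exc:
        # Map the numeric errno to its name, e.g. EBUSY or ENOENT.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    os.close(fd)
    return "OK"
```

A result of `EBUSY` would be consistent with the device already being held open; `ENOENT` would mean the assumed path is wrong on that system.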

JB


<DEB> 0000056511 0x2aaaaab2dec0 cpu 000 nid00219(0)   CONTROL Allocating domain
libfabric:69061:gni:core:_gnix_ref_init():254<debug> [69061:1] 0x8579d8 refs 1
libfabric:69061:core:core:fi_fabric_():1154<info> Opened fabric: gni
libfabric:69061:gni:domain:gnix_domain_open():579<trace> [69061:1]
libfabric:69061:gni:fabric:gnix_domain_open():591<info> [69061:1] failed to find authorization key, creating new authorization key
libfabric:69061:gni:domain:_gnix_auth_key_enable():347<info> [69061:1] pkey=dd920000 ptag=14 key_partition_size=0 key_offset=0 enabled
libfabric:69061:gni:domain:gnix_domain_open():597<info> [69061:1] authorization key=0x857a10 ptag 14 cookie 0xdd920000
libfabric:69061:gni:mr:_gnix_notifier_open():88<warn> [69061:1] kdreg device open failed: Device or resource busy
<ERR> 0000056576 0x2aaaaab2dec0 cpu 000 nid00219(0)   ERROR__ fi_domain : Device or resource busy
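
With debug logging enabled the output gets long, so it can help to filter it down to the warning lines. Below is a minimal sketch of such a filter. The line layout (`libfabric:<pid>:<provider>:<subsystem>:<function>():<line><level> message`) is inferred from the lines above, not taken from libfabric documentation, and `parse_line` and `filter_levels` are hypothetical helper names.

```python
import re

# Layout inferred from the log above (an assumption, not a documented format):
# libfabric:<pid>:<provider>:<subsystem>:<function>():<line><level> message
LINE_RE = re.compile(
    r"^libfabric:(?P<pid>\d+):(?P<provider>[^:]+):(?P<subsystem>[^:]+):"
    r"(?P<function>[^(:]+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<message>.*)$"
)


def parse_line(text):
    """Return a dict of fields for a libfabric log line, or None."""
    match = LINE_RE.match(text.strip())
    return match.groupdict() if match else None


def filter_levels(lines, levels=("warn",)):
    """Keep only parsed libfabric lines whose level is in `levels`."""
    parsed = (parse_line(line) for line in lines)
    return [rec for rec in parsed if rec and rec["level"] in levels]
```

Applied to the log above, `filter_levels` keeps only the `_gnix_notifier_open()` line reporting the failed kdreg open; lines that do not start with `libfabric:` (such as the application's own `<DEB>` and `<ERR>` messages) are skipped.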


_______________________________________________
Libfabric-users mailing list
Libfabric-users at lists.openfabrics.org
https://lists.openfabrics.org/mailman/listinfo/libfabric-users