[libfabric-users] GNI: "invalid argument" when more than one client on node
Latham, Robert J.
robl at mcs.anl.gov
Sat Sep 14 12:51:44 PDT 2019
On Fri, 2019-09-13 at 20:42 +0000, Kevan Rehm wrote:
> Caveat emptor, this information only comes from reading code, not
> actual experience. The following message:
> libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1]
> Could not initialize udreg application cache, urc=2
> means that UDREG_CacheCreate() was called with a
> cache_attr.max_entries value which exceeds the remaining number of
> available kernel kdreg kcache entries. cache_attr.modes also has the
> UDREG_MODE_USE_KERNEL_CACHE bit set, which implies that the symbol
> HAVE_KDREG was defined when libfabric was built, which implies that
> the "--with-kdreg" configuration option was set to something other
> than "no". The value urc=2 corresponds to enum
> UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
> By default gni libfabric picks the value 2048 for
> cache_attr.max_entries, which it gets from
> domain.udreg_reg_limit. The comment associated with the
> initialization of udreg_reg_limit says "we are likely sharing udreg
> entries with Craypich if we're using udreg cache so only ask for half
> the entries by default".
> If you are running multiple processes on the same node, then the
> total of all the entries in use by all the processes must exceed the
> maximum number available in the kernel. Does the first process
> succeed, then the others fail?
Yes, that seems to be what's happening here.
> You can reduce the value of udreg_reg_limit, see the fi_gni man page
> where it talks about the domain symbol
> "GNI_MR_UDREG_LIMIT". Setting it to a lower value like 1024 should
> allow additional processes to succeed. Alternatively you could
> reconfigure libfabric by specifying "--with-kdreg=no" and the problem
> should go away.
Thanks for the guidance.
I'm glad to make it explicit, but my current build's configure.h already has
/* Define to 1 if kdreg available */
#define HAVE_KDREG 0
and explicitly configuring with `--with-kdreg=no` made no difference.
Is there an environment variable for overriding domain symbols at run
time? I just hacked a lower default value of '128' in
I was assuming I could run one libfabric client per core on our
Cray. Is that a bad assumption? Sounds like I might be one of the
first people who has wanted to run 64 libfabric processes per node on a
KNL node, but do you know of other resources I need to watch out for?
The smaller udreg_reg_limit let me run two processes per node. In fact,
I could run 64 processes per node (... and found a bug in something that
was not libfabric).
Thanks for the tip.