On Fri, 2019-09-13 at 20:42 +0000, Kevan Rehm wrote:
> Robert,
> Caveat emptor, this information only comes from reading code, not
> actual experience.  The following message:
> 	libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1]
> Could not initialize udreg application cache, urc=2
> means that UDREG_CacheCreate() was called with a
> cache_attr.max_entries value which exceeds the remaining number of
> available kernel kdreg kcache entries.  cache_attr.modes also has the
> UDREG_MODE_USE_KERNEL_CACHE bit set, which implies that the symbol
> HAVE_KDREG was defined when libfabric was built, which implies that
> the "--with-kdreg" configuration option was set to something other
> than "no".   The value urc=2 corresponds to enum
> UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
> By default gni libfabric picks the value 2048 for
> cache_attr.max_entries, which it gets from
> domain.udreg_reg_limit.   The comment associated with the
> initialization of udreg_reg_limit says "we are likely sharing udreg
> entries with Craypich if we're using udreg cache so only ask for half
> the entries by default".
> If you are running multiple processes on the same node, then the
> total number of entries requested by all the processes is probably
> exceeding the maximum number available in the kernel.   Does the
> first process succeed, and then the others fail?

Yes, that seems to be what's happening here.

> You can reduce the value of udreg_reg_limit, see the fi_gni man page
> where it talks about the domain symbol
> "GNI_MR_UDREG_LIMIT".   Setting it to a lower value like 1024 should
> allow additional processes to succeed.   Alternatively you could
> reconfigure libfabric by specifying "--with-kdreg=no" and the problem
> should go away.

Thanks for the guidance.

I'm glad to make it explicit, but in my current build configure.h
already has

/* Define to 1 if kdreg available */
#define HAVE_KDREG 0

and explicitly configuring with `--with-kdreg=no` made no difference.

Is there an environment variable for overriding domain symbols at run
time?  I just hacked a lower default value of '128' in

I was assuming I could run one libfabric client per core on our
Cray.  Is that a bad assumption?   It sounds like I might be one of the
first people to want to run 64 libfabric processes per node on a
KNL node, but do you know of other resources I need to watch out for?

The smaller udreg_reg_limit let me run two processes per node.  In fact,
I could run 64 processes per node (and found a bug in something that
was not libfabric).
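
For anyone else sizing this, here is a minimal sketch of the arithmetic
as I understand it from the explanation above.  It assumes the kernel
entries form a single node-wide pool that the processes split evenly,
which I have not verified.  The function name and its parameters are
hypothetical and not part of libfabric, and I do not know the real
size of the kernel pool.

```c
/*
 * Hypothetical helper, not part of libfabric or udreg.
 *
 * Assumption: the node has `node_entries` kernel cache entries in total,
 * shared evenly by `procs_per_node` processes.  Returns the per-process
 * registration limit to request, never more than `default_limit`
 * (2048 is the gni provider default mentioned in this thread).
 */
unsigned per_process_udreg_limit(unsigned node_entries,
                                 unsigned procs_per_node,
                                 unsigned default_limit)
{
    unsigned share;

    /* No process count given: fall back to the default. */
    if (procs_per_node == 0)
        return default_limit;

    /* Even split of the node-wide pool (integer division rounds down). */
    share = node_entries / procs_per_node;

    /* Do not ask for more than the default. */
    return share < default_limit ? share : default_limit;
}
```

With a made-up pool of 4096 entries, a single process would keep the
default of 2048, while 64 processes per node would each ask for 64.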

Thanks for the tip.


