[libfabric-users] GNI: "invalid argument" when more than one client on node

Latham, Robert J. robl at mcs.anl.gov
Mon Sep 16 07:52:32 PDT 2019


On Mon, 2019-09-16 at 00:01 +0000, Kevan Rehm wrote:
> Rob,
>
> Hmmm, if your config.h file contains "#define HAVE_KDREG 0", then
> the only other way that I can find that would take your program
> through the code that returns urc=2 is if your application is
> deliberately setting the GNI_MR_CACHE_LAZY_DEREG gni domain variable
> to 1 at runtime sometime just after opening the domain.  Could you
> scan your code, see if you get a hit on this symbol?

Hah, well look at that.  We have indeed been setting this for a couple of years:

https://github.com/mercury-hpc/mercury/blob/master/src/na/na_ofi.c#L1808

I was kind of surprised to see a GNI setting had bubbled up through two abstraction layers, but thanks for the suggestion.

I must have a defective mental model for what's going on here.  I did some experimenting and it turns out I only had to cut the value of udreg_reg_limit in half (1024) to get 64 processes per node (up to three nodes so far) working.

Hopefully this will be the last time you hear from me as I try to scale things up further.  I appreciate the quick responses so far.

==rob
