[libfabric-users] GNI: "invalid argument" when more than one client on node

Kevan Rehm krehm at cray.com
Sun Sep 15 17:01:55 PDT 2019


Rob,

Hmmm, if your config.h file contains "#define HAVE_KDREG 0", then the only other way I can find that would take your program through the code path that returns urc=2 is if your application is deliberately setting the GNI_MR_CACHE_LAZY_DEREG gni domain variable to 1 at runtime, shortly after opening the domain.  Could you scan your code and see if you get a hit on that symbol?

I was going to suggest that perhaps you were linking against the wrong libfabric library, but given that your hack got you further, that is not the problem.

There aren't environment variables for these settings; they are runtime settings that require application code changes, as mentioned in the fi_gni man page.  Look at the routine _set_lazy_deregistration() in prov/gni/test/mr.c for an example of how to set that particular variable at runtime; all the other variables are modified similarly.

I had a typo in my last email: the variable name was supposed to be GNI_MR_UDREG_REG_LIMIT, not GNI_MR_UDREG_LIMIT.  The man page has the same typo; it is missing the additional _REG substring.
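To illustrate the mechanism, here is a rough sketch of lowering GNI_MR_UDREG_REG_LIMIT through the GNI domain ops, as described in the fi_gni man page.  The type names and constants come from rdma/fi_ext_gni.h; this is untested on my end and error handling is minimal, so treat it as a starting point rather than a verified implementation:

```c
#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_ext_gni.h>

/* Call this right after fi_domain() succeeds, before creating
 * endpoints, so the new limit is in effect when the MR cache is
 * first used. */
static int lower_udreg_limit(struct fid_domain *domain, uint32_t limit)
{
	struct fi_gni_ops_domain *gni_ops;
	int ret;

	/* Fetch the provider-specific domain ops by name. */
	ret = fi_open_ops(&domain->fid, FI_GNI_DOMAIN_OPS_1, 0,
			  (void **)&gni_ops, NULL);
	if (ret)
		return ret;

	/* e.g. limit = 1024, half the 2048 default, so a second
	 * process on the node can still get kernel cache entries. */
	return gni_ops->set_val(&domain->fid, GNI_MR_UDREG_REG_LIMIT,
				&limit);
}
```

Setting GNI_MR_CACHE_LAZY_DEREG (the variable mentioned above) works the same way, just with a different dom_ops_val_t constant passed to set_val.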

I'll let Howard respond to your questions on process limits per node; he has much more experience there than I do.

Kevan


On 9/14/19, 2:51 PM, "Latham, Robert J." <robl at mcs.anl.gov> wrote:

    On Fri, 2019-09-13 at 20:42 +0000, Kevan Rehm wrote:
    > Robert,
    > 
    > Caveat emptor, this information only comes from reading code, not
    > actual experience.  The following message:
    > 
    > 	libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1]
    > Could not initialize udreg application cache, urc=2
    > 
    > means that UDREG_CacheCreate() was called with a
    > cache_attr.max_entries value which exceeds the remaining number of
    > available kernel kdreg kcache entries.  cache_attr.modes also has the
    > UDREG_MODE_USE_KERNEL_CACHE bit set, which implies that the symbol
    > HAVE_KDREG was defined when libfabric was built, which implies that
    > the "--with-kdreg" configuration option was set to something other
    > than "no".   The value urc=2 corresponds to enum
    > UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
    > 
    > By default gni libfabric picks the value 2048 for
    > cache_attr.max_entries, which it gets from
    > domain.udreg_reg_limit.   The comment associated with the
    > initialization of udreg_reg_limit says "we are likely sharing udreg
    > entries with Craypich if we're using udreg cache so only ask for half
    > the entries by default".
    > 
    > If you are running multiple processes on the same node, then the
    > total of the entries in use by all the processes is likely exceeding
    > the maximum number available in the kernel.  Does the first process
    > succeed and the others fail?
    
    Yes, that seems to be what's happening here.
    
    > You can reduce the value of udreg_reg_limit, see the fi_gni man page
    > where it talks about the domain symbol
    > "GNI_MR_UDREG_LIMIT".   Setting it to a lower value like 1024 should
    > allow additional processes to succeed.   Alternatively you could
    > reconfigure libfabric by specifying "--with-kdreg=no" and the problem
    > should go away.
    
    Thanks for the guidance.
    
    I'm glad to make it explicit, but in my current build config.h
    already has
    
    /* Define to 1 if kdreg available */
    #define HAVE_KDREG 0
    
    and explicitly configuring with `--with-kdreg=no` made no difference.
    
    Is there an environment variable for overriding domain symbols at run
    time?  I just hacked a lower default value of 128 into
    prov/gni/src/gnix_dom.c.
    
    I was assuming I could run one libfabric client per core on our
    Cray.  Is that a bad assumption?  It sounds like I might be one of the
    first people who has wanted to run 64 libfabric processes per node on
    a KNL node; do you know of other resources I need to watch out for?
    
    The smaller udreg_reg_limit let me run two processes per node.  In
    fact, I could run 64 processes per node (... and found a bug in
    something that was not libfabric).
    
    Thanks for the tip.
    
    ==rob
    
    


