[libfabric-users] GNI: "invalid argument" when more than one client on node

Kevan Rehm krehm at cray.com
Fri Sep 13 13:42:33 PDT 2019


Robert,

Caveat emptor: this information comes only from reading the code, not from actual experience.  The following message:

	libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could not initialize udreg application cache, urc=2

means that UDREG_CacheCreate() was called with a cache_attr.max_entries value that exceeds the number of kernel kdreg kcache entries still available.  cache_attr.modes also has the UDREG_MODE_USE_KERNEL_CACHE bit set, which implies that the symbol HAVE_KDREG was defined when libfabric was built, which in turn implies that the "--with-kdreg" configuration option was set to something other than "no".  The value urc=2 corresponds to the enum value UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
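If you need to find out which processes hit this, a small script can pull the pid and urc value out of the debug output.  This is only a sketch: the regular expression is inferred from the single warning line quoted above, the function name parse_udreg_warning is made up for illustration, and only the urc=2 mapping mentioned above is filled in (other codes are reported as "unknown" rather than guessed).

```python
import re

# Only the mapping stated in the message above is included; other udreg
# return codes are deliberately left out rather than guessed.
KNOWN_UDREG_RC = {2: "UDREG_RC_ERROR_RESOURCE"}

# Field layout inferred from the one warning line quoted above (assumption).
LOG_RE = re.compile(
    r"libfabric:(?P<pid>\d+):(?P<prov>\w+):(?P<subsys>\w+):"
    r"(?P<func>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s+"
    r"\[(?P<tag>[^\]]*)\]\s*(?P<msg>.*)"
)


def parse_udreg_warning(text):
    """Return pid, function, and urc for a udreg warning line, else None."""
    m = LOG_RE.search(text)
    if m is None:
        return None
    urc = re.search(r"urc=(\d+)", m.group("msg"))
    if urc is None:
        return None
    code = int(urc.group(1))
    return {
        "pid": int(m.group("pid")),
        "function": m.group("func"),
        "urc": code,
        "urc_name": KNOWN_UDREG_RC.get(code, "unknown"),
    }


sample = (
    "libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] "
    "Could not initialize udreg application cache, urc=2"
)
print(parse_udreg_warning(sample))
```

Note that a log line wrapped across two lines by a mail client would have to be rejoined before it is passed to this function.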

By default the libfabric gni provider picks the value 2048 for cache_attr.max_entries, which it gets from domain.udreg_reg_limit.  The comment associated with the initialization of udreg_reg_limit says "we are likely sharing udreg entries with Craypich if we're using udreg cache so only ask for half the entries by default".

If you are running multiple processes on the same node, then the total number of entries requested by all of the processes presumably exceeds the maximum number available in the kernel.  Does the first process succeed, and then the others fail?

You can reduce the value of udreg_reg_limit; see the fi_gni man page, where it describes the domain symbol "GNI_MR_UDREG_REG_LIMIT".  Setting it to a lower value such as 1024 should allow additional processes to succeed.  Alternatively, you could reconfigure libfabric with "--with-kdreg=no", and the problem should go away.
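The arithmetic behind picking a lower limit can be sketched as follows.  The node-wide capacity used here (4096) is purely an illustrative assumption, since I don't know the real kernel limit on your system, and the helper names are made up.  Anything else on the node that uses udreg entries (Cray MPICH, for example) would also have to come out of the same budget.

```python
DEFAULT_UDREG_REG_LIMIT = 2048  # default per-process request cited above

# Hypothetical node-wide number of kernel cache entries. This is an
# assumption for illustration only, not a value taken from the kernel.
ASSUMED_NODE_CAPACITY = 4096


def fits(procs_per_node, per_proc_limit, node_capacity):
    """True if every process on the node can get its requested entries."""
    return procs_per_node * per_proc_limit <= node_capacity


def max_per_proc_limit(procs_per_node, node_capacity):
    """Largest per-process limit that still lets all processes succeed."""
    if procs_per_node < 1:
        raise ValueError("procs_per_node must be at least 1")
    return node_capacity // procs_per_node


for procs in (1, 2, 4):
    ok = fits(procs, DEFAULT_UDREG_REG_LIMIT, ASSUMED_NODE_CAPACITY)
    best = max_per_proc_limit(procs, ASSUMED_NODE_CAPACITY)
    print(f"{procs} procs: default limit fits={ok}, largest usable limit={best}")
```

Under that assumed capacity, four processes per node would not fit at the default of 2048 each but would fit at 1024 each.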

Good luck,

Kevan

On 9/12/19, 8:59 PM, "Libfabric-users on behalf of Latham, Robert J. via Libfabric-users" <libfabric-users-bounces at lists.openfabrics.org on behalf of libfabric-users at lists.openfabrics.org> wrote:

    I have a distributed service using libfabric on our Cray that seems to
    work ok as long as it is just one client.   If I have two servers and
    two clients, I get an error about invalid flags passed to
    gnix_mr_reg.  The trace (last few lines of which I have included below)
    ends at this check for several parameters but I don't know which one
    was invalid (yet.  I'll patch in some debugging and find out more in
    the morning)
    
    
    https://github.com/ofiwg/libfabric/blob/master/prov/gni/src/gnix_mr.c#L230
    
    We got into this path from a call to 'fi_enable', which only takes one
    argument and doesn't seem like the kind of routine I can call
    incorrectly.
    
    Any suggestions what I'm doing wrong here?
    
    libfabric:134822:gni:fabric:_gnix_resolve_gni_ep_name():120<trace> [134822:1]
    libfabric:134822:gni:ep_ctrl:_gnix_cm_nic_alloc():628<info> [134822:1] creating cm_nic for 219/0x44710000/15360001
    libfabric:134822:gni:ep_ctrl:gnix_nic_alloc():954<trace> [134822:1]
    libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:gnix_ep_control():1529<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_vc_cm_init():2217<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_reg_recv_fn():505<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_enable():523<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
    libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1] Added obj(0xbe4170) to set(0xbb91c8)
    libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1] Added obj(0xbc1e40) to set(0xbb91c8)
    libfabric:149618:gni:mr:_gnix_mr_reg():222<trace> [149618:1]
    libfabric:149618:gni:mr:_gnix_mr_reg():224<info> [149618:1] reg: buf=0x1e83620 len=8192
    libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could not initialize udreg application cache, urc=2
    libfabric:149618:gni:ep_data:_gnix_ep_int_tx_pool_grow():112<warn> [149618:1] gnix_mr_req returned: Invalid argument
    
    
    


