[libfabric-users] GNI: "invalid argument" when more than one client on node
Kevan Rehm
krehm at cray.com
Fri Sep 13 13:42:33 PDT 2019
Robert,
Caveat emptor, this information only comes from reading code, not actual experience. The following message:
libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could not initialize udreg application cache, urc=2
means that UDREG_CacheCreate() was called with a cache_attr.max_entries value which exceeds the remaining number of available kernel kdreg kcache entries. cache_attr.modes also has the UDREG_MODE_USE_KERNEL_CACHE bit set, which implies that the symbol HAVE_KDREG was defined when libfabric was built, which implies that the "--with-kdreg" configuration option was set to something other than "no". The value urc=2 corresponds to enum UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
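If you want to pick the return code out of a captured log, something like the following would do. It is only a sketch: parse_urc() and urc_name() are hypothetical helper names, not part of libfabric or udreg, and the only mapping filled in is the one stated above (2 is UDREG_RC_ERROR_RESOURCE). Any other value would need to be looked up in udreg_pub.h.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extract the number following "urc=" from a
 * libfabric log line. Returns -1 if the line carries no such field. */
int parse_urc(const char *line)
{
    const char *p;
    int urc;

    assert(line != NULL);
    p = strstr(line, "urc=");
    if (p == NULL || sscanf(p, "urc=%d", &urc) != 1)
        return -1;
    return urc;
}

/* Hypothetical helper: only the value 2 is confirmed by the text above;
 * everything else is deliberately left for the reader to look up. */
const char *urc_name(int urc)
{
    return urc == 2 ? "UDREG_RC_ERROR_RESOURCE" : "(see udreg_pub.h)";
}
```

Calling parse_urc() on the warning line quoted above yields 2, which urc_name() reports as UDREG_RC_ERROR_RESOURCE.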
By default gni libfabric picks the value 2048 for cache_attr.max_entries, which it gets from domain.udreg_reg_limit. The comment associated with the initialization of udreg_reg_limit says "we are likely sharing udreg entries with Craypich if we're using udreg cache so only ask for half the entries by default".
If you are running multiple processes on the same node, then the total number of entries requested by all of those processes presumably exceeds the maximum number available in the kernel. Does the first process succeed, and then the others fail?
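The arithmetic behind that guess can be sketched as below. The per-process default of 2048 comes from the code as described above; the kernel-wide total is NOT something I have verified, so kernel_total is a purely hypothetical input, and procs_that_fit() is an illustrative name rather than anything in libfabric.

```c
#include <assert.h>

/* Default number of entries each process requests, per the description
 * of domain.udreg_reg_limit above. */
#define DEFAULT_PER_PROC_ENTRIES 2048u

/* Number of processes on one node that can each create a cache of
 * per_proc entries when kernel_total entries are available in all.
 * kernel_total is a hypothetical figure; check your own system. */
unsigned procs_that_fit(unsigned kernel_total, unsigned per_proc)
{
    assert(per_proc > 0);
    return kernel_total / per_proc;
}
```

For example, if the kernel happened to offer 4096 entries, procs_that_fit(4096, DEFAULT_PER_PROC_ENTRIES) is 2, while procs_that_fit(4096, 1024) is 4. Anything else on the node that uses udreg entries reduces the total available to your processes.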
You can reduce the value of udreg_reg_limit; see the fi_gni man page where it talks about the domain symbol "GNI_MR_UDREG_LIMIT". Setting it to a lower value such as 1024 should allow additional processes to succeed. Alternatively, you could reconfigure libfabric with "--with-kdreg=no", and the problem should go away.
Good luck,
Kevan
On 9/12/19, 8:59 PM, "Libfabric-users on behalf of Latham, Robert J. via Libfabric-users" <libfabric-users-bounces at lists.openfabrics.org on behalf of libfabric-users at lists.openfabrics.org> wrote:
I have a distributed service using libfabric on our Cray that seems to
work ok as long as it is just one client. If I have two servers and
two clients, I get an error about invalid flags passed to
gnix_mr_reg. The trace (last few lines of which I have included below)
ends at this check for several parameters, but I don't know which one
was invalid yet. (I'll patch in some debugging and find out more in
the morning.)
https://github.com/ofiwg/libfabric/blob/master/prov/gni/src/gnix_mr.c#L230
We got into this path from a call to 'fi_enable', which only takes one
argument and doesn't seem like the kind of routine I can call
incorrectly.
Any suggestions what I'm doing wrong here?
libfabric:134822:gni:fabric:_gnix_resolve_gni_ep_name():120<trace> [134822:1]
libfabric:134822:gni:ep_ctrl:_gnix_cm_nic_alloc():628<info> [134822:1] creating cm_nic for 219/0x44710000/15360001
libfabric:134822:gni:ep_ctrl:gnix_nic_alloc():954<trace> [134822:1]
libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:gnix_ep_control():1529<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_vc_cm_init():2217<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_reg_recv_fn():505<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_enable():523<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace> [149618:1]
libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1] Added obj(0xbe4170) to set(0xbb91c8)
libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1] Added obj(0xbc1e40) to set(0xbb91c8)
libfabric:149618:gni:mr:_gnix_mr_reg():222<trace> [149618:1]
libfabric:149618:gni:mr:_gnix_mr_reg():224<info> [149618:1] reg: buf=0x1e83620 len=8192
libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could not initialize udreg application cache, urc=2
libfabric:149618:gni:ep_data:_gnix_ep_int_tx_pool_grow():112<warn> [149618:1] gnix_mr_req returned: Invalid argument