[libfabric-users] GNI: "invalid argument" when more than one client on node

Howard Pritchard hppritcha at gmail.com
Sat Sep 14 07:51:02 PDT 2019


Hi Robert

You may be able to get away with not using this kdreg feature.  It’s there
to protect the MR cache from behind-the-scenes munmaps etc. that may happen to
buffers allocated via malloc and later freed by the app.

Try adding

--with-kdreg=no

to the configure line and see if that helps.

Someday, when there’s nothing else to do, we may switch the GNI provider to
using the common MR cache and avoid this problem.



Howard

Kevan Rehm <krehm at cray.com> wrote on Fri, Sep 13, 2019 at 14:42:

> Robert,
>
> Caveat emptor, this information only comes from reading code, not actual
> experience.  The following message:
>
>         libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could
> not initialize udreg application cache, urc=2
>
> means that UDREG_CacheCreate() was called with a cache_attr.max_entries
> value which exceeds the remaining number of available kernel kdreg kcache
> entries.  cache_attr.modes also has the UDREG_MODE_USE_KERNEL_CACHE bit
> set, which implies that the symbol HAVE_KDREG was defined when libfabric
> was built, which implies that the "--with-kdreg" configuration option was
> set to something other than "no".   The value urc=2 corresponds to enum
> UDREG_RC_ERROR_RESOURCE in udreg_pub.h.
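
[Editorial sketch] When scanning a long trace for this condition, it can help to pull the fields out of each log line mechanically. The Python below is a minimal sketch, not part of libfabric: the field names (pid, provider, subsys, func, line, level) are labels inferred from the log lines quoted in this thread, and the only return-code name it knows is the one mentioned above (urc=2, UDREG_RC_ERROR_RESOURCE). Other codes are reported as "unknown" rather than guessed.

```python
import re

# Field names are this sketch's own labels, inferred from the
# log lines quoted in this thread; they are not an official format spec.
LOG_RE = re.compile(
    r"^libfabric:(?P<pid>\d+):(?P<provider>\w+):(?P<subsys>\w+):"
    r"(?P<func>\w+)\(\):(?P<line>\d+)<(?P<level>\w+)>\s*(?P<msg>.*)$"
)

# Only the value named in this thread is listed here.
KNOWN_URC = {2: "UDREG_RC_ERROR_RESOURCE"}


def parse_log_line(text):
    """Split one (unwrapped) libfabric log line into a dict, or return None."""
    m = LOG_RE.match(text.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Look for a "urc=<number>" token in the free-form message part.
    urc = re.search(r"\burc=(\d+)", rec["msg"])
    rec["urc"] = int(urc.group(1)) if urc else None
    rec["urc_name"] = KNOWN_URC.get(rec["urc"], "unknown")
    return rec


sample = (
    "libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] "
    "Could not initialize udreg application cache, urc=2"
)
rec = parse_log_line(sample)
print(rec["func"], rec["level"], rec["urc_name"])
```

Note that the mail client wrapped the log lines in this thread; the sketch assumes each message has been joined back onto a single line first.
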
>
> By default gni libfabric picks the value 2048 for cache_attr.max_entries,
> which it gets from domain.udreg_reg_limit.   The comment associated with
> the initialization of udreg_reg_limit says "we are likely sharing udreg
> entries with Craypich if we're using udreg cache so only ask for half the
> entries by default".
>
> If you are running multiple processes on the same node, then the total of
> all the entries in use by all the processes must exceed the maximum number
> available in the kernel.   Does the first process succeed, then the others
> fail?
>
> You can reduce the value of udreg_reg_limit, see the fi_gni man page where
> it talks about the domain symbol "GNI_MR_UDREG_LIMIT".   Setting it to a
> lower value like 1024 should allow additional processes to succeed.
> Alternatively you could reconfigure libfabric by specifying
> "--with-kdreg=no" and the problem should go away.
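
[Editorial sketch] The arithmetic behind the advice above can be illustrated with a toy model. This Python is only a sketch under stated assumptions: it assumes each process reserves its full max_entries when its cache is created, and the node-wide capacity of 4096 is a hypothetical figure chosen for illustration (the thread does not state the real kernel limit). The function name is also invented for this example.

```python
def count_successful(capacity, per_process, nprocs):
    """Count how many of nprocs processes get their cache in this toy model.

    capacity    -- hypothetical number of kernel kdreg entries on the node
    per_process -- entries each process asks for (cache_attr.max_entries)
    """
    remaining = capacity
    ok = 0
    for _ in range(nprocs):
        if per_process > remaining:
            # Models cache creation failing because the request exceeds
            # the entries still available (the urc=2 case in this thread).
            break
        remaining -= per_process
        ok += 1
    return ok


# Hypothetical capacity, for illustration only.
HYPOTHETICAL_CAPACITY = 4096

# Default request of 2048 entries per process, four processes on the node.
print(count_successful(HYPOTHETICAL_CAPACITY, 2048, 4))

# Reduced request of 1024 entries per process, same four processes.
print(count_successful(HYPOTHETICAL_CAPACITY, 1024, 4))
```

Under these assumptions, halving the per-process request doubles the number of processes that fit, which is the effect the lower limit is meant to achieve. Any other consumer of the same entries on the node would reduce the capacity further.
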
>
> Good luck,
>
> Kevan
>
> On 9/12/19, 8:59 PM, "Libfabric-users on behalf of Latham, Robert J. via
> Libfabric-users" <libfabric-users-bounces at lists.openfabrics.org on behalf
> of libfabric-users at lists.openfabrics.org> wrote:
>
>     I have a distributed service using libfabric on our Cray that seems to
>     work ok as long as it is just one client.   If I have two servers and
>     two clients, I get an error about invalid flags passed to
>     gnix_mr_reg.  The trace (last few lines of which I have included below)
>     ends at this check for several parameters but I don't know which one
>     was invalid (yet.  I'll patch in some debugging and find out more in
>     the morning)
>
>
>
> https://github.com/ofiwg/libfabric/blob/master/prov/gni/src/gnix_mr.c#L230
>
>     We got into this path from a call to 'fi_enable', which only takes one
>     argument and doesn't seem like the kind of routine I can call
>     incorrectly.
>
>     Any suggestions what I'm doing wrong here?
>
>     libfabric:134822:gni:fabric:_gnix_resolve_gni_ep_name():120<trace>
>     [134822:1]
>     libfabric:134822:gni:ep_ctrl:_gnix_cm_nic_alloc():628<info> [134822:1]
>     creating cm_nic for 219/0x44710000/15360001
>     libfabric:134822:gni:ep_ctrl:gnix_nic_alloc():954<trace> [134822:1]
>     libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:gnix_ep_bind():1813<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:gnix_ep_control():1529<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_vc_cm_init():2217<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_reg_recv_fn():505<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_cm_nic_enable():523<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_alloc():244<trace> [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_dgram_wc_post():312<trace>
>     [149618:1]
>     libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1]
>     Added obj(0xbe4170) to set(0xbb91c8)
>     libfabric:149618:gni:ep_ctrl:_gnix_prog_obj_add():101<info> [149618:1]
>     Added obj(0xbc1e40) to set(0xbb91c8)
>     libfabric:149618:gni:mr:_gnix_mr_reg():222<trace> [149618:1]
>     libfabric:149618:gni:mr:_gnix_mr_reg():224<info> [149618:1] reg:
>     buf=0x1e83620 len=8192
>     libfabric:149618:gni:mr:__udreg_init():824<warn> [149618:1] Could not
>     initialize udreg application cache, urc=2
>     libfabric:149618:gni:ep_data:_gnix_ep_int_tx_pool_grow():112<warn>
>     [149618:1] gnix_mr_req returned: Invalid argument
>
>
>     _______________________________________________
>     Libfabric-users mailing list
>     Libfabric-users at lists.openfabrics.org
>     https://lists.openfabrics.org/mailman/listinfo/libfabric-users
>