[libfabric-users] RMA with gni provider
Latham, Robert J.
robl at mcs.anl.gov
Sat Aug 24 05:54:26 PDT 2019
On Fri, 2019-08-23 at 16:59 +0000, Kevan Rehm wrote:
> Rob,
>
> What are you using for the value of the dest_addr field in
> fi_write? It looks like it is 0xffffffffff, which is an invalid
> task address, GNI tries to look that address up in its SRC_ADDR-to-
> raw-address map and doesn't find a match, so it doesn't know which
> remote task to communicate with.
Thanks for the follow-up. I tried a few things, and in the end it
turned out that libfabric-1.8.0 gives me this error, while libfabric
built from master (it looks like I have commit 26464d4ec15b; spack's
build of libfabric at develop worked too) does not.
==rob
> It's possible, this is a bit of a guess, that when you place client
> and server on the same node, that you are using XPMEM to communicate
> with the server via shared memory, rather than going through the
> NIC. Try setting
> export GNIX_DISABLE_XPMEM=1
> and run it again, see if it now bombs when client and server are on
> the same node.
>
> Kevan
>
> On 8/23/19, 11:21 AM, "Libfabric-users on behalf of Latham, Robert
> J. via Libfabric-users" <
> libfabric-users-bounces at lists.openfabrics.org on behalf of
> libfabric-users at lists.openfabrics.org> wrote:
>
> I'm a few libraries removed from libfabric in this code, but I'm
> running out of debugging ideas.
>
> I'm running on a Cray and have one aprun placing a server on one set
> of nodes, and another aprun placing the client on another. I've set
> up protection domains so the client and server can communicate with
> each other.
>
> I don't know exactly what I'm seeing here, but it looks like
> point-to-point is working well enough that the client can ask the
> server to RDMA some data to it, and the RDMA fails.
>
> I've attached the FI_DEBUG logs from the client (57763) and server
> (21775). Here are the last few lines from the server. The calling
> code is issuing an fi_write() call that returns -2:
>
> libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
> libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5]
> libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg: buf=0x2aaab0000b00 len=63
> libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
> libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5] info=0xa2d170 auth_key=0xa29d30
> libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
> libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
> libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5]
> libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
> libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
> libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
> libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
> libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5] __gnix_vc_get_vc_by_fi_addr returned No such file or directory
> libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5] _gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
> libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
>
>
>
> I was hoping to modify some of the prov/gni/test cases to show what
> I'm trying to do, but none of them run under separate aprun
> instances. Things seem to work just fine if I am operating in a
> single aprun (for example, an MPI program that splits a communicator
> into client and server is able to execute this code path).
>
> Thanks
> ==rob
>
>
>
>