[libfabric-users] RMA with gni provider

Latham, Robert J. robl at mcs.anl.gov
Sat Aug 24 05:54:26 PDT 2019


On Fri, 2019-08-23 at 16:59 +0000, Kevan Rehm wrote:
> Rob,
> 
> What are you using for the value of the dest_addr field in
> fi_write?  It looks like it is 0xffffffffffffffff, which is an invalid
> task address.  GNI tries to look that address up in its
> SRC_ADDR-to-raw-address map and doesn't find a match, so it doesn't
> know which remote task to communicate with.
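
For anyone hitting the same symptom: the lookup failure described above
can be picked out of an FI_DEBUG log mechanically.  The following is only
a minimal sketch in Python, not part of libfabric.  The helper names
(failed_lookups, is_sentinel, describe_rc) are hypothetical, the regex is
modelled on the gni messages quoted in this thread, and the all-ones
constant assumes that libfabric defines FI_ADDR_UNSPEC and
FI_ADDR_NOTAVAIL as (uint64_t)-1.

```python
import errno
import os
import re

# Assumption: fabric.h defines FI_ADDR_UNSPEC / FI_ADDR_NOTAVAIL as (uint64_t)-1.
FI_ADDR_SENTINEL = 0xFFFFFFFFFFFFFFFF

# Illustrative pattern for the gni provider warning quoted in this thread.
# \s+ tolerates plain line wraps introduced by mail clients.
LOOKUP_FAILURE = re.compile(
    r"_gnix_av_lookup\s+for\s+addr\s+(0x[0-9a-fA-F]+)\s+returned"
)


def failed_lookups(log_text):
    """Return, as ints, the addresses that _gnix_av_lookup failed to resolve."""
    return [int(m.group(1), 16) for m in LOOKUP_FAILURE.finditer(log_text)]


def is_sentinel(addr):
    """True if addr is the all-ones 'no address' value rather than a real peer."""
    return addr == FI_ADDR_SENTINEL


def describe_rc(rc):
    """Map a negative return code such as -2 to its errno name and message."""
    code = -rc
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```

Fed the server log from this thread, failed_lookups() should report the
all-ones address, and describe_rc(-2) names the errno behind the "No such
file or directory" text.  A hit on the sentinel suggests the caller never
had a valid fi_addr_t for the peer, as opposed to a stale entry in the
address vector.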

Thanks for the follow up.  I tried a few things, and in the end it
turned out that libfabric-1.8.0 gives me this error while libfabric
built from master does not (it looks like I have commit 26464d4ec15b;
spack's build of libfabric at develop worked too).

==rob

> It's possible, this is a bit of a guess, that when you place client
> and server on the same node, that you are using XPMEM to communicate
> with the server via shared memory, rather than going through the
> NIC.  Try setting
> 	export GNIX_DISABLE_XPMEM=1
> and run it again, see if it now bombs when client and server are on
> the same node.
> 
> Kevan
> 
> On 8/23/19, 11:21 AM, "Libfabric-users on behalf of Latham, Robert
> J. via Libfabric-users" <
> libfabric-users-bounces at lists.openfabrics.org on behalf of 
> libfabric-users at lists.openfabrics.org> wrote:
> 
>     I'm a few libraries removed from libfabric in this code, but running
>     out of debugging ideas.
>     
>     I'm running on a Cray and have one aprun placing a server on one set of
>     nodes, and another aprun placing the client on another.  I've set up
>     protection domains so the client and server can communicate with each
>     other.
>     
>     I don't know exactly what I'm seeing here but it looks like
>     point-to-point is working enough that the client can ask the server to
>     RDMA some data to it, and the RDMA fails.
>     
>     I've attached the FI_DEBUG logs from the client (57763) and server
>     (21775).  Here are the last few lines from the server. The calling
>     code is issuing an fi_write() call that returns -2:
>     
>     libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
>     libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5]
>     libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg: buf=0x2aaab0000b00 len=63
>     libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
>     libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5] info=0xa2d170 auth_key=0xa29d30
>     libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
>     libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
>     libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5]
>     libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
>     libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
>     libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
>     libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
>     libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5] __gnix_vc_get_vc_by_fi_addr returned No such file or directory
>     libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5] _gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
>     libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
>     
>     
>     
>     I was hoping to modify some of the prov/gni/test cases to show what I'm
>     trying to do, but none of them run under separate aprun
>     instances.  Things seem to work just fine if I am operating in a single
>     aprun (for example, an MPI program that splits a communicator into
>     client and server is able to execute this code path).
>     
>     Thanks
>     ==rob
>     
>     
>     
> 


