[libfabric-users] RMA with gni provider

Kevan Rehm krehm at cray.com
Fri Aug 23 09:59:29 PDT 2019


Rob,

What are you using for the value of the dest_addr argument in fi_write?  It looks like it is 0xffffffffffffffff, which is an invalid task address (it is the value of FI_ADDR_UNSPEC).  GNI tries to look that address up in its SRC_ADDR-to-raw-address map and doesn't find a match, so it doesn't know which remote task to communicate with.  That is why the call returns -2 (-FI_ENOENT, "No such file or directory").
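
As a quick way to confirm this from the calling code, a check like the one below could be placed just before the fi_write() call.  This is only a sketch: the helper name rma_dest_addr_is_valid is made up for illustration, and the two definitions at the top are stand-ins that mirror what <rdma/fabric.h> provides, so that the snippet compiles without the libfabric headers.  In the real code you would include <rdma/fabric.h> instead and pass in whatever fi_addr_t your library hands to fi_write.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-ins mirroring the definitions in <rdma/fabric.h>, so this sketch
 * compiles on its own. In real code, include <rdma/fabric.h> and delete
 * these two lines. */
typedef uint64_t fi_addr_t;
#define FI_ADDR_UNSPEC ((uint64_t) -1)

/* Hypothetical helper (not part of libfabric): returns 1 if dest_addr
 * looks usable as an RMA target, 0 if it is still the "unspecified"
 * sentinel value. */
static int rma_dest_addr_is_valid(fi_addr_t dest_addr)
{
    if (dest_addr == FI_ADDR_UNSPEC) {
        fprintf(stderr,
                "dest_addr is FI_ADDR_UNSPEC (0x%016llx); "
                "was the peer's address ever inserted into the AV?\n",
                (unsigned long long) dest_addr);
        return 0;
    }
    return 1;
}
```

If the check fires on the server, the thing to track down is where the client's address was supposed to be resolved to an fi_addr_t (normally via fi_av_insert) and why that step was skipped or failed when the two sides are launched by separate apruns.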

It's possible (this is a bit of a guess) that when you place client and server on the same node, you are using XPMEM to communicate with the server via shared memory rather than going through the NIC.  Try setting
	export GNIX_DISABLE_XPMEM=1
and run it again to see whether it now fails when client and server are on the same node as well.

Kevan

On 8/23/19, 11:21 AM, "Libfabric-users on behalf of Latham, Robert J. via Libfabric-users" <libfabric-users-bounces at lists.openfabrics.org on behalf of libfabric-users at lists.openfabrics.org> wrote:

    I'm a few libraries removed from libfabric in this code, but running
    out of debugging ideas.
    
    I'm running on a Cray and have one aprun placing a server on one set of
    nodes, and another aprun placing the client on another.  I've set up
    protection domains so the client and server can communicate with each
    other.
    
    I don't know exactly what I'm seeing here but it looks like point-to-
    point is working enough that the client can ask the server to RDMA some
    data to it, and the rdma fails.
    
    I've attached the FI_DEBUG logs from the client (57763) and server
    (21775).  Here are the last few lines from the server. The calling
    code is issuing an fi_write() call that returns -2:
    
    libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1] 
    libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5] 
    libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg:
    buf=0x2aaab0000b00 len=63
    libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1] 
    libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5]
    info=0xa2d170 auth_key=0xa29d30
    libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
    libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1] 
    libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5] 
    libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5] 
    libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn>
    [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such
    file or directory
    libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5] 
    libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn>
    [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such
    file or directory 
    libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5]
    __gnix_vc_get_vc_by_fi_addr returned No such file or directory
    libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5]
    _gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
    libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1] 
    
    
    
    I was hoping to modify some of the prov/gni/test cases to show what I'm
    trying to do, but none of them run under separate aprun instances.
    Things seem to work just fine if I am operating in a single aprun (for
    example, an MPI program that splits a communicator into client and
    server is able to execute this code path).
    
    Thanks
    ==rob
    
    
    


