[libfabric-users] RMA with gni provider
Kevan Rehm
krehm at cray.com
Fri Aug 23 09:59:29 PDT 2019
Rob,
What are you using for the value of the dest_addr field in fi_write? From your log it looks like it is 0xffffffffffffffff, which is an invalid task address. GNI tries to look that address up in its SRC_ADDR-to-raw-address map and doesn't find a match, so it doesn't know which remote task to communicate with.
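To illustrate the kind of check that would catch this before the call reaches the provider, here is a minimal sketch. In libfabric, fi_addr_t is a uint64_t and FI_ADDR_UNSPEC is defined as ((uint64_t) -1), which is the all-ones value in the log. The names sketch_addr_t, SKETCH_ADDR_UNSPEC and check_dest_addr are hypothetical stand-ins so the example compiles without the libfabric headers; real code would include <rdma/fabric.h> and compare against FI_ADDR_UNSPEC directly.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins (assumptions for this sketch only) mirroring libfabric's
 * fi_addr_t and FI_ADDR_UNSPEC so the example builds on its own. */
typedef uint64_t sketch_addr_t;
#define SKETCH_ADDR_UNSPEC ((uint64_t) -1)

/* Hypothetical helper: returns 0 if dest_addr is not the "unspecified"
 * sentinel, or -ENOENT (the -2 seen in the log) if it is.
 * It does not prove the address is actually present in the AV. */
static int check_dest_addr(sketch_addr_t dest_addr)
{
    if (dest_addr == SKETCH_ADDR_UNSPEC) {
        fprintf(stderr,
                "dest_addr is 0x%llx (unset); was the peer inserted into the AV?\n",
                (unsigned long long) dest_addr);
        return -ENOENT;
    }
    return 0;
}
```

Calling something like check_dest_addr(dest_addr) just before fi_write would show whether the address was never filled in by one of the layers above libfabric, as opposed to being mapped and later lost.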
This is a bit of a guess, but it's possible that when you place the client and server on the same node, they communicate via shared memory using XPMEM rather than going through the NIC. Try setting
export GNIX_DISABLE_XPMEM=1
and run it again to see whether it now fails when the client and server are on the same node.
Kevan
On 8/23/19, 11:21 AM, "Libfabric-users on behalf of Latham, Robert J. via Libfabric-users" <libfabric-users-bounces at lists.openfabrics.org on behalf of libfabric-users at lists.openfabrics.org> wrote:
I'm a few libraries removed from libfabric in this code, but running
out of debugging ideas.
I'm running on a Cray and have one aprun placing a server on one set of
nodes, and another aprun placing the client on another. I've set up
protection domains so the client and server can communicate with each
other.
I don't know exactly what I'm seeing here but it looks like point-to-
point is working enough that the client can ask the server to RDMA some
data to it, and the RDMA fails.
I've attached the FI_DEBUG logs from the client (57763) and server
(21775). Here are the last few lines from the server. The calling code
is issuing an fi_write() call that returns -2:
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5]
libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg: buf=0x2aaab0000b00 len=63
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5] info=0xa2d170 auth_key=0xa29d30
libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5]
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5] __gnix_vc_get_vc_by_fi_addr returned No such file or directory
libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5] _gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
I was hoping to modify some of the prov/gni/test cases to show what I'm
trying to do, but none of them run under separate aprun
instances. Things seem to work just fine if I am operating in a single
aprun (for example, an MPI program that splits a communicator into
client and server is able to execute this code path).
Thanks
==rob