[libfabric-users] RMA with gni provider
Latham, Robert J.
robl at mcs.anl.gov
Fri Aug 23 09:21:30 PDT 2019
I'm a few libraries removed from libfabric in this code, but I'm
running out of debugging ideas.
I'm running on a Cray and have one aprun placing a server on one set of
nodes, and another aprun placing the client on a different set. I've
set up protection domains so the client and server can communicate with
each other.
I don't know exactly what I'm seeing here, but it looks like
point-to-point is working well enough that the client can ask the
server to RDMA some data to it, and then the RDMA fails.
I've attached the FI_DEBUG logs from the client (57763) and server
(21775). Here are the last few lines from the server. The calling code
is issuing an fi_write() call that returns -2:
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5]
libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg:
buf=0x2aaab0000b00 len=63
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5]
info=0xa2d170 auth_key=0xa29d30
libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5]
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn>
[21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such
file or directory
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn>
[21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such
file or directory
libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5]
__gnix_vc_get_vc_by_fi_addr returned No such file or directory
libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5]
_gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
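
One way to read the final _gnix_rma() line above: the rc of -2 is a
negated errno (ENOENT, which matches the "No such file or directory"
text in the warnings), and the all-ones address looks like libfabric's
FI_ADDR_UNSPEC placeholder rather than an address that was inserted
into the address vector. The sketch below only decodes that one log
line. The constant is defined locally on the assumption that
FI_ADDR_UNSPEC is (uint64_t)-1, and decode_failure() is a hypothetical
helper name, not part of libfabric or the gni provider.

```python
import errno
import os
import re

# Assumption: libfabric defines FI_ADDR_UNSPEC as (uint64_t)-1 (all ones).
# Defined here so the sketch does not depend on any libfabric bindings.
FI_ADDR_UNSPEC = 0xFFFFFFFFFFFFFFFF

# The final _gnix_rma() line from the server log, with the mail wrap removed.
LOG_LINE = (
    "libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5] "
    "_gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2"
)

# Matches the "addr: <hex>, rc:<int>" tail of that log line.
FAILURE_RE = re.compile(r"addr: (?P<addr>[0-9a-fA-F]+), rc:(?P<rc>-?\d+)")


def decode_failure(line):
    """Hypothetical helper: return (addr, errno_name, message, is_unspec),
    or None when the line does not carry an addr/rc pair."""
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    addr = int(match.group("addr"), 16)
    code = abs(int(match.group("rc")))  # libfabric returns negated errno values
    name = errno.errorcode.get(code, "UNKNOWN")
    return addr, name, os.strerror(code), addr == FI_ADDR_UNSPEC


if __name__ == "__main__":
    print(decode_failure(LOG_LINE))
```

If that reading holds, the thing to check in the calling code would be
where the destination fi_addr_t passed to fi_write() comes from when
the client and server are started by separate apruns.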
I was hoping to modify some of the prov/gni/test cases to show what I'm
trying to do, but none of them run under separate aprun
instances. Things seem to work just fine if I am operating in a single
aprun (for example, an MPI program that splits a communicator into
client and server is able to execute this code path).
Thanks
==rob
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: libfabric-21775.out
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190823/92ca6cbe/attachment.ksh>