[libfabric-users] RMA with gni provider

Latham, Robert J. robl at mcs.anl.gov
Fri Aug 23 09:21:30 PDT 2019


I'm a few libraries removed from libfabric in this code, but I'm running
out of debugging ideas.

I'm running on a Cray and have one aprun placing a server on one set of
nodes, and another aprun placing the client on a different set.  I've set
up protection domains so the client and server can communicate with each
other.

I don't know exactly what I'm seeing here, but it looks like
point-to-point is working well enough that the client can ask the server
to RDMA some data to it, and then the RDMA fails.

I've attached the FI_DEBUG logs from the client (57763) and server
(21775).  Here are the last few lines from the server. The calling code
is issuing an fi_write() call that returns -2:

libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:_gnix_mr_reg():222<trace> [21775:5]
libfabric:21775:gni:mr:_gnix_mr_reg():224<info> [21775:5] reg: buf=0x2aaab0000b00 len=63
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:mr:__udreg_register():768<info> [21775:5] info=0xa2d170 auth_key=0xa29d30
libfabric:21775:gni:mr:__udreg_register():769<info> [21775:5] ptag=192
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
libfabric:21775:gni:ep_ctrl:_gnix_vc_ep_get_vc():2167<trace> [21775:5]
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_lookup_unmapped():168<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
libfabric:21775:gni:av:_gnix_av_lookup():516<trace> [21775:5]
libfabric:21775:gni:ep_data:__gnix_vc_get_vc_by_fi_addr():257<warn> [21775:5] _gnix_av_lookup for addr 0xffffffffffffffff returned No such file or directory
libfabric:21775:gni:ep_data:_gnix_vc_ep_get_vc():2174<warn> [21775:5] __gnix_vc_get_vc_by_fi_addr returned No such file or directory
libfabric:21775:gni:ep_data:_gnix_rma():1478<info> [21775:5] _gnix_vc_ep_get_vc() failed, addr: ffffffffffffffff, rc:-2
libfabric:21775:gni:eq:gnix_wait_wait():456<trace> [21775:1]
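
For reference, here is a minimal sketch of how I am reading the two
values in the log above. The -2 return code is a negated errno (ENOENT,
"No such file or directory"), and the address 0xffffffffffffffff is the
same all-ones bit pattern that libfabric uses for FI_ADDR_UNSPEC. The
helper names and the SKETCH_ADDR_UNSPEC macro are hypothetical and only
mirror those values so that the snippet compiles without the libfabric
headers. That the destination address was never resolved through the
address vector is my guess from the log, not something it confirms.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Assumption: same value as libfabric's FI_ADDR_UNSPEC, ((uint64_t) -1). */
#define SKETCH_ADDR_UNSPEC UINT64_MAX

/* Hypothetical helper: text for a 0-or-negative errno-style return code,
 * the convention followed by calls such as fi_write(). */
static const char *sketch_rc_text(int rc)
{
    assert(rc <= 0);
    return strerror(-rc);
}

/* Hypothetical helper: nonzero if addr is the all-ones "unspecified"
 * pattern, i.e. the value that shows up in the gni provider log. */
static int sketch_addr_is_unspec(uint64_t addr)
{
    return addr == SKETCH_ADDR_UNSPEC;
}
```

Calling sketch_rc_text(-2) gives the same string the provider prints,
and sketch_addr_is_unspec() is the kind of check the calling code could
make on the destination address before it issues fi_write().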



I was hoping to modify some of the prov/gni/test cases to show what I'm
trying to do, but none of them run under separate aprun instances.
Things seem to work just fine if I am operating in a single aprun (for
example, an MPI program that splits a communicator into client and
server is able to execute this code path).

Thanks
==rob


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: libfabric-21775.out
URL: <http://lists.openfabrics.org/pipermail/libfabric-users/attachments/20190823/92ca6cbe/attachment.ksh>

