[libfabric-users] fi_read questions
Arne
arnestruck at astruck.de
Fri Oct 16 10:42:30 PDT 2020
>> I get the ENOENT from the fi_read call directly, which to me suggests
>> that it can't find the peer memory region.
> If fi_read() is returning an error code directly, the failure is local and the operation is never being queued or posted. If you can enable debug output, it may help identify where the error is coming from.
>
> Are you using MSG EPs? Or RDM EPs?
>
> I see in the code where the sockets provider will return ENOENT for RDM EPs if the peer's address cannot be found in the AV.
The ep_attr->type is set to FI_EP_MSG, since the program I am modifying
was built with connection-based communication in mind.
An FI_CONNECTED event should have been received for the connection;
otherwise the program would have printed a warning to the shell.
Since I am not an admin on the target system, I unfortunately can't enable
debug output myself and have to request it (libfabric needs to be
compiled with debug enabled), which could take a while.
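Until then, the return value itself can at least be decoded without a
debug build. A minimal sketch, assuming the usual libfabric convention
that a failing call returns a negated errno-style value; describe_ret is
a hypothetical helper name, and in the real program fi_strerror(-ret)
could be used in place of strerror:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libfabric): formats a return code
 * into buf. Assumes a failure is reported as a negated errno value. */
static const char *describe_ret(long ret, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    if (ret >= 0)
        snprintf(buf, len, "ok");
    else
        snprintf(buf, len, "error %ld: %s", -ret, strerror((int)-ret));
    return buf;
}
```

Calling describe_ret with the value returned by fi_read and printing the
result would show whether the code really is ENOENT or something else.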
I find it curious that this is a local problem, since it only happens
when the client and server run on different nodes.
But if it is a local issue, my only guess would be that fi_read
has issues with memory allocated via glib's slice mechanism
and g_malloc, which is how buf is allocated.
I could check whether the error still occurs if I allocate the
memory manually and use that instead.
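A minimal sketch of what that test allocation could look like, using
plain POSIX calls instead of glib. The helper name alloc_page_aligned is
hypothetical, and the page alignment is my own assumption, not a known
requirement of fi_read:

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: returns a zeroed, page-aligned buffer of len
 * bytes, or NULL on failure. The caller releases it with free(). */
static void *alloc_page_aligned(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;

    assert(len > 0);
    if (page <= 0)
        return NULL;
    /* posix_memalign returns 0 on success and an error number otherwise. */
    if (posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    memset(buf, 0, len);
    return buf;
}
```

The buffer returned here would replace the g_malloc'd buf for the test;
everything else (registration and the fi_read call) stays unchanged.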