[libfabric-users] connection-less send/recv with verbs
Hefty, Sean
sean.hefty at intel.com
Thu Jul 13 10:00:54 PDT 2017
> The scenario:
> The code is an all-to-all network of processes, with connection-less
> send/recv communication.
> All addresses and services are known statically at start time.
> Each process has an endpoint, to which it posts both send and recv
> requests (via fi_send/fi_recv); the endpoint is created from a fabric
> that is created by passing its address, its service and FI_SOURCE flag
> to fi_getinfo.
> Then each process fills an AV table with address/service of all the
> other nodes.
>
> The problem:
> With verbs, the code crashes on the first call to fi_recv, with the
> following call stack:
> fi_recv - fi_ibv_rdm_recv - fi_ibv_rdm_recvmsg -
> fi_ibv_rdm_init_recv_request
>
> Do you have any idea about what is going on? If it helps, I can
> recompile libfabric with some options for debugging.
Do you have a backtrace available? This sounds like a possible null pointer dereference.
If you have access to 1.5.0rc1, you can try using the "ofi-rxm:verbs" provider combination instead of the verbs rdm support. Verbs rdm support has limited testing and specifically targets Intel MPI use.
Without more details, the only other idea I have is to make sure the endpoint has been enabled (fi_enable) before any receive buffers are posted to it.
- Sean