[libfabric-users] connection-less send/recv with verbs

Maurizio Drocco drocco at di.unito.it
Thu Jul 13 13:22:42 PDT 2017


Thank you Sean,

I forgot to mention the version of libfabric I was using: git master, commit 8d192f2.

The backtrace is as follows (buf points to a 32-byte memory chunk):

#0  0x00007ffff798dd7e in fi_ibv_rdm_init_recv_request ()
   from /archive/home/mdrocco/usr/lib/libfabric.so.1
#1  0x00007ffff798c728 in fi_ibv_rdm_recvmsg ()
   from /archive/home/mdrocco/usr/lib/libfabric.so.1
#2  0x00007ffff798c8bf in fi_ibv_rdm_recv ()
   from /archive/home/mdrocco/usr/lib/libfabric.so.1
#3  0x0000000000402937 in fi_recv (ep=0x6303d0, buf=0x633810, len=32,
    desc=0x0, src_addr=18446744073709551615, context=0x0)
    at /archive/home/mdrocco/usr/include/rdma/fi_endpoint.h:263
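
A note on reading frame #3: the src_addr value 18446744073709551615 is the largest 64-bit unsigned integer, i.e. (uint64_t) -1. Assuming the usual definition of FI_ADDR_UNSPEC in <rdma/fabric.h>, this means the receive was posted without a specific source address, so the argument itself is not garbage. The small C sketch below only illustrates that decoding; the macro and helper names are hypothetical and it does not call libfabric.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: mirrors FI_ADDR_UNSPEC from <rdma/fabric.h>, i.e. ((uint64_t) -1).
 * Defined locally so the sketch builds without libfabric headers. */
#define SKETCH_ADDR_UNSPEC ((uint64_t) -1)

/* Hypothetical helper: returns 1 if addr is the "unspecified" value, else 0. */
static int is_unspecified_source(uint64_t addr)
{
    return addr == SKETCH_ADDR_UNSPEC;
}

/* Hypothetical helper: writes a readable description of addr into out.
 * Returns the snprintf result (number of characters that were needed). */
static int describe_src_addr(uint64_t addr, char *out, size_t out_len)
{
    if (is_unspecified_source(addr))
        return snprintf(out, out_len, "unspecified (any source)");
    return snprintf(out, out_len, "AV address handle %" PRIu64, addr);
}
```

Calling describe_src_addr(18446744073709551615ULL, buf, sizeof buf) with the value from the backtrace fills buf with the "unspecified" description, which suggests looking for the fault in the provider's receive path rather than in the arguments passed to fi_recv.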

I double-checked that the endpoint is enabled before calling fi_recv.

I will try 1.5.0rc1 with the "ofi-rxm:verbs" provider combination as soon as possible, as suggested.

---
Maurizio Drocco
PhD Candidate
University of Torino, department of Computer Science
Via Pessinetto 12, 10149 Torino - Italy

> On 13 Jul 2017, at 19:00, Hefty, Sean <sean.hefty at intel.com> wrote:
> 
>> The scenario:
>> The code is an all-to-all network of processes, with connection-less
>> send/recv communication.
>> All addresses and services are known statically at start time.
>> Each process has an endpoint, to which it posts both send and recv
>> requests (via fi_send/fi_recv); the endpoint is created from a fabric
>> that is created by passing its address, its service and FI_SOURCE flag
>> to fi_getinfo.
>> Then each process fills an AV table with address/service of all the
>> other nodes.
>> 
>> The problem:
>> With verbs, the code crashes on the first call to fi_recv, with the
>> following call stack:
>> fi_recv - fi_ibv_rdm_recv - fi_ibv_rdm_recvmsg -
>> fi_ibv_rdm_init_recv_request
>> 
>> Do you have any idea about what is going on? If it helps, I can
>> recompile libfabric with some options for debugging.
> 
> Do you have a backtrace available?  This sounds like a possible null pointer dereference.
> 
> If you have access to 1.5.0rc1, you can try using the "ofi-rxm:verbs" provider combination instead of the verbs rdm support.  Verbs rdm support has limited testing and specifically targets Intel MPI use.
> 
> The only other idea I have without more details is to ensure that the endpoint has been enabled prior to posting receive buffers.
> 
> - Sean
