<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><div class="">Thank you Sean,</div><div class=""><br class=""></div><div class="">I forgot to mention the version of libfabric I was using: git master, commit 8d192f2.</div><div class=""><br class=""></div><div class="">The backtrace is as follows (buf points to a 32-byte memory chunk):</div><div class=""><br class=""></div><div class="">#0 0x00007ffff798dd7e in fi_ibv_rdm_init_recv_request ()</div><div class=""> from /archive/home/mdrocco/usr/lib/libfabric.so.1</div><div class="">#1 0x00007ffff798c728 in fi_ibv_rdm_recvmsg ()</div><div class=""> from /archive/home/mdrocco/usr/lib/libfabric.so.1</div><div class="">#2 0x00007ffff798c8bf in fi_ibv_rdm_recv ()</div><div class=""> from /archive/home/mdrocco/usr/lib/libfabric.so.1</div><div class="">#3 0x0000000000402937 in fi_recv (ep=0x6303d0, buf=0x633810, len=32,</div><div class=""> desc=0x0, src_addr=18446744073709551615, context=0x0)</div><div class=""> at /archive/home/mdrocco/usr/include/rdma/fi_endpoint.h:263</div></div><div class=""><br class=""></div><div class="">I double-checked that the endpoint is enabled before calling fi_recv.</div><div class=""><br class=""></div><div class="">I will try 1.5.0rc1 as suggested as soon as possible.</div><br class=""><div class="">
<div style="color: rgb(0, 0, 0); letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class=""><div class="">---</div><div class="">Maurizio Drocco</div><div class="">PhD Candidate</div><div class="">University of Torino, department of Computer Science</div><div class="">Via Pessinetto 12, 10149 Torino - Italy</div></div></div>
</div>
<br class=""><div><blockquote type="cite" class=""><div class="">On 13 Jul 2017, at 19:00, Hefty, Sean &lt;<a href="mailto:sean.hefty@intel.com" class="">sean.hefty@intel.com</a>&gt; wrote:</div><br class="Apple-interchange-newline"><div class=""><div class=""><blockquote type="cite" class="">The scenario:<br class="">The code is an all-to-all network of processes, with connection-less<br class="">send/recv communication.<br class="">All addresses and services are known statically at start time.<br class="">Each process has an endpoint, to which it posts both send and recv<br class="">requests (via fi_send/fi_recv); the endpoint is created from a fabric<br class="">that is created by passing its address, its service and FI_SOURCE flag<br class="">to fi_getinfo.<br class="">Then each process fills an AV table with address/service of all the<br class="">other nodes.<br class=""><br class="">The problem:<br class="">With verbs, the code crashes on the first call to fi_recv, with the<br class="">following call stack:<br class="">fi_recv - fi_ibv_rdm_recv - fi_ibv_rdm_recvmsg -<br class="">fi_ibv_rdm_init_recv_request<br class=""><br class="">Do you have any idea about what is going on? If it helps, I can<br class="">recompile libfabric with some options for debugging.<br class=""></blockquote><br class="">Do you have a backtrace available? This sounds like a possible null pointer dereference.<br class=""><br class="">If you have access to 1.5.0rc1, you can try using the "ofi-rxm:verbs" provider combination instead of the verbs rdm support. Verbs rdm support has limited testing and specifically targets Intel MPI use.<br class=""><br class="">The only other idea I have without more details is to ensure that the endpoint has been enabled prior to posting receive buffers.<br class=""><br class="">- Sean<br class=""></div></div></blockquote></div><br class=""></body></html>