[libfabric-users] suspecting an ABA bug in libfabric somewhere

Biddiscombe, John A. biddisco at cscs.ch
Thu Mar 30 11:05:54 PDT 2017


Just to follow up on my own query (though I suspect nobody is interested ...).

Here is a log from my code, run with FI_LOG_LEVEL=debug set.

In lines 15157-15161 you can see the memory registration entry for a block, which I fill with 0xff (only the first N bytes are dumped out).
On line 15162 I post an fi_read with the destination (local address) set to the address of the block I have just registered.

On line 15164 I receive an RMA completion event for the fi_read call. I match it back to the buffer using the context and dump out the contents, which are still 0xff.
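
A minimal sketch of why a dump of only the first N bytes can look unchanged even if the buffer was modified elsewhere. It is not taken from the application code: the buffer size, the `head_dump` helper, and the use of `zlib.crc32` are assumptions for illustration, and the CRC variant in the log may differ. It is worth noting that the two CRC32 values printed in the log do differ (0xddfd5b64 before the fi_read, 0xaafa6bf2 after), which suggests that some bytes beyond the dumped prefix may have changed.

```python
import zlib


def head_dump(buf, n_words=4):
    """Return the first n_words 8-byte words as hex strings (hypothetical helper)."""
    return [bytes(buf[i:i + 8]).hex() for i in range(0, n_words * 8, 8)]


# Hypothetical small stand-in for the 0xf0bea9-byte block in the log.
length = 0x1001
before = bytearray(b"\xff" * length)

# Simulate a transfer that only touched the final byte of the buffer.
after = bytearray(before)
after[-1] = 0x00

# The head dump is identical, but the checksum over the whole buffer is not.
same_head = head_dump(before) == head_dump(after)
crc_changed = zlib.crc32(before) != zlib.crc32(after)
print(same_head, crc_changed)
```

Comparing the whole-buffer checksum, or dumping the tail as well as the head, would show whether any part of the destination was written.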

Does the trace between these lines indicate any kind of error on my part, or anything else that could shed light on why I am receiving a completion for an event that does not seem to have really happened?
I receive no errors from libfabric, and the trace shows no evidence of any, though a transfer of 1 byte does appear in it.
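
The 1-byte transfer in the trace can be located with a little address arithmetic. The sketch below uses the addresses and length printed on log line 15162 and the "writing 1 bytes to 0x2aaad7b0bea8" message. The 4-byte alignment value is an assumption about how the GNI provider splits an unaligned GET into an aligned body plus head and tail fragments; the variable names are illustrative only.

```python
# Values copied from the fi_read log line (15162) and the provider trace.
local_addr = 0x00002AAAD6C00000
remote_addr = 0x00002AAAD8C00000
length = 0xF0BEA9
one_byte_write_addr = 0x2AAAD7B0BEA8  # "writing 1 bytes to ..."

# Assumption: GETs must be 4-byte aligned, so unaligned edges are copied separately.
ALIGN = 4

# Address of the last byte of the local destination buffer.
last_byte = local_addr + length - 1

# Bytes that fall outside the aligned body at each end of the remote range.
head_bytes = (-remote_addr) % ALIGN
tail_bytes = (remote_addr + length) % ALIGN

print(hex(last_byte), head_bytes, tail_bytes)
print(last_byte == one_byte_write_addr)
```

Under that assumption the 1-byte write is the unaligned tail of the request landing on the final byte of the destination buffer, which says nothing about whether the aligned body of the transfer arrived.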

(Looking at the trace from the other node, I cannot see anything that corresponds to this event. We would not expect the other node to 'know' that a remote read had taken place, but nothing in its log gives me a clue either.)

Many thanks for any insight - I'm totally at a loss as to what might have gone wrong.

JB

15157: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 OK registering fi_mr_reg 0x00002aaad6c00000 0x00002aaad6c00000 desc 0x00002aaad3cece90 rkey 0x00003362aaad1ecd length 0xf0bea9
15158: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 allocated/registered memory region 0x00002aaad3cefed0 with desc 0x2aaad3cece90 at address 0x00002aaad6c00000 with length 0xf0bea9
15159: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 Allocating temp region region 0x00002aaad3cefed0 addr 0x00002aaad6c00000 length 0xf0bea9 temp regions 2
15160: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 Popping Block buffer 0x00002aaad6c00000  region 0x00002aaad3cefed0  size 0xf0bea9  chunksize 0x000400  0x001000  0x010000  0x100000  free (t) 1024  used 0 free (s) 1280  used 768 free (m) 128  used 0 free (l) 16  used 0
15161: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 Memory: address 0x00002aaad6c00000 length 0x00f0bea9 CRC32: 0xddfd5b64 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 
0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff : (RDMA message region (pre-fi_read))
15162: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 receiver 0x00002aaab5e3af80 RDMA Get fi_read message :client 0x00002aaab5e8b300 fi_addr 0x006e200200002371 local addr 0x00002aaad6c00000 local desc 0x00002aaad3cece90 remote addr 0x00002aaad8c00000 rkey 0x00003362aaad3111 size 0xf0bea9
libfabric:gni:ep_data:_gnix_rma():1382<info> [5701:2] Using CT for unaligned GET, req: 0x2aaab6bd0280
libfabric:gni:ep_ctrl:_gnix_vc_ep_get_vc():2151<trace> [5701:2]
libfabric:gni:ep_ctrl:__gnix_vc_get_vc_by_fi_addr():218<trace> [5701:2]
15163: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 ### Exit  handle_recv
libfabric:gni:ep_data:__gnix_rma_copy_chained_get_data():159<info> [5701:2] writing 1 bytes to 0x2aaad7b0bea8
libfabric:gni:ep_data:__gnix_rma_txd_complete():469<info> [5701:2] Received first RDMA chain TXD, req: 0x2aaab6bd0280
libfabric:gni:ep_ctrl:_gnix_dgram_poll():436<trace> [5701:2]
libfabric:gni:ep_ctrl:_gnix_dgram_poll():436<trace> [5701:2]
libfabric:gni:ep_ctrl:__gnix_nic_next_pending_vc():2049<info> [5701:2] Dequeued progress VC (0x2aaab6bf0310)
libfabric:gni:ep_data:_gnix_vc_dequeue_smsg():1734<trace> [5701:2]
libfabric:gni:ep_data:__gnix_rma_send_data_req():381<info> [5701:2] Sent RMA CQ data, req: 0x2aaab6bd0280
libfabric:gni:ep_data:__gnix_vc_push_work_reqs():1865<info> [5701:2] Request processed: 0x2aaab6bd0280
libfabric:gni:ep_ctrl:__gnix_nic_next_pending_vc():2049<info> [5701:2] Dequeued progress VC (0x2aaab6bf0310)
libfabric:gni:ep_data:_gnix_vc_dequeue_smsg():1734<trace> [5701:2]
15164: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 Completion txcq wr_id FI_RMA, FI_READ (260 ) context 0x00002aaab5e3af80 length 0x00000000
15165: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 Received a txcq RMA completion
15166: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 *** Enter handle_rma_read_completion
15167: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 receiver 0x00002aaab5e3af80 all RMA regions now read
15168: <debug> 0x00002aaad3c29f20 0x2aaad3a00700 receiver 0x00002aaab5e3af80 No piggy_back RDMA message region 0x00002aaad3cefed0 address 0x00002aaad6c00000 length 0x00f0bea9
15169: <trace> 0x00002aaad3c29f20 0x2aaad3a00700 Memory: address 0x00002aaad6c00000 length 0x00f0bea9 CRC32: 0xaafa6bf2 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 
0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff 0xffffffffffffffff : Message region (recv rdma)

