[libfabric-users] suspecting an ABA bug in libfabric somewhere
Biddiscombe, John A.
biddisco at cscs.ch
Mon Mar 27 16:32:58 PDT 2017
I've spent the last few days trying to track down a bug in our code and I now suspect a bug in libfabric.
The conditions are as follows ...
A memory block is allocated, registered with fi_mr_reg and used as the local destination for a call to fi_read. I receive the data that I expect without any problem.
(The memory block has address 0x00002aaad5200000 and memory descriptor 0x00002aaad423e910.)
The memory block is now deregistered and freed back to the heap. All is well.
However, I now receive another request for a block of the same size. I allocate one from the heap and register it with fi_mr_reg. As luck would have it, I get the same memory address for the heap block (0x00002aaad5200000) and, after registration, the same memory descriptor (0x00002aaad423e910).
This time I fill the block with 0xdeadbeef and dump its contents immediately before calling fi_read, then dump it again immediately after I receive the read completion. It is still 0xdeadbeef.
It would appear that the fi_read completes successfully, but no data is transferred.
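The fill-and-dump check described above can be automated rather than inspected by eye. What follows is a minimal sketch in plain C, with no libfabric calls; the names fill_sentinel and count_overwritten are hypothetical helpers, not part of any library, and the buffer is assumed to be a whole number of 32-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* The pattern written into the buffer before the read is posted. */
#define SENTINEL UINT32_C(0xdeadbeef)

/* Fill `words` 32-bit words with the sentinel pattern. */
void fill_sentinel(uint32_t *buf, size_t words)
{
    for (size_t i = 0; i < words; ++i)
        buf[i] = SENTINEL;
}

/* Count the words that no longer hold the sentinel.
 * A result of 0 after a read completion means nothing was written
 * (or, improbably, the payload itself consisted only of the sentinel). */
size_t count_overwritten(const uint32_t *buf, size_t words)
{
    size_t changed = 0;
    for (size_t i = 0; i < words; ++i)
        if (buf[i] != SENTINEL)
            ++changed;
    return changed;
}
```

Calling fill_sentinel before posting the read and count_overwritten after the completion arrives gives a single number to log per transfer, which makes it easier to correlate failures with reused addresses.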
I have a sneaking suspicion that inside libfabric there is a problem related to an ABA race condition, where the memory address or descriptor is being used to match some internal event and is being mis-flagged as completed when it has not. (In this case, though, I can reproduce it with one thread at each end of the connection.)
I can verify (to a limited extent) that the bug is independent of my code by inserting a malloc just before the second memory allocation from the heap to get another block of the same size, and then a free immediately after allocating the block I actually want. This changes the memory address of the block I use in fi_read, and then the code completes without error.
Is there any further test I can perform that might conclusively demonstrate that libfabric is at fault rather than some obscure bug in our code?
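To make the suspected failure mode concrete, here is a toy illustration of an ABA-style mismatch. It is not libfabric code and makes no claim about how libfabric tracks operations; struct op_record, matches_by_addr and matches_by_addr_gen are all hypothetical. It only shows why a lookup keyed on an address alone cannot tell a stale record from a new operation on a reused address, and how a generation counter would disambiguate the two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping record for one posted operation. */
struct op_record {
    uint64_t addr;  /* buffer address used as the lookup key      */
    uint64_t gen;   /* generation number, bumped on every new post */
    bool     done;  /* set when the operation has completed        */
};

/* Match on address only: a stale, completed record for a reused
 * address is indistinguishable from the new operation. */
bool matches_by_addr(const struct op_record *rec, uint64_t addr)
{
    return rec->done && rec->addr == addr;
}

/* Match on address and generation: the stale record is rejected
 * because its generation differs from the new operation's. */
bool matches_by_addr_gen(const struct op_record *rec,
                         uint64_t addr, uint64_t gen)
{
    return rec->done && rec->addr == addr && rec->gen == gen;
}
```

With a completed record left over from the first read (generation 1), a second read posted on the same address (generation 2) would be reported as complete by the address-only match and correctly left pending by the address-plus-generation match.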
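The displacement trick above depends on the allocator handing the old address to the decoy block, which is likely but not guaranteed. A minimal sketch of a variant that guarantees a different address is given below, using only the C standard library; alloc_avoiding is a hypothetical helper. The old address should be saved as a uintptr_t before the original block is freed, since a pointer value should not be inspected after free.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate `size` bytes (size > 0) at any address other than `avoid`.
 * Returns NULL only if malloc itself fails. */
void *alloc_avoiding(size_t size, uintptr_t avoid)
{
    void *p = malloc(size);
    if (p == NULL || (uintptr_t)p != avoid)
        return p;

    /* p landed on the address we want to avoid. Keep it live as a
     * decoy so the next allocation cannot be placed at the same
     * address, then release the decoy. */
    void *q = malloc(size);
    free(p);
    return q;
}
```

This only controls the heap address. Whether the memory descriptor returned by the subsequent registration also changes is outside the scope of this sketch, so logging both values for every transfer remains useful when narrowing down which of the two triggers the failure.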