[libfabric-users] suspecting an ABA bug in libfabric somewhere

Biddiscombe, John A. biddisco at cscs.ch
Mon Mar 27 16:32:58 PDT 2017


I've spent the last few days trying to track down a bug in our code and now suspect a bug in libfabric.

The conditions are as follows ...

A memory block is allocated, registered with fi_mr_reg, and used as the local destination for a call to fi_read(blah). I receive the data that I expect without any problem.
(The memory block has address 0x00002aaad5200000 and memory descriptor 0x00002aaad423e910.)

The memory block is now deregistered and freed back to the heap. All is well.

However, I now receive another request for a block of the same size, so I allocate one from the heap and register it with fi_mr_reg. As luck would have it, I get the same memory address for the heap block (0x00002aaad5200000) and, after registration, the same memory descriptor (0x00002aaad423e910).
This time, I fill the block with 0xdeadbeef and dump its contents immediately before calling fi_read, and again immediately after I receive the read completion. It is still 0xdeadbeef.
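
A minimal sketch of the sentinel check described above, in plain C. The helper names fill_sentinel and count_sentinel_words are hypothetical and are not part of libfabric; the buffer length is assumed to be a multiple of 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SENTINEL 0xdeadbeefu

/* Fill buf with the 32-bit sentinel. len is assumed to be a multiple of 4. */
void fill_sentinel(void *buf, size_t len)
{
    assert(len % sizeof(uint32_t) == 0);
    uint32_t *p = buf;
    for (size_t i = 0; i < len / sizeof(uint32_t); i++)
        p[i] = SENTINEL;
}

/* Count the 32-bit words in buf that still hold the sentinel. */
size_t count_sentinel_words(const void *buf, size_t len)
{
    const uint32_t *p = buf;
    size_t n = 0;
    for (size_t i = 0; i < len / sizeof(uint32_t); i++)
        if (p[i] == SENTINEL)
            n++;
    return n;
}
```

Calling fill_sentinel before fi_read and count_sentinel_words after the read completion gives the number of untouched words. A count equal to len / 4 corresponds to the symptom reported here.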

It would appear that the fi_read completes successfully, but no data is transferred.

I have a sneaking suspicion that inside libfabric there is a problem related to an ABA race condition (but in this case I can reproduce it on one thread at each end of the connection) where the memory address or descriptor is being used to match some internal event and is being mis-flagged as completed when it has not.

I can verify (to a limited extent) that the bug is independent of my code by inserting a malloc of another block of the same size just before the second memory allocation from the heap, and then a free of that extra block immediately after allocating the block I actually want. This changes the memory address of the block I use in fi_read, and then the code completes without error.
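
The workaround above can be sketched as follows, assuming plain malloc/free as the heap. The function name alloc_shifted is hypothetical. Holding a dummy block of the same size guarantees only that the returned block differs from the dummy; whether it also differs from the previously freed address depends on the allocator.

```c
#include <stdlib.h>

/*
 * Allocate `size` bytes while a dummy block of the same size is held,
 * so the allocator cannot hand back the address the dummy occupies
 * (which is likely, but not guaranteed, to be the one freed last).
 */
void *alloc_shifted(size_t size)
{
    void *dummy = malloc(size);  /* likely takes the recently freed slot */
    void *block = malloc(size);  /* must live somewhere else */
    free(dummy);                 /* free(NULL) is a no-op if dummy failed */
    return block;
}
```

The returned block is released with free as usual once it has been deregistered.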

Is there any further test I can perform that might conclusively demonstrate that libfabric is at fault rather than some obscure bug in our code?

Many thanks

JB


