<html dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" id="owaParaStyle">P {margin-top:0;margin-bottom:0;}</style>
</head>
<body fpstyle="1" ocsi="0">
<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">I've spent the last few days trying to track down a bug in our code and I now suspect a bug in libfabric.<br>
<br>
The conditions are as follows ...<br>
<br>
A memory block is allocated, registered with fi_mr_reg and used as the local destination for a call to fi_read, and I receive the data that I expect without any problem.<br>
(The memory block has address 0x00002aaad5200000 and memory descriptor 0x00002aaad423e910.)<br>
<br>
The memory block is now deregistered and freed back to the heap. All is well.<br>
<br>
However, I now receive another request for a block of the same size, so I allocate one from the heap and register it with fi_mr_reg. As luck would have it, I get the same memory address for the heap block (0x00002aaad5200000) and, after registration, the same memory descriptor (0x00002aaad423e910).<br>
This time I fill the block with 0xdeadbeef and dump its contents immediately before calling fi_read, then dump it again immediately after I receive the read completion. It is still 0xdeadbeef.<br>
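<br>
For reference, the sentinel check described above can be sketched roughly as follows. This is a minimal sketch, not the code from our application: fill_sentinel and count_changed are hypothetical helper names, and the sketch assumes the pattern is written as the repeating byte sequence de ad be ef. It makes no libfabric calls.<br>
<pre>
```c
/* Hypothetical helpers for illustration only; not part of libfabric. */

/* Assumed sentinel: the repeating byte sequence de ad be ef. */
const unsigned char SENTINEL[4] = { 0xde, 0xad, 0xbe, 0xef };

/* Fill the buffer with the repeating sentinel pattern. */
void fill_sentinel(unsigned char *buf, unsigned long len)
{
    unsigned long i;
    for (i = 0; i != len; ++i)
        buf[i] = SENTINEL[i % 4];
}

/* Count the bytes that no longer match the sentinel pattern.
 * A result of 0 after a read completion means the buffer looks untouched. */
unsigned long count_changed(const unsigned char *buf, unsigned long len)
{
    unsigned long i, changed = 0;
    for (i = 0; i != len; ++i)
        if (buf[i] != SENTINEL[i % 4])
            ++changed;
    return changed;
}
```
</pre>
The idea is to call fill_sentinel just before posting the read and count_changed just after the completion arrives. A count of zero would only be conclusive if the remote data is known to differ from the pattern.<br>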
<br>
It would appear that the fi_read completes successfully, but no data is transferred.<br>
<br>
I have a sneaking suspicion that inside libfabric there is a problem related to an ABA race condition (but in this case I can reproduce it on one thread at each end of the connection) where the memory address or descriptor is being used to match some internal
event and is being mis-flagged as completed when it has not.<br>
<br>
I can verify (to a limited extent) that the bug is independent of my code by inserting a malloc of another block of the same size just before the second allocation from the heap, and freeing that extra block immediately after allocating the block I actually want. This changes the memory address of the block I use in fi_read, and the code then completes without error.<br>
<br>
Is there any further test I can perform that might conclusively demonstrate that libfabric is at fault rather than some obscure bug in our code?<br>
<br>
Many thanks<br>
<br>
JB<br>
</div>
</body>
</html>