[libfabric-users] Heap use after free from completion queue data fi_cq_sread()

Arne arnestruck at astruck.de
Fri Feb 21 04:12:28 PST 2020


After completing another feature I went back to this bug, and indeed the 
timeout passed to fi_cq_sread() was too short for some data transfers in 
the tests, so the buffer I had given to the preceding fi_recv() was 
accessed by the application before libfabric was done using it.

That leaves me with two problems: one of understanding, and one of 
using libfabric and writing functionality around it.

Understanding first:

The message in question contains a fixed-size header describing the data 
entries that follow. Other messages of variable size can be sent and 
received after the header. The time libfabric needs to generate a 
completion queue entry (and therefore to complete receiving the header) 
seems to correlate with the total amount of data transferred. I did not 
find hints in the documentation or the guides that could explain this 
behaviour. Is that behaviour normal, at least for certain configurations 
with the sockets provider, or do I have another bug?

Using libfabric:

The obvious solution would be to set the timeout to -1 (if I do that, 
everything works as expected). The problem is that this would prevent a 
graceful exit of the server, since the threads handling communication 
would be stuck in that blocking call.

On the other hand, since the waiting time seems to be variable, a large 
hardcoded timeout seems like a bad idea: it can still be exceeded, and it 
makes the exit quite slow, because the timeout has to expire first.

So is there a way to signal libfabric to leave that blocking state, to 
specify an exit condition (for example via a flag), or to query the 
state of an endpoint receiving data (whether a receive is still in 
progress), so that the waiting could be done outside libfabric?

Greetings, Arne

On 10.02.20 at 20:15, Hefty, Sean wrote:

>> I am pretty new to libfabric, so it is most likely I made a mistake. I
>> just cannot find out what the problem is, so I am asking on the users
>> mailing list.
>> I got tasked with integrating libfabric into an existing university
>> project. To be precise, into the lower-level functions of the project
>> that manage data transfer, while keeping their interfaces as intact as
>> possible.
>> Since the implementation is still in the testing stage, the sockets
>> provider is used on a local machine. And since the future target servers
>> run an older OS version, it will use version 1.5.4 until the OS upgrade.
>> Now to my problem at hand:
>> I use fi_cq_sread() to find out whether data has been received for a
>> previously posted fi_recv(). Since the function is called on the server
>> side, it is possible that no data is received (the server loops over the
>> project's receive function to look for new data).
>> Data is split into 2 parts: a header which contains information about
>> the data to come, and the bigger chunks of actual data. If no header is
>> received, the calls for the actual data are skipped.
>> When bigger amounts of data are received, a segfault "heap use after
>> free" occurs regarding the completion queue entry structure in use.
>> The problem happens at the read for the data header.
>> It can be circumvented by enlarging the timeout, but you will surely
>> agree that guessing a good timing and hoping that no additional time
>> is required can't be the solution. And due to the program's structure
>> explained above it is impossible to set the timeout to -1.
>> Any idea what I did wrong? Do you need additional information?
> Trying to debug a segfault based on email is challenging.  The only guess I have is to examine the lifetime of the application structures that might be passed as the context into receive operations, or that receive completion processing will want to access.  Ensure that the structure is not re-used until the receive it is associated with completes.
> - Sean
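
The lifetime rule described in the reply above could be sketched roughly 
as follows. This is only an illustration under stated assumptions: 
struct recv_slot and the slot_* helpers are hypothetical names, not 
libfabric API. The idea is that the slot pointer is passed as the 
context argument of fi_recv() and comes back in the op_context field of 
the completion entry, and that the buffer may only be released once that 
completion has been seen. A timeout of fi_cq_sread() does not clear the 
posted flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-receive bookkeeping. */
struct recv_slot {
    void  *buf;     /* buffer handed to the receive call */
    size_t len;     /* size of buf in bytes */
    bool   posted;  /* true from posting until the completion arrives */
};

/* Allocate the buffer for a slot; returns false if malloc fails. */
static bool slot_init(struct recv_slot *s, size_t len)
{
    assert(s != NULL && len > 0);
    s->buf = malloc(len);
    s->len = len;
    s->posted = false;
    return s->buf != NULL;
}

/* Call right after the receive has been posted successfully. */
static void slot_mark_posted(struct recv_slot *s)
{
    assert(s != NULL && s->buf != NULL && !s->posted);
    s->posted = true;
}

/* Call with the context pointer taken from a completion entry.
 * Only now does the buffer belong to the application again. */
static struct recv_slot *slot_complete(void *op_context)
{
    struct recv_slot *s = op_context;
    assert(s != NULL && s->posted);
    s->posted = false;
    return s;
}

/* Free the buffer unless a receive is still outstanding on it.
 * Returns true if the buffer was released. */
static bool slot_try_release(struct recv_slot *s)
{
    assert(s != NULL);
    if (s->posted)
        return false; /* the provider may still write into buf */
    free(s->buf);
    s->buf = NULL;
    s->len = 0;
    return true;
}
```

In the header/data scheme from the original post, the receive function 
would then return "no data yet" on a timeout and leave the slot alone, 
instead of freeing or reusing it.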
