[libfabric-users] CQ permission denied (-EACCES)

Ezra Kissel ezkissel at indiana.edu
Wed Feb 10 14:45:58 PST 2016


So far so good.  I applied the updates within our runtime and our unit 
tests seem much more stable with libfabric/sockets.  My fi thread tests 
also seem happier, although I still see the occasional -FI_EAGAIN after 
tx_size_left() says there's space.  In both my tests and runtime/app 
usage, all RMA TX ops are currently serialized so there shouldn't be 
anything else contending for reads/writes.  Polling for completions and 
memory registrations can and do occur concurrently, however.
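
For reference, here is a minimal sketch of how the space check can be treated as a hint and -FI_EAGAIN handled at the post itself. It is not code from our runtime or from libfabric: post_fn, progress_fn, post_with_retry and the mock queue are hypothetical names, and the mock returns plain -EAGAIN as a stand-in for -FI_EAGAIN. In real use, post() would wrap the RMA call and progress() would poll the CQ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callbacks standing in for the real post and CQ-poll calls. */
typedef int (*post_fn)(void *ctx);      /* 0, -EAGAIN, or another negative errno */
typedef void (*progress_fn)(void *ctx); /* drain completions to free queue space */

/* Retry the post while it reports -EAGAIN, driving progress between tries.
 * Any earlier "space left" answer is treated as advisory only. */
static int post_with_retry(post_fn post, progress_fn progress,
                           void *ctx, size_t max_tries)
{
    assert(post != NULL);
    for (size_t i = 0; i < max_tries; i++) {
        int rc = post(ctx);
        if (rc != -EAGAIN)
            return rc;          /* success, or an error that is not retryable */
        if (progress)
            progress(ctx);      /* give the queue a chance to drain */
    }
    return -EAGAIN;             /* still busy after max_tries attempts */
}

/* Mock queue (assumption, for illustration): refuses posts until it has
 * been polled busy_polls times. */
struct mock_queue {
    int busy_polls;
    int posted;
};

static int mock_post(void *ctx)
{
    struct mock_queue *q = ctx;
    if (q->busy_polls > 0)
        return -EAGAIN;
    q->posted++;
    return 0;
}

static void mock_progress(void *ctx)
{
    struct mock_queue *q = ctx;
    if (q->busy_polls > 0)
        q->busy_polls--;
}
```

Bounding the retries with max_tries keeps a stuck queue from turning into an infinite loop; the caller decides what to do when -EAGAIN is still returned.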

One issue I still see in our runtime unit tests is a condition where the 
remote CQ data (immediate) indicates that a write has completed at the 
target, but when I go to check the associated region the read value is 
not as expected.  I assumed that the semantics of popping immediate data 
guaranteed that the writing of the destination buffer has been 
successfully completed.  Is this an incorrect assumption for libfabric 
or is there possibly another race condition here?  I'll see if I can 
reproduce the behavior in my test repository as well; it's occurring 
somewhat rarely at the moment.

- ezra

On 2/10/2016 2:48 PM, Ezra Kissel wrote:
> Great, I will test those changes soon.  Getting some form of
> distributed, threaded test cases in fabtests would be very helpful, I
> think.  I agree the MPI dependence is annoying; what's needed instead is
> general job launching and allgather/barrier support.  Once I have some free
> cycles again, I might look at integrating some tests myself.
>
> I also have a list of psm issues that have come out of this recent
> debugging... :)
>
> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>> I think the race was because of an unprotected call to get-mr-key. I
>> was able to reproduce it occasionally.
>>
>> The following patch fixes it, and the test seems to run fine with it;
>> still running in a loop :)
>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
>>
>> I have also updated the tx_size_left() to check for the CQ
>> availability. All the patches are in
>> https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you
>> give it a try?
>>
>> Btw, in the application, do you make sure that some other thread is
>> not issuing any TX operations between fi_tx_size_left() and fi_rma*
>> calls? Otherwise, it can still result in -FI_EAGAIN.
>>
>> - Jithin
>>
>> -----Original Message-----
>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of
>> Ezra Kissel <ezkissel at indiana.edu>
>> Date: Wednesday, February 10, 2016 at 8:45 AM
>> To: "libfabric-users at lists.openfabrics.org"
>> <libfabric-users at lists.openfabrics.org>
>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>>
>>> Starting a new thread with the issue I was originally trying to debug.
>>> We (IU) have libfabric integrated into a task-based runtime system via
>>> an intermediate RDMA middleware library and one of our unit tests was
>>> often failing with the following error:
>>>
>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
>>>
>>> Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
>>> specified in an associated RMA op.  I've spent a lot of time debugging
>>> this, making sure we are passing the correct rkeys, etc.  I've finally
>>> been able to reproduce this issue in a small distributed, threaded test
>>> using the sockets provider again, but unfortunately in a
>>> non-deterministic way.
>>>
>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
>>>
>>> The failing unit test creates a lot of memory pressure, with many
>>> concurrent memory registrations interleaved with RMA ops.  I tried to
>>> reproduce this in the above test and it will fail with the CQ error
>>> after some number of repeated runs.
>>>
>>> I get the sense that there's a race condition somewhere, and/or the
>>> fi_mr descriptors are getting corrupted.  I can't reproduce the error
>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>> all our other integrated runtime unit tests that have less memory
>>> pressure are passing OK.
>>>
>>> Now that I have a stripped-down test, I can try to more closely trace
>>> the root cause but I thought I'd ask for thoughts.
>>>
>>> - ezra
>>> _______________________________________________
>>> Libfabric-users mailing list
>>> Libfabric-users at lists.openfabrics.org
>>> http://lists.openfabrics.org/mailman/listinfo/libfabric-users