[libfabric-users] CQ permission denied (-EACCES)

Wed Feb 10 15:29:21 PST 2016

>So far so good.  I applied the updates within our runtime and our unit 
>tests seem much more stable with libfabric/sockets.  My fi thread tests 
>also seem happier, although I still see the occasional -FI_EAGAIN after 
>tx_size_left() says there's space.  In both my tests and runtime/app 
>usage, all RMA TX ops are currently serialized so there shouldn't be 
>anything else contending for reads/writes.  Polling for completions and 
>memory registrations can and do occur concurrently, however.

So in fi_tx_size_left(), the provider currently checks whether there is space left in the TX command queue and space left for at least one CQ entry.

If the fi_tx_size_left() call and the RMA TX operations are serialized, there should be space available in the TX command queue. However, the CQ can still fill up with completions from in-flight operations after the check and before the next op is issued. This might lead to the -FI_EAGAIN error.
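
To make that window concrete, here is a rough toy model in plain C. It is only a sketch: struct toy_ep and the toy_* functions are made-up names, not the sockets provider's code, and it assumes the post path applies the same test as the size query. It uses the standard EAGAIN value in place of FI_EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. All names here are hypothetical; this is not the
 * sockets provider's code. */
struct toy_ep {
    size_t tx_cap;    /* TX command queue capacity */
    size_t tx_used;   /* commands currently queued */
    size_t cq_cap;    /* CQ entries the app asked for */
    size_t cq_used;   /* completions written but not yet read */
    size_t inflight;  /* earlier ops whose completion is still pending */
};

/* Free TX slots; when check_cq is set, report 0 if the CQ has no room. */
static size_t toy_tx_size_left(const struct toy_ep *ep, bool check_cq)
{
    assert(ep != NULL);
    if (ep->tx_used >= ep->tx_cap)
        return 0;
    if (check_cq && ep->cq_used >= ep->cq_cap)
        return 0;
    return ep->tx_cap - ep->tx_used;
}

/* One in-flight op completes and writes a CQ entry. */
static void toy_progress(struct toy_ep *ep)
{
    assert(ep != NULL);
    if (ep->inflight > 0) {
        ep->inflight--;
        ep->cq_used++;
    }
}

/* Post one TX op; assumed to apply the same test as the size query. */
static int toy_post(struct toy_ep *ep, bool check_cq)
{
    if (toy_tx_size_left(ep, check_cq) == 0)
        return -EAGAIN;
    ep->tx_used++;
    return 0;
}
```

With cq_cap = 2, cq_used = 1 and one op in flight, toy_tx_size_left() reports space, but if toy_progress() runs before toy_post(), the post returns -EAGAIN when the CQ check is enabled and succeeds when it is not.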

The provider currently accepts new CQ entries even when the CQ is full. The only effect is that the app ends up using more CQ space than it requested initially.

As a fix, I think the provider should drop the CQ availability check from fi_tx_size_left() and fi_rx_size_left().
Sean - does this sound good to you?

>
>One issue I still see in our runtime unit tests is a condition where the 
>remote CQ data (immediate) indicates that a write has completed at the 
>target, but when I go to check the associated region the read value is 
>not as expected.  I assumed that the semantics of popping immediate data 
>guaranteed that the writing of the destination buffer has been 
>successfully completed.  Is this an incorrect assumption for libfabric 
>or is there possibly another race condition here?  I'll see if I can 
>reproduce the behavior in my test repository as well, it's occurring 
>somewhat rarely at the moment.

Yep, a reproducer will definitely help. :)

>
>- ezra
>
>On 2/10/2016 2:48 PM, Ezra Kissel wrote:
>> Great, I will test those changes soon.  Getting some form of
>> distributed, threaded test cases in fabtests would be very helpful, I
>> think.  I agree the MPI dependence is annoying, needing some general job
>> launching and allgather/barrier support instead.  Once I have some free
>> cycles again, I might look at integrating some tests myself.
>>
>> I also have a list of psm issues that have come out of this recent
>> debugging... :)
>>
>> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>>> I think the race was because of an unprotected call to get-mr-key. I
>>> was able to reproduce it occasionally.
>>>
>>> The following patch fixes it, and the test seems to run fine with it;
>>> still running in a loop :)
>>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
>>>
>>>
>>>
>>> I have also updated the tx_size_left() to check for the CQ
>>> availability. All the patches are in
>>> https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you
>>> give it a try?
>>>
>>> Btw, in the application, do you make sure that some other thread is
>>> not issuing any TX operations between fi_tx_size_left() and fi_rma*
>>> calls? Otherwise, it can still result in -FI_EAGAIN.
>>>
>>> - Jithin
>>>
>>> -----Original Message-----
>>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of
>>> Ezra Kissel <ezkissel at indiana.edu>
>>> Date: Wednesday, February 10, 2016 at 8:45 AM
>>> To: "libfabric-users at lists.openfabrics.org"
>>> <libfabric-users at lists.openfabrics.org>
>>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>>>
>>>> Starting a new thread with the issue I was originally trying to debug.
>>>> We (IU) have libfabric integrated into a task-based runtime system via
>>>> an intermediate RDMA middleware library and one of our unit tests was
>>>> often failing with the following error:
>>>>
>>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied
>>>> (-13)
>>>>
>>>> Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
>>>> specified in an associated RMA op.  I've spent a lot of time debugging
>>>> this, making sure we are passing the correct rkeys, etc.  I've finally
>>>> been able to reproduce this issue in a small distributed, threaded test
>>>> using the sockets provider again, but unfortunately in a
>>>> non-deterministic way.
>>>>
>>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
>>>>
>>>>
>>>> Our unit test that fails causes a lot of memory pressure, and lots of
>>>> concurrent memory registrations interleaved with RMA ops.  I tried to
>>>> reproduce this in the above test and it will fail with the CQ error
>>>> after some number of repeated runs.
>>>>
>>>> I get the sense that there's a race condition somewhere, and/or the
>>>> fi_mr descriptors are getting corrupted.  I can't reproduce the error
>>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>>> all our other integrated runtime unit tests that have less memory
>>>> pressure are passing OK.
>>>>
>>>> Now that I have a stripped-down test, I can try to more closely trace
>>>> the root cause but I thought I'd ask for thoughts.
>>>>
>>>> - ezra
>>>> _______________________________________________
>>>> Libfabric-users mailing list
>>>> Libfabric-users at lists.openfabrics.org
>>>> http://lists.openfabrics.org/mailman/listinfo/libfabric-users