[libfabric-users] CQ permission denied (-EACCES)

Jose, Jithin jithin.jose at intel.com
Wed Feb 10 16:55:20 PST 2016


Ezra,

Can you try the benchmark with the following branch to see if we still hit the EAGAIN error?
https://github.com/jithinjosepkl/libfabric/tree/pr/sockets


Is there a specific option that we need to pass to the benchmark to trigger the EAGAIN error? I tried to reproduce it with 2 processes and default arguments, but the error never showed up. Please let me know.

- Jithin






-----Original Message-----
From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of Jithin Jose <jithin.jose at intel.com>
Date: Wednesday, February 10, 2016 at 3:29 PM
To: Ezra Kissel <ezkissel at indiana.edu>, "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
Subject: Re: [libfabric-users] CQ permission denied (-EACCES)

>
>>So far so good.  I applied the updates within our runtime and our unit 
>>tests seem much more stable with libfabric/sockets.  My fi thread tests 
>>also seem happier, although I still see the occasional -FI_EAGAIN after 
>>tx_size_left() says there's space.  In both my tests and runtime/app 
>>usage, all RMA TX ops are currently serialized so there shouldn't be 
>>anything else contending for reads/writes.  Polling for completions and 
>>memory registrations can and do occur concurrently, however.
>
>So in tx_size_left(), the provider currently checks whether there is space left in the TX command queue and space left for at least one CQ entry.
>
>If the tx_size_left() and RMA TX operations are serialized, there should be space available in the TX command queue. However, the CQ space could fill up because of in-flight operations, which might lead to the -FI_EAGAIN error.
>
>The provider currently accepts new CQ entries even when the CQ is full. The only effect is that the app ends up using more CQ space than it initially requested.
>
>As a fix, I think the provider should drop the CQ availability check from tx/rx_size_left().
>Sean - does this sound good to you?
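
To make the failure mode above concrete, here is a minimal sketch of a caller that retries a post when it is rejected with -EAGAIN and reaps completions between attempts. It is a toy model, not libfabric code: post_with_retry, toy_queue, toy_post and toy_progress are hypothetical names, a single counter stands in for both the TX queue and the CQ, and plain -EAGAIN from errno.h stands in for -FI_EAGAIN. In a real application the two callbacks would presumably wrap the RMA call and the CQ read.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback types; stand-ins for the real calls. */
typedef int (*post_fn)(void *ctx);     /* returns 0 or a negative errno */
typedef int (*progress_fn)(void *ctx); /* returns number of entries reaped */

/* Retry a post while it reports -EAGAIN, driving progress between tries.
 * Returns 0 on success, any other error from post() immediately, or
 * -EAGAIN if all max_tries attempts were rejected. */
static int post_with_retry(post_fn post, progress_fn progress,
                           void *ctx, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        int ret = post(ctx);
        if (ret != -EAGAIN)
            return ret;
        progress(ctx); /* reap completions to free up space */
    }
    return -EAGAIN;
}

/* Toy queue (assumption): one counter models all space in use. */
struct toy_queue {
    int capacity;
    int in_use;
};

/* Take one slot, or report -EAGAIN when the queue is full. */
static int toy_post(void *ctx)
{
    struct toy_queue *q = ctx;
    if (q->in_use >= q->capacity)
        return -EAGAIN;
    q->in_use++;
    return 0;
}

/* Release one slot, as if one completion had been read. */
static int toy_progress(void *ctx)
{
    struct toy_queue *q = ctx;
    if (q->in_use == 0)
        return 0;
    q->in_use--;
    return 1;
}
```

With a full toy queue (capacity 2, in_use 2), a bare toy_post() is rejected, while post_with_retry() reaps one entry and then succeeds on the second attempt. The retry bound keeps a stuck queue from spinning forever.
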
>
>>
>>One issue I still see in our runtime unit tests is a condition where the 
>>remote CQ data (immediate) indicates that a write has completed at the 
>>target, but when I go to check the associated region the read value is 
>>not as expected.  I assumed that the semantics of popping immediate data 
>>guaranteed that the writing of the destination buffer has been 
>>successfully completed.  Is this an incorrect assumption for libfabric 
>>or is there possibly another race condition here?  I'll see if I can 
>>reproduce the behavior in my test repository as well; it's occurring 
>>somewhat rarely at the moment.
>
>Yep, a reproducer will definitely help. :)
>
>>
>>- ezra
>>
>>On 2/10/2016 2:48 PM, Ezra Kissel wrote:
>>> Great, I will test those changes soon.  Getting some form of
>>> distributed, threaded test cases in fabtests would be very helpful, I
>>> think.  I agree the MPI dependence is annoying, needing some general job
>>> launching and allgather/barrier support instead.  Once I have some free
>>> cycles again, I might look at integrating some tests myself.
>>>
>>> I also have a list of psm issues that have come out of this recent
>>> debugging... :)
>>>
>>> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>>>> I think the race was because of an unprotected call to get-mr-key. I
>>>> was able to reproduce it occasionally.
>>>>
>>>> The following patch fixes it, and the test seems to run fine with it;
>>>> still running in a loop :)
>>>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
>>>>
>>>>
>>>>
>>>> I have also updated the tx_size_left() to check for the CQ
>>>> availability. All the patches are in
>>>> https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you
>>>> give it a try?
>>>>
>>>> Btw, in the application, do you make sure that some other thread is
>>>> not issuing any TX operations between fi_tx_size_left() and fi_rma*
>>>> calls? Otherwise, it can still result in -FI_EAGAIN.
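
One way to rule out the window described above is to hold a single lock across both the space check and the post, so that no other thread can consume the slot in between. The sketch below only models the idea: tx_model, model_size_left, model_post and checked_post are hypothetical names rather than libfabric types or calls, and -EAGAIN from errno.h stands in for -FI_EAGAIN. It addresses only the thread race, not any other reason a provider might return the error.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical model of a transmit queue shared between threads. */
struct tx_model {
    pthread_mutex_t lock;
    int capacity;
    int in_use;
};

/* Models the "how much space is left" query. */
static int model_size_left(const struct tx_model *tx)
{
    return tx->capacity - tx->in_use;
}

/* Models posting one operation; fails when no slot is free. */
static int model_post(struct tx_model *tx)
{
    if (tx->in_use >= tx->capacity)
        return -EAGAIN;
    tx->in_use++;
    return 0;
}

/* Check and post under one lock, so the answer from the check is
 * still valid when the post runs. Every thread that posts must go
 * through this function for the guarantee to hold. */
static int checked_post(struct tx_model *tx)
{
    int ret = -EAGAIN;

    pthread_mutex_lock(&tx->lock);
    if (model_size_left(tx) > 0)
        ret = model_post(tx);
    pthread_mutex_unlock(&tx->lock);
    return ret;
}
```

With a one-slot model, the first checked_post() succeeds and the second reports -EAGAIN without ever attempting the post.
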
>>>>
>>>> - Jithin
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of
>>>> Ezra Kissel <ezkissel at indiana.edu>
>>>> Date: Wednesday, February 10, 2016 at 8:45 AM
>>>> To: "libfabric-users at lists.openfabrics.org"
>>>> <libfabric-users at lists.openfabrics.org>
>>>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>>>>
>>>>> Starting a new thread with the issue I was originally trying to debug.
>>>>> We (IU) have libfabric integrated into a task-based runtime system via
>>>>> an intermediate RDMA middleware library and one of our unit tests was
>>>>> often failing with the following error:
>>>>>
>>>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
>>>>>
>>>>> Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
>>>>> specified in an associated RMA op.  I've spent a lot of time debugging
>>>>> this, making sure we are passing the correct rkeys, etc.  I've finally
>>>>> been able to reproduce this issue in a small distributed, threaded test
>>>>> using the sockets provider again, but unfortunately in a
>>>>> non-deterministic way.
>>>>>
>>>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
>>>>>
>>>>>
>>>>> Our unit test that fails causes a lot of memory pressure, and lots of
>>>>> concurrent memory registrations interleaved with RMA ops.  I tried to
>>>>> reproduce this in the above test and it will fail with the CQ error
>>>>> after some number of repeated runs.
>>>>>
>>>>> I get the sense that there's a race condition somewhere, and/or the
>>>>> fi_mr descriptors are getting corrupted.  I can't reproduce the error
>>>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>>>> all our other integrated runtime unit tests that have less memory
>>>>> pressure are passing OK.
>>>>>
>>>>> Now that I have a stripped-down test, I can try to more closely trace
>>>>> the root cause but I thought I'd ask for thoughts.
>>>>>
>>>>> - ezra
>>>>> _______________________________________________
>>>>> Libfabric-users mailing list
>>>>> Libfabric-users at lists.openfabrics.org
>>>>> http://lists.openfabrics.org/mailman/listinfo/libfabric-users

