[libfabric-users] CQ permission denied (-EACCES)
Ezra Kissel
ezkissel at indiana.edu
Wed Feb 10 20:54:19 PST 2016
Looks like your latest patches cleared up the EAGAIN error for me. I am
not seeing the immediate data issue any longer either; I'm not sure if
that's related or if your updates are just masking it for me now. I'm
also not discounting a bug somewhere in our runtime's circular buffers.
I did create a test that checks the value in the destination buffer
after popping immediate data from the CQ. I haven't been able to force
any value assertion failures, so that's good.
https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_rcq.c
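For reference, the check boils down to something like the following (a
rough sketch of the pattern, not the exact code in the repo; rx_cq,
rbufs, NRBUFS, and VALUE stand in for the test's own names):

    /* Sketch: pop a completion carrying remote CQ data off the RX CQ,
     * then assert on the buffer the immediate value points at. */
    struct fi_cq_data_entry comp;
    ssize_t ret;

    do {
        ret = fi_cq_read(rx_cq, &comp, 1);
    } while (ret == -FI_EAGAIN);
    assert(ret == 1);

    if (comp.flags & FI_REMOTE_CQ_DATA) {
        uint64_t idx = comp.data % NRBUFS;  /* immediate data selects the buffer */
        assert(((uint64_t *)rbufs)[idx] == VALUE);
    }
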
I did notice that when the above test attempts a TX op targeting a
remote address beyond the allocated buffer, the test simply hangs
instead of generating any error. You can recreate this by reducing
NRBUFS and commenting out the assert on line 205. If I blatantly
increase the remote address offset at the start, I get the expected
permission-denied CQ error, but somewhere along the line the MR bounds
checking seems to fail, or else that bad CQ error event never gets
popped.
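If it helps, the error path I would expect to fire in that case looks
roughly like this (sketch only; tx_cq is a placeholder name for the
transmit CQ in the test):

    /* Sketch: when fi_cq_read reports -FI_EAVAIL, drain the error entry
     * so the bad completion doesn't sit on the CQ forever. */
    struct fi_cq_data_entry comp;
    struct fi_cq_err_entry err = {0};
    ssize_t ret = fi_cq_read(tx_cq, &comp, 1);

    if (ret == -FI_EAVAIL) {
        fi_cq_readerr(tx_cq, &err, 0);
        fprintf(stderr, "CQ error: %s (prov_errno %d: %s)\n",
                fi_strerror(err.err), err.prov_errno,
                fi_cq_strerror(tx_cq, err.prov_errno, err.err_data, NULL, 0));
        /* An RMA op past the registered region should surface here as
         * FI_EACCES rather than hanging. */
    }
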
- ezra
On 2/10/2016 7:55 PM, Jose, Jithin wrote:
> Ezra,
>
> Can you try the benchmark with the following branch to see if we still hit the EAGAIN error?
> https://github.com/jithinjosepkl/libfabric/tree/pr/sockets
>
>
> Is there a specific option that we need to pass to the benchmark to cause the EAGAIN error? I tried to reproduce it with 2 processes and default arguments, but the error never showed up. Please let me know.
>
> - Jithin
>
> -----Original Message-----
> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of Jithin Jose <jithin.jose at intel.com>
> Date: Wednesday, February 10, 2016 at 3:29 PM
> To: Ezra Kissel <ezkissel at indiana.edu>, "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
> Subject: Re: [libfabric-users] CQ permission denied (-EACCES)
>
>>
>>> So far so good. I applied the updates within our runtime and our unit
>>> tests seem much more stable with libfabric/sockets. My fi thread tests
>>> also seem happier, although I still see the occasional -FI_EAGAIN after
>>> tx_size_left() says there's space. In both my tests and runtime/app
>>> usage, all RMA TX ops are currently serialized so there shouldn't be
>>> anything else contending for reads/writes. Polling for completions and
>>> memory registrations can and do occur concurrently, however.
>>
>> So in tx-size-left(), the provider currently checks whether there is space left in the TX command queue, and space left for at least one CQ entry.
>>
>> If tx-size-left() and the RMA TX operations are serialized, there should be space available in the TX command queue. But the CQ space could still get filled up by in-flight operations, and that might lead to the -FI_EAGAIN error.
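>> In that case the usual handling on the app side is to reap a
>> completion and retry the op, roughly like this (just a sketch; ep,
>> tx_cq, and the buffer arguments are placeholder names):
>>
>>     /* Sketch: retry an RMA write on -FI_EAGAIN, reaping completions in
>>      * between so CQ space frees up. */
>>     ssize_t ret;
>>     do {
>>         ret = fi_write(ep, buf, len, fi_mr_desc(mr),
>>                        dest_addr, remote_offset, rkey, context);
>>         if (ret == -FI_EAGAIN) {
>>             struct fi_cq_data_entry comp;
>>             fi_cq_read(tx_cq, &comp, 1);  /* make progress / free a CQ slot */
>>         }
>>     } while (ret == -FI_EAGAIN);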
>>
>> The provider currently handles new CQ entries even when the CQ is full; it's just that the app ends up using more CQ space than what was requested initially.
>>
>> As a fix, I think the provider should avoid the check for CQ availability in tx/rx-size-left().
>> Sean - does this sound good to you?
>>
>>>
>>> One issue I still see in our runtime unit tests is a condition where the
>>> remote CQ data (immediate) indicates that a write has completed at the
>>> target, but when I go to check the associated region the read value is
>>> not as expected. I assumed that the semantics of popping immediate data
>>> guaranteed that the write to the destination buffer had completed
>>> successfully. Is this an incorrect assumption for libfabric, or is
>>> there possibly another race condition here? I'll see if I can
>>> reproduce the behavior in my test repository as well; it's occurring
>>> somewhat rarely at the moment.
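>>> For context, the initiator side of that pattern is essentially an
>>> fi_writedata() carrying the immediate value, something like this
>>> (sketch; the names are placeholders, not our runtime's code):
>>>
>>>     /* Sketch: the immediate value rides with the write; the assumption
>>>      * is that once the target pops the FI_REMOTE_CQ_DATA completion,
>>>      * the payload is already in place. */
>>>     ssize_t ret = fi_writedata(ep, src_buf, len, fi_mr_desc(local_mr),
>>>                                imm_value,   /* delivered as remote CQ data */
>>>                                dest_addr, remote_offset, rkey, context);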
>>
>> Yep, a reproducer will definitely help. :)
>>
>>>
>>> - ezra
>>>
>>> On 2/10/2016 2:48 PM, Ezra Kissel wrote:
>>>> Great, I will test those changes soon. Getting some distributed,
>>>> threaded test cases into fabtests would be very helpful, I think. I
>>>> agree the MPI dependence is annoying; it needs some general job
>>>> launching and allgather/barrier support instead. Once I have some free
>>>> cycles again, I might look at integrating some tests myself.
>>>>
>>>> I also have a list of psm issues that have come out of this recent
>>>> debugging... :)
>>>>
>>>> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>>>>> I think the race was because of an unprotected call to get-mr-key. I
>>>>> was able to reproduce it occasionally.
>>>>>
>>>>> The following patch fixes it, and the test seems to run fine with it;
>>>>> still running in a loop :)
>>>>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
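>>>>> The general shape of that kind of fix is just serializing the shared
>>>>> key lookup, e.g. (a generic illustration of the race class, not the
>>>>> contents of the patch; mr_lock and lookup_mr_key are made-up names):
>>>>>
>>>>>     /* Illustration only: reading a shared, mutable MR key table
>>>>>      * without the lock can hand back a stale or half-updated key. */
>>>>>     pthread_mutex_lock(&mr_lock);
>>>>>     uint64_t key = lookup_mr_key(mr_table, addr);
>>>>>     pthread_mutex_unlock(&mr_lock);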
>>>>>
>>>>>
>>>>>
>>>>> I have also updated tx_size_left() to check for CQ
>>>>> availability. All the patches are in
>>>>> https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you
>>>>> give it a try?
>>>>>
>>>>> Btw, in the application, do you make sure that some other thread is
>>>>> not issuing any TX operations between fi_tx_size_left() and fi_rma*
>>>>> calls? Otherwise, it can still result in -FI_EAGAIN.
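>>>>> In other words, the space check and the RMA call need to be atomic
>>>>> with respect to the other TX threads, roughly (sketch; tx_lock and
>>>>> the other names are placeholders):
>>>>>
>>>>>     /* Sketch: keep the space check and the submit under one lock so
>>>>>      * no other thread can consume the TX slot in between. */
>>>>>     pthread_mutex_lock(&tx_lock);
>>>>>     if (fi_tx_size_left(ep) > 0)
>>>>>         ret = fi_write(ep, buf, len, fi_mr_desc(mr),
>>>>>                        dest_addr, remote_offset, rkey, context);
>>>>>     pthread_mutex_unlock(&tx_lock);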
>>>>>
>>>>> - Jithin
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of
>>>>> Ezra Kissel <ezkissel at indiana.edu>
>>>>> Date: Wednesday, February 10, 2016 at 8:45 AM
>>>>> To: "libfabric-users at lists.openfabrics.org"
>>>>> <libfabric-users at lists.openfabrics.org>
>>>>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>>>>>
>>>>>> Starting a new thread with the issue I was originally trying to debug.
>>>>>> We (IU) have libfabric integrated into a task-based runtime system via
>>>>>> an intermediate RDMA middleware library, and one of our unit tests was
>>>>>> often failing with the following error:
>>>>>>
>>>>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>>>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
>>>>>>
>>>>>> Obviously, -EACCES is an appropriate CQ error if an incorrect rkey is
>>>>>> specified in an associated RMA op. I've spent a lot of time debugging
>>>>>> this, making sure we are passing the correct rkeys, etc. I've finally
>>>>>> been able to reproduce this issue in a small distributed, threaded test
>>>>>> using the sockets provider again, but unfortunately in a
>>>>>> non-deterministic way.
>>>>>>
>>>>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
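>>>>>> The rkey plumbing involved is essentially the following (a sketch of
>>>>>> the pattern, not the exact test code; the names are placeholders):
>>>>>>
>>>>>>     /* Target side: register the buffer and hand the key to the peer. */
>>>>>>     struct fid_mr *mr;
>>>>>>     fi_mr_reg(domain, buf, buf_len, FI_REMOTE_WRITE, 0, 0, 0, &mr, NULL);
>>>>>>     uint64_t rkey = fi_mr_key(mr);   /* exchanged with the initiator */
>>>>>>
>>>>>>     /* Initiator side: a stale or mismatched rkey here is what should
>>>>>>      * produce the -FI_EACCES completion. */
>>>>>>     fi_write(ep, src, len, fi_mr_desc(local_mr), peer_addr, 0, rkey, NULL);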
>>>>>>
>>>>>>
>>>>>> The unit test that fails creates a lot of memory pressure, with lots
>>>>>> of concurrent memory registrations interleaved with RMA ops. I tried
>>>>>> to reproduce this in the above test, and it will fail with the CQ
>>>>>> error after some number of repeated runs.
>>>>>>
>>>>>> I get the sense that there's a race condition somewhere, and/or the
>>>>>> fi_mr descriptors are getting corrupted. I can't reproduce the error
>>>>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>>>>> all our other integrated runtime unit tests that have less memory
>>>>>> pressure are passing OK.
>>>>>>
>>>>>> Now that I have a stripped-down test, I can try to trace the root
>>>>>> cause more closely, but I thought I'd ask for thoughts.
>>>>>>
>>>>>> - ezra
More information about the Libfabric-users mailing list