[libfabric-users] CQ permission denied (-EACCES)
Ezra Kissel
ezkissel at indiana.edu
Wed Feb 10 14:45:58 PST 2016
So far so good. I applied the updates within our runtime and our unit
tests seem much more stable with libfabric/sockets. My fi thread tests
also seem happier, although I still see the occasional -FI_EAGAIN after
tx_size_left() says there's space. In both my tests and runtime/app
usage, all RMA TX ops are currently serialized so there shouldn't be
anything else contending for reads/writes. Polling for completions and
memory registrations can and do occur concurrently, however.
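As an illustration only (this is not code from the thread or from the runtime discussed here), a sender that treats -FI_EAGAIN as a transient condition might retry the post while driving completion progress between attempts. The sketch below is self-contained and does not link against libfabric: post_with_retry, the callback types, and the mock queue are all hypothetical names, and plain EAGAIN from errno.h stands in for FI_EAGAIN. In a real program the post callback would presumably wrap an RMA call such as fi_write() and the progress callback would wrap fi_cq_read().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback types. In a real libfabric program, post() would
 * wrap an RMA call such as fi_write() and progress() would wrap a
 * completion-queue poll such as fi_cq_read(). */
typedef int (*post_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Retry a post while it reports -EAGAIN, driving progress between attempts.
 * Returns 0 on success, any other negative error as soon as it is seen,
 * or -EAGAIN if all max_tries attempts were rejected. */
int post_with_retry(post_fn post, progress_fn progress, void *ctx,
                    int max_tries)
{
    assert(post != NULL && progress != NULL);

    int rc = -EAGAIN;
    for (int i = 0; i < max_tries; i++) {
        rc = post(ctx);
        if (rc != -EAGAIN)
            return rc;      /* success or a hard error */
        progress(ctx);      /* reap completions to free queue space */
    }
    return rc;
}

/* Mock transmit queue (an assumption for this sketch) used only to
 * exercise the helper without a fabric. */
struct mock_queue {
    int free_slots;   /* entries available for new posts */
    int pending;      /* posted operations not yet completed */
    int posted;       /* total successful posts */
};

/* Consume one slot, or report -EAGAIN when the queue is full. */
int mock_post(void *ctx)
{
    struct mock_queue *q = ctx;
    if (q->free_slots == 0)
        return -EAGAIN;
    q->free_slots--;
    q->pending++;
    q->posted++;
    return 0;
}

/* Complete at most one pending operation, returning its slot. */
void mock_progress(void *ctx)
{
    struct mock_queue *q = ctx;
    if (q->pending > 0) {
        q->pending--;
        q->free_slots++;
    }
}
```

With this shape, a space check beforehand becomes an optimization rather than a guarantee: the caller still has to handle a rejected post, which matches the behavior described above where -FI_EAGAIN shows up even after the size check reports room.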
One issue I still see in our runtime unit tests is a condition where the
remote CQ data (immediate) indicates that a write has completed at the
target, but when I go to check the associated region the read value is
not as expected. I assumed that the semantics of popping immediate data
guaranteed that the writing of the destination buffer has been
successfully completed. Is this an incorrect assumption for libfabric
or is there possibly another race condition here? I'll see if I can
reproduce the behavior in my test repository as well; it's occurring
somewhat rarely at the moment.
- ezra
On 2/10/2016 2:48 PM, Ezra Kissel wrote:
> Great, I will test those changes soon. Getting some form of
> distributed, threaded test cases in fabtests would be very helpful, I
> think. I agree the MPI dependence is annoying, needing some general job
> launching and allgather/barrier support instead. Once I have some free
> cycles again, I might look at integrating some tests myself.
>
> I also have a list of psm issues that have come out of this recent
> debugging... :)
>
> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>> I think the race was because of an unprotected call to get-mr-key. I
>> was able to reproduce it occasionally.
>>
>> The following patch fixes it, and the test seems to run fine with it;
>> still running in a loop :)
>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
>>
>> I have also updated the tx_size_left() to check for the CQ
>> availability. All the patches are in
>> https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you
>> give it a try?
>>
>> Btw, in the application, do you make sure that some other thread is
>> not issuing any TX operations between fi_tx_size_left() and fi_rma*
>> calls? Otherwise, it can still result in -FI_EAGAIN.
>>
>> - Jithin
>>
>> -----Original Message-----
>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of
>> Ezra Kissel <ezkissel at indiana.edu>
>> Date: Wednesday, February 10, 2016 at 8:45 AM
>> To: "libfabric-users at lists.openfabrics.org"
>> <libfabric-users at lists.openfabrics.org>
>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>>
>>> Starting a new thread with the issue I was originally trying to debug.
>>> We (IU) have libfabric integrated into a task-based runtime system via
>>> an intermediate RDMA middleware library and one of our unit tests was
>>> often failing with the following error:
>>>
>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
>>>
>>> Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
>>> specified in an associated RMA op. I've spent a lot of time debugging
>>> this, making sure we are passing the correct rkeys, etc. I've finally
>>> been able to reproduce this issue in a small distributed, threaded test
>>> using the sockets provider again, but unfortunately in a
>>> non-deterministic way.
>>>
>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
>>>
>>> Our unit test that fails causes a lot of memory pressure, and lots of
>>> concurrent memory registrations interleaved with RMA ops. I tried to
>>> reproduce this in the above test and it will fail with the CQ error
>>> after some number of repeated runs.
>>>
>>> I get the sense that there's a race condition somewhere, and/or the
>>> fi_mr descriptors are getting corrupted. I can't reproduce the error
>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>> all our other integrated runtime unit tests that have less memory
>>> pressure are passing OK.
>>>
>>> Now that I have a stripped-down test, I can try to more closely trace
>>> the root cause but I thought I'd ask for thoughts.
>>>
>>> - ezra
>>> _______________________________________________
>>> Libfabric-users mailing list
>>> Libfabric-users at lists.openfabrics.org
>>> http://lists.openfabrics.org/mailman/listinfo/libfabric-users