[libfabric-users] CQ permission denied (-EACCES)

Howard hppritcha at gmail.com
Wed Feb 10 14:51:56 PST 2016


Ezra

Check out ported/omb in ofi-cray/fabtests-cray on GitHub and see if that's
what you want. It does need a PMI library, though.

Howard


Sent from my iPhone

> On 10.02.2016 at 12:48, Ezra Kissel <ezkissel at indiana.edu> wrote:
> 
> Great, I will test those changes soon.  Getting some form of distributed, threaded test cases into fabtests would be very helpful, I think.  I agree the MPI dependence is annoying; what is needed instead is some general job launching and allgather/barrier support.  Once I have some free cycles again, I might look at integrating some tests myself.
> 
> I also have a list of psm issues that have come out of this recent debugging... :)
> 
>> On 02/10/2016 02:30 PM, Jose, Jithin wrote:
>> I think the race was because of an unprotected call to get-mr-key. I was able to reproduce it occasionally.
>> 
>> The following patch fixes it, and the test seems to run fine with it; still running in a loop :)
>> https://github.com/jithinjosepkl/libfabric/commit/a3a66678cffb376b4cd2bf033ae65fe05df30af0
>> 
>> 
>> I have also updated tx_size_left() to check for CQ availability. All the patches are in https://github.com/jithinjosepkl/libfabric/commits/pr/sockets. Can you give it a try?
>> 
>> Btw, in the application, do you make sure that some other thread is not issuing any TX operations between fi_tx_size_left() and fi_rma* calls? Otherwise, it can still result in -FI_EAGAIN.
>> 
>> - Jithin
>> 
>> -----Original Message-----
>> From: <libfabric-users-bounces at lists.openfabrics.org> on behalf of Ezra Kissel <ezkissel at indiana.edu>
>> Date: Wednesday, February 10, 2016 at 8:45 AM
>> To: "libfabric-users at lists.openfabrics.org" <libfabric-users at lists.openfabrics.org>
>> Subject: [libfabric-users] CQ permission denied (-EACCES)
>> 
>>> Starting a new thread with the issue I was originally trying to debug.
>>> We (IU) have libfabric integrated into a task-based runtime system via
>>> an intermediate RDMA middleware library and one of our unit tests was
>>> often failing with the following error:
>>> 
>>> ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
>>> ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
>>> 
>>> Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
>>> specified in an associated RMA op.  I've spent a lot of time debugging
>>> this, making sure we are passing the correct rkeys, etc.  I've finally
>>> been able to reproduce this issue in a small distributed, threaded test
>>> using the sockets provider again, but unfortunately in a
>>> non-deterministic way.
>>> 
>>> https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
>>> 
>>> Our unit test that fails causes a lot of memory pressure, and lots of
>>> concurrent memory registrations interleaved with RMA ops.  I tried to
>>> reproduce this in the above test and it will fail with the CQ error
>>> after some number of repeated runs.
>>> 
>>> I get the sense that there's a race condition somewhere, and/or the
>>> fi_mr descriptors are getting corrupted.  I can't reproduce the error
>>> without the "alloc_thread" running (athreads=0) in the test, and indeed
>>> all our other integrated runtime unit tests that have less memory
>>> pressure are passing OK.
>>> 
>>> Now that I have a stripped-down test, I can try to more closely trace
>>> the root cause but I thought I'd ask for thoughts.
>>> 
>>> - ezra
>>> _______________________________________________
>>> Libfabric-users mailing list
>>> Libfabric-users at lists.openfabrics.org
>>> http://lists.openfabrics.org/mailman/listinfo/libfabric-users
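
Jithin's diagnosis in the thread above is a race on an unprotected key-generation call. A minimal sketch of that class of bug and its fix follows, assuming MR keys are handed out from a shared counter. The names key_registry, registry_get_key and count_duplicate_keys are hypothetical and are not the sockets provider's actual code; see the linked commit for the real change.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NTHREADS        4
#define KEYS_PER_THREAD 10000
#define TOTAL_KEYS      (NTHREADS * KEYS_PER_THREAD)

/* Hypothetical key source: a counter shared by all registering threads. */
struct key_registry {
    pthread_mutex_t lock;
    uint64_t next_key;
};

static void registry_init(struct key_registry *r)
{
    pthread_mutex_init(&r->lock, NULL);
    r->next_key = 1;
}

/* Without the lock, two threads could read the same next_key value and
 * hand out the same key to two different registrations. */
static uint64_t registry_get_key(struct key_registry *r)
{
    uint64_t key;

    pthread_mutex_lock(&r->lock);
    key = r->next_key++;
    pthread_mutex_unlock(&r->lock);
    return key;
}

struct worker_arg {
    struct key_registry *reg;
    uint64_t *out;
};

static void *worker(void *p)
{
    struct worker_arg *a = p;

    for (int i = 0; i < KEYS_PER_THREAD; i++)
        a->out[i] = registry_get_key(a->reg);
    return NULL;
}

/* Runs NTHREADS concurrent "registrations" and returns how many keys
 * were handed out more than once (or fell outside the expected range). */
static int count_duplicate_keys(void)
{
    static uint64_t keys[NTHREADS][KEYS_PER_THREAD];
    static unsigned char seen[TOTAL_KEYS + 1];
    struct key_registry reg;
    struct worker_arg args[NTHREADS];
    pthread_t tid[NTHREADS];
    int dups = 0;

    memset(seen, 0, sizeof(seen));
    registry_init(&reg);

    for (int t = 0; t < NTHREADS; t++) {
        args[t].reg = &reg;
        args[t].out = keys[t];
        int rc = pthread_create(&tid[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    for (int t = 0; t < NTHREADS; t++) {
        for (int i = 0; i < KEYS_PER_THREAD; i++) {
            uint64_t k = keys[t][i];

            if (k > TOTAL_KEYS || seen[k])
                dups++;
            else
                seen[k] = 1;
        }
    }
    pthread_mutex_destroy(&reg.lock);
    return dups;
}
```

With the lock in place, count_duplicate_keys() should report no duplicates. Removing the lock/unlock pair in registry_get_key() makes duplicates possible, though, as with the original failure, not on every run.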
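
Jithin's question about other threads issuing TX operations between fi_tx_size_left() and the fi_rma* calls describes a check-then-act race. A minimal sketch of one way to handle it follows, using a mock queue rather than libfabric itself. Everything prefixed mock_ or guarded_ is hypothetical: mock_post() stands in for an RMA post that returns -EAGAIN when the queue is full, and mock_complete() stands in for reaping a completion (with libfabric this would presumably mean reading the CQ). The idea is to hold one application-level lock across the check and the post, and still treat "try again" as a retryable result.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define POSTERS        4
#define OPS_PER_POSTER 1000

/* Hypothetical stand-in for an endpoint's transmit queue. */
struct mock_txq {
    size_t capacity;       /* must be nonzero, or retries never succeed */
    size_t in_flight;
    size_t max_in_flight;  /* high-water mark, for checking the demo */
    size_t posted;         /* total successful posts */
};

static size_t mock_tx_size_left(const struct mock_txq *q)
{
    return q->capacity - q->in_flight;
}

static int mock_post(struct mock_txq *q)
{
    if (q->in_flight >= q->capacity)
        return -EAGAIN;
    q->in_flight++;
    q->posted++;
    if (q->in_flight > q->max_in_flight)
        q->max_in_flight = q->in_flight;
    return 0;
}

static void mock_complete(struct mock_txq *q)
{
    if (q->in_flight > 0)
        q->in_flight--;
}

struct guarded_txq {
    pthread_mutex_t lock;
    struct mock_txq q;
};

/* The space check and the post happen under the same lock, so no other
 * thread can consume the slot in between. */
static int guarded_post(struct guarded_txq *g)
{
    int ret;

    pthread_mutex_lock(&g->lock);
    if (mock_tx_size_left(&g->q) == 0)
        ret = -EAGAIN;
    else
        ret = mock_post(&g->q);
    pthread_mutex_unlock(&g->lock);
    return ret;
}

/* Even with the lock, a full queue is a normal condition: make progress
 * by reaping one completion, then try again. */
static int post_with_retry(struct guarded_txq *g)
{
    for (;;) {
        int ret = guarded_post(g);

        if (ret != -EAGAIN)
            return ret;
        pthread_mutex_lock(&g->lock);
        mock_complete(&g->q);
        pthread_mutex_unlock(&g->lock);
    }
}

static void *poster(void *p)
{
    struct guarded_txq *g = p;

    for (int i = 0; i < OPS_PER_POSTER; i++)
        post_with_retry(g);
    return NULL;
}

/* Runs POSTERS threads against one queue; returns the total number of
 * successful posts and reports the in-flight high-water mark. */
static size_t run_posters(size_t capacity, size_t *max_in_flight)
{
    struct guarded_txq g = { .q = { .capacity = capacity } };
    pthread_t tid[POSTERS];
    size_t posted;

    pthread_mutex_init(&g.lock, NULL);
    for (int t = 0; t < POSTERS; t++) {
        int rc = pthread_create(&tid[t], NULL, poster, &g);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < POSTERS; t++)
        pthread_join(tid[t], NULL);

    *max_in_flight = g.q.max_in_flight;
    posted = g.q.posted;
    pthread_mutex_destroy(&g.lock);
    return posted;
}
```

Calling run_posters(8, &max_seen) should complete all POSTERS * OPS_PER_POSTER posts without the in-flight count ever exceeding the capacity. An application that cannot serialize its posting threads this way can keep only the retry loop and drop the up-front size check.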


