[libfabric-users] CQ permission denied (-EACCES)

Ezra Kissel ezkissel at indiana.edu
Wed Feb 10 08:45:18 PST 2016


Starting a new thread with the issue I was originally trying to debug. 
We (IU) have libfabric integrated into a task-based runtime system via 
an intermediate RDMA middleware library and one of our unit tests was 
often failing with the following error:

ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
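
For anyone reading the log above: the 13 and the -13 are the same errno
(EACCES), once as a positive value and once negated. A minimal sketch of
normalising and formatting such a code follows. It uses only libc, makes
no libfabric calls, and the helper names are hypothetical (they are not
part of libfabric or of our middleware). It assumes a platform where
EACCES is 13, as on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: accept an error code in either sign convention
 * (13 or -13) and return the positive errno value. */
static int normalize_err(int code)
{
    return code < 0 ? -code : code;
}

/* Hypothetical helper: returns 1 if the code, positive or negated,
 * is EACCES, and 0 otherwise. */
static int is_access_error(int code)
{
    return normalize_err(code) == EACCES;
}

/* Hypothetical helper: write "local CQ: <errno> <message>" into buf,
 * in the style of the log lines above. The message text comes from
 * strerror() and therefore depends on the C library and locale. */
static void format_cq_err(char *buf, size_t len, int code)
{
    int e = normalize_err(code);

    assert(buf != NULL && len > 0);
    snprintf(buf, len, "local CQ: %d %s", e, strerror(e));
}
```

Calling format_cq_err(buf, sizeof buf, -EACCES) fills buf with a line
that starts with the positive errno value, so both log lines above
decode to the same underlying error.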

Obviously, -EACCES is an appropriate CQ error if an incorrect rkey is 
specified in the associated RMA op.  I've spent a lot of time debugging 
this, making sure we are passing the correct rkeys, etc.  I've finally 
been able to reproduce the issue in a small distributed, threaded test, 
again using the sockets provider, but unfortunately only in a 
non-deterministic way.

https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c

The unit test that fails creates a lot of memory pressure and performs 
many concurrent memory registrations interleaved with RMA ops.  I tried 
to reproduce this pattern in the test above, and it fails with the CQ 
error after some number of repeated runs.

I get the sense that there's a race condition somewhere, and/or the 
fi_mr descriptors are getting corrupted.  I can't reproduce the error 
when the "alloc_thread" is not running (athreads=0) in the test, and 
indeed all our other integrated runtime unit tests, which create less 
memory pressure, are passing OK.

Now that I have a stripped-down test, I can trace the root cause more 
closely, but I thought I'd ask for thoughts first.

- ezra
