[libfabric-users] CQ permission denied (-EACCES)
Ezra Kissel
ezkissel at indiana.edu
Wed Feb 10 08:45:18 PST 2016
Starting a new thread with the issue I was originally trying to debug.
We (IU) have libfabric integrated into a task-based runtime system via
an intermediate RDMA middleware library and one of our unit tests was
often failing with the following error:
ALL:ERR: 1 (cq_readerr:53): > local CQ: 13 Permission denied
ALL:ERR: 1 (cq_readerr:54): > local CQ: prov_err: Permission denied (-13)
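For anyone reading the log above: the 13 and the -13 are the same errno, EACCES, once as a positive value and once negated. Below is a minimal C sketch of that mapping, assuming Linux, where EACCES is 13. The helpers normalize_errno and describe_err are hypothetical names used only for illustration; they are not part of libfabric or of our middleware.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a libfabric call): accept an error code that
 * may be reported either as a positive errno or as its negation, and
 * return the positive errno value. */
static int normalize_errno(int code)
{
    return code < 0 ? -code : code;
}

/* Hypothetical helper: write "<errno> <message>" into buf, similar in
 * spirit to the "13 Permission denied" text in the log above.
 * Returns the snprintf result. */
static int describe_err(int code, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    int e = normalize_errno(code);
    return snprintf(buf, len, "%d %s", e, strerror(e));
}
```

Calling describe_err(-EACCES, buf, sizeof buf) and describe_err(EACCES, buf, sizeof buf) should fill buf with the same string, which is why both log lines read "Permission denied".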
Obviously, -EACCES is an appropriate CQ error if the incorrect rkey is
specified in an associated RMA op. I've spent a lot of time debugging
this, making sure we are passing the correct rkeys, etc. I've finally
been able to reproduce this issue in a small distributed, threaded test
using the sockets provider again, but unfortunately in a
non-deterministic way.
https://github.com/disprosium8/fi_thread_test/blob/master/fi_rma_thread_mr.c
Our failing unit test generates a lot of memory pressure, with many
concurrent memory registrations interleaved with RMA ops. I tried to
reproduce that pattern in the above test, and it fails with the CQ error
after some number of repeated runs.
I get the sense that there's a race condition somewhere, and/or the
fi_mr descriptors are getting corrupted. I can't reproduce the error
when the "alloc_thread" is not running (athreads=0) in the test, and
indeed all our other integrated runtime unit tests that have less memory
pressure are passing OK.
Now that I have a stripped-down test, I can try to trace the root cause
more closely, but I thought I'd ask for thoughts first.
- ezra