[libfabric-users] assertion failure, perhaps in registration cache, in 1.10.1 with verbs; ofi_rxm
Titus, Greg
gregory.titus at hpe.com
Thu Jul 30 14:04:19 PDT 2020
I'm using the verbs;ofi_rxm provider with libfabric 1.10.1, on an IB-based Cray CS system. I saw this in fi_mr(3):
As a general rule, if hardware requires the FI_MR_LOCAL mode bit described above, but this is not supported by the application, a memory registration cache may be in use.
I thought to myself, "Let's try it!" I set FI_MR_CACHE_MONITOR=userfaultfd, because my application doesn't necessarily allocate all its memory through malloc() etc. I removed FI_MR_LOCAL from my hints, while retaining (FI_MR_VIRT_ADDR | FI_MR_PROV_KEY | FI_MR_ALLOCATED). My only other FI_* env var was FI_LOG_LEVEL=Warn. I verified that I still got the verbs;ofi_rxm provider, and that FI_MR_LOCAL was clear in the returned info. My 2-node test case itself ran properly, but both nodes then failed with the following assertion, raised in the call stack for fi_close(&ofi_domain->fid) (where ofi_domain is the result of the fi_domain() call):
a.out: .../prov/util/src/util_buf.c:220: ofi_bufpool_destroy: Assertion `(pool->attr.flags & OFI_BUFPOOL_NO_TRACK) || (buf_region->use_cnt == 0)' failed.
In the resulting core file, I find that it's the second clause of the assertion, (buf_region->use_cnt == 0), that's false: use_cnt is 1. Setting FI_LOG_LEVEL=Warn didn't seem to produce any output.
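To make the failing condition concrete, here is a toy model of the bookkeeping that the assertion text implies. This is only a sketch under stated assumptions: every toy_* name and structure below is hypothetical and is not libfabric's actual util_buf.c code; only the shape of the check is taken from the assertion message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for OFI_BUFPOOL_NO_TRACK; not the real value. */
#define TOY_POOL_NO_TRACK (1u << 0)

/* Hypothetical region: counts buffers currently handed out from it. */
struct toy_region {
    size_t use_cnt;
};

/* Hypothetical pool with a single region, for illustration only. */
struct toy_pool {
    unsigned flags;
    struct toy_region region;
};

/* Hand out a buffer: the region's outstanding count goes up. */
void toy_alloc(struct toy_pool *pool)
{
    pool->region.use_cnt++;
}

/* Return a buffer: the outstanding count goes back down. */
void toy_release(struct toy_pool *pool)
{
    if (pool->region.use_cnt > 0)
        pool->region.use_cnt--;
}

/* Same shape as the failing check: destruction is acceptable only if
 * tracking is disabled or no buffers are still outstanding. */
bool toy_can_destroy(const struct toy_pool *pool)
{
    return (pool->flags & TOY_POOL_NO_TRACK) || (pool->region.use_cnt == 0);
}

/* Aborts, like the assertion in the report, if a buffer was never returned. */
void toy_destroy(struct toy_pool *pool)
{
    assert(toy_can_destroy(pool));
    pool->region.use_cnt = 0;
}
```

In this model, a use_cnt of 1 at destroy time means exactly one buffer was taken from the pool and never handed back before teardown, which is what the core file seems to suggest happened here.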
What's going on? Do I need to do some other setup to use the registration cache? Have I failed to fi_close() something? (I looked and nothing jumped out at me, plus this exact binary runs fine if I include FI_MR_LOCAL in the hints and don't change anything else.)
thanks,
greg