[libfabric-users] assertion failure, perhaps in registration cache, in 1.10.1 with verbs;ofi_rxm

Titus, Greg gregory.titus at hpe.com
Thu Jul 30 14:04:19 PDT 2020


I'm using the verbs;ofi_rxm provider with libfabric 1.10.1 on an IB-based Cray CS system. I saw this in fi_mr(3):

    "As a general rule, if hardware requires the FI_MR_LOCAL mode bit described above, but this is not supported by the application, a memory registration cache may be in use."

I thought to myself, "Let's try it!" I set FI_MR_CACHE_MONITOR=userfaultfd, because my application doesn't necessarily allocate all its memory through malloc() and related calls. I removed FI_MR_LOCAL from my hints, while retaining (FI_MR_VIRT_ADDR | FI_MR_PROV_KEY | FI_MR_ALLOCATED). My only other FI_* environment variable was FI_LOG_LEVEL=Warn. I verified that I still got the verbs;ofi_rxm provider, and that FI_MR_LOCAL was clear in the returned info. My 2-node test case ran properly, but then both nodes failed the following assertion in the call stack for fi_close(&ofi_domain->fid), where ofi_domain is the result of the fi_domain() call:
a.out: .../prov/util/src/util_buf.c:220: ofi_bufpool_destroy: Assertion `(pool->attr.flags & OFI_BUFPOOL_NO_TRACK) || (buf_region->use_cnt == 0)' failed.

In the resulting core file, I find that the second clause of the assertion, (buf_region->use_cnt == 0), is the one that is false: use_cnt is 1. Setting FI_LOG_LEVEL=Warn did not seem to produce any output.
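
To make the failing condition concrete, here is a toy model in C of a tracked buffer pool. It is only a sketch of what the assertion text implies, not libfabric's implementation: every name in it (toy_pool, toy_region, TOY_POOL_NO_TRACK, and the functions) is hypothetical, and the only thing taken from the report is the shape of the condition, "(flags & NO_TRACK) || (use_cnt == 0)".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a "do not track buffers" pool flag. */
#define TOY_POOL_NO_TRACK (1u << 0)

/* Hypothetical region: counts buffers handed out and not yet returned. */
struct toy_region {
    size_t use_cnt;
};

/* Hypothetical pool with a single region, enough to show the check. */
struct toy_pool {
    unsigned flags;
    struct toy_region region;
};

void toy_pool_init(struct toy_pool *pool, unsigned flags)
{
    pool->flags = flags;
    pool->region.use_cnt = 0;
}

/* Handing out a buffer raises the region's outstanding count. */
void toy_buf_alloc(struct toy_pool *pool)
{
    pool->region.use_cnt++;
}

/* Returning a buffer lowers the count again. */
void toy_buf_release(struct toy_pool *pool)
{
    assert(pool->region.use_cnt > 0);
    pool->region.use_cnt--;
}

/*
 * Same shape as the condition in the reported assertion: destroying the
 * pool is acceptable if tracking is disabled, or if every buffer from
 * the region has been returned.
 */
bool toy_pool_can_destroy(const struct toy_pool *pool)
{
    return (pool->flags & TOY_POOL_NO_TRACK) || (pool->region.use_cnt == 0);
}
```

In this model, a use_cnt of 1 at destroy time means exactly one buffer was taken from the pool and never returned before the pool was torn down. Which object would hold such a buffer in the real library is not something the model can say.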

What's going on?  Do I need to do some other setup to use the registration cache?  Have I failed to fi_close() something?  (I looked and nothing jumped out at me, plus this exact binary runs fine if I include FI_MR_LOCAL in the hints and don't change anything else.)

thanks,
greg
