[libfabric-users] assertion failure, perhaps in registration cache, in 1.10.1 with verbs; ofi_rxm

Titus, Greg gregory.titus at hpe.com
Thu Jul 30 15:22:12 PDT 2020


Thanks, Sean.  FI_MR_CACHE_MONITOR=memhooks failed in the same way.  I'll file an Issue.

greg

________________________________
From: Hefty, Sean <sean.hefty at intel.com>
Sent: Thursday, July 30, 2020 3:27 PM
To: Titus, Greg <gregory.titus at hpe.com>; libfabric-users at lists.openfabrics.org <libfabric-users at lists.openfabrics.org>
Subject: RE: [libfabric-users] assertion failure, perhaps in registration cache, in 1.10.1 with verbs; ofi_rxm

I would open a GitHub issue for this.

You can also try using the other cache monitor (MEMHOOKS). That will also capture non-malloc-based allocations (e.g. mmap), though you would need to use a custom build off master to pick up coverage for some allocations (sbrk, I think).

> I'm using the verbs;ofi_rxm provider with libfabric 1.10.1, on an IB-based Cray CS
> system.  I saw this in fi_mr(3):
>
>        As a general rule, if hardware requires the FI_MR_LOCAL mode bit described above,
> but this is not supported by the application, a memory registration cache may be in
> use.
>
>
>
> I thought to myself, "Let's try it!"  I set FI_MR_CACHE_MONITOR=userfaultfd, because my
> application doesn't necessarily allocate all its memory through malloc() etc.  I
> removed FI_MR_LOCAL from my hints, while retaining (FI_MR_VIRT_ADDR | FI_MR_PROV_KEY |
> FI_MR_ALLOCATED).   My only other FI_* env var was FI_LOG_LEVEL=Warn.  I verified that
> I still got the verbs;ofi_rxm provider, and that FI_MR_LOCAL was clear in the returned
> info.  My 2-node test case ran properly, but then failed with the following assertion
> on both nodes, in the call stack for fi_close(&ofi_domain->fid) (where ofi_domain is
> the result of the fi_domain() call):
>
>        a.out: .../prov/util/src/util_buf.c:220: ofi_bufpool_destroy: Assertion `(pool->attr.flags & OFI_BUFPOOL_NO_TRACK) || (buf_region->use_cnt == 0)' failed.
>
>
>
>
> In the resulting core file, I find that it's the second clause (buf_region->use_cnt ==
> 0) of the assertion that's false.  That use_cnt is 1 (one).  No output seemed to result
> from my having set FI_LOG_LEVEL=Warn.
>
>
> What's going on?  Do I need to do some other setup to use the registration cache?  Have
> I failed to fi_close() something?  (I looked and nothing jumped out at me, plus this
> exact binary runs fine if I include FI_MR_LOCAL in the hints and don't change anything
> else.)
>
>
> thanks,
> greg
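
For readers following the thread, here is a minimal sketch of the condition the assertion checks. It is not libfabric's code: the names model_pool, model_region, MODEL_NO_TRACK and the model_* functions are hypothetical stand-ins, and the only thing taken from the report is the shape of the asserted expression. It assumes use_cnt counts buffers handed out from a region and not yet returned, which is consistent with the core file showing use_cnt == 1 but is not confirmed anywhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT libfabric's real types. */
#define MODEL_NO_TRACK (1u << 0)

struct model_region {
    size_t use_cnt;             /* assumed: buffers taken and not yet returned */
};

struct model_pool {
    unsigned flags;             /* MODEL_NO_TRACK disables the check below */
    struct model_region region; /* a single region keeps the sketch short */
};

/* Take one buffer from the pool's region. */
static void model_get(struct model_pool *pool)
{
    pool->region.use_cnt++;
}

/* Return one buffer; returning more than was taken is a caller bug. */
static void model_release(struct model_pool *pool)
{
    assert(pool->region.use_cnt > 0);
    pool->region.use_cnt--;
}

/* Same shape as the expression in the reported assertion:
 * destroying is fine if tracking is off or nothing is outstanding. */
static bool model_destroy_ok(const struct model_pool *pool)
{
    return (pool->flags & MODEL_NO_TRACK) || (pool->region.use_cnt == 0);
}
```

In this model, one model_get() with no matching model_release() leaves use_cnt at 1, so model_destroy_ok() returns false. That matches the state described in the core file, where the second clause of the assertion is the one that is false.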
