[ewg] [PATCH v2] libibverbs: ibv_fork_init() and libhugetlbfs

Tue Jul 6 08:25:16 PDT 2010

On Sat, 03 Jul 2010 13:19:07 -0700
Roland Dreier <rdreier at cisco.com> wrote:

>  >  When registering two memory regions A and B from within
>  > the same huge page, we will end up with one node in the tree which covers the
>  > whole huge page after registering A. When the second MR is registered, a node
>  > is created with the MR size rounded to the system page size (as there is no
>  > need to call madvise(), it is not noticed that MR B is part of a huge page).
>  > 
>  > Now if MR A is deregistered before MR B, I see that the tree containing
>  > mem_nodes is empty afterwards, which causes problems for the deregistration of
>  > MR B, leaving the tree in a corrupted state with negative refcounts. This also
>  > breaks later registrations of other memory regions within this huge page.
> 
> Good thing I didn't get around to applying the patch yet ;)
> 
> I haven't thought this through fully, but it seems that maybe we could
> extend the madvise tracking tree to keep track of the page size used for
> each node in the tree.  Then for the registration of MR B above, we
> would find the node for MR A covered MR B and we should be able to get
> the ref counting right.

We thought about this too, but in some special cases we cannot determine the
correct page size of a memory range. For example, when getting a 16M chunk
that is aligned to 16M from a region backed by 16M huge pages, the first
madvise() succeeds and the code assumes that the page size is 64K.

When registering a region of 16M - 64K + 1 bytes, the first madvise() also
succeeds. If a second memory region that resides in the last 64K is then
registered, we end up in the same situation as above.
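
To illustrate the arithmetic behind the second example, here is a rough
sketch. It assumes that a range is rounded outward to the 64K system page
size before madvise() is called. The names round_to_page() and
is_huge_aligned() and the two constants are made up for this mail and are
not taken from the libibverbs sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes from the example above; hypothetical constants for illustration. */
#define SYS_PAGE_SIZE  (64ULL * 1024)         /* 64K system page */
#define HUGE_PAGE_SIZE (16ULL * 1024 * 1024)  /* 16M huge page   */

/* Half-open address range [start, end). */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Round [base, base + len) outward to multiples of page_size.
 * page_size must be a power of two.
 */
static struct range round_to_page(uint64_t base, uint64_t len,
				  uint64_t page_size)
{
	struct range r;

	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	r.start = base & ~(page_size - 1);
	r.end = (base + len + page_size - 1) & ~(page_size - 1);
	return r;
}

/* True if both ends of the range fall on a huge page boundary. */
static bool is_huge_aligned(struct range r)
{
	return r.start % HUGE_PAGE_SIZE == 0 && r.end % HUGE_PAGE_SIZE == 0;
}
```

For a region of 16M - 64K + 1 bytes that starts on a 16M boundary,
round_to_page() with SYS_PAGE_SIZE returns exactly one full 16M range, so
is_huge_aligned() is true even though only 64K granularity was assumed. A
region of exactly 16M - 64K bytes at the same base rounds to a range that
ends 64K short of the huge page boundary.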

This issue was not present in version 2 of the code, but that version had
a big performance penalty. I therefore suggest the following: we go back to
version 2 and introduce a new RDMAV_HUGEPAGE_SAFE environment variable that
lets the user choose between huge page support and better performance (the
same approach we use for the COW protection itself). Would this be okay, or do
you see a problem with this?
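
Roughly, the opt-in could look like the sketch below. This is not the
actual patch: it assumes that merely setting RDMAV_HUGEPAGE_SAFE (to any
value) enables the slower, huge page aware path, and the enum and function
names are placeholders.

```c
#include <stdlib.h>

/* Placeholder names for the two behaviours the user could choose between. */
enum tracking_mode {
	TRACK_FAST,           /* assume system page size, better performance */
	TRACK_HUGEPAGE_SAFE   /* huge page aware handling, slower            */
};

/*
 * Map the value of the environment variable to a mode.
 * Assumption: any value, even an empty string, enables the safe mode.
 */
static enum tracking_mode mode_from_env_value(const char *value)
{
	return value != NULL ? TRACK_HUGEPAGE_SAFE : TRACK_FAST;
}

/* Look up the proposed variable once, e.g. during fork-safety setup. */
static enum tracking_mode select_tracking_mode(void)
{
	return mode_from_env_value(getenv("RDMAV_HUGEPAGE_SAFE"));
}
```

Users who do not need huge page support would leave the variable unset and
keep the faster path.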

Regards,
Alex