[ofa-general] Re: [PATCH 2/3] libmlx4 - Optimize memory allocation of QP buffers with 64K pages

Tue May 19 23:00:47 PDT 2009

  Hi Roland,

On Tue, 19 May 2009 15:01:13 -0700
Roland Dreier <rdreier at cisco.com> wrote:

>  >   QP buffers are allocated with mlx4_alloc_buf(), which rounds the buffer
>  > size up to the page size and then allocates page-aligned memory using
>  > posix_memalign().
>  > 
>  >   However, this allocation is quite wasteful on architectures using 64K pages
>  > (ia64 for example) because we then hit glibc's MMAP_THRESHOLD malloc
>  > parameter and chunks are allocated using mmap. Thus we end up allocating:
>  > 
>  > (requested size rounded to the page size) + (page size) + (malloc overhead)
>  > 
>  > rounded internally to the page size.
>  > 
>  >   So for example, if we request a buffer of page_size bytes, we end up
>  > consuming 3 pages. In short, for each QP buffer we allocate, there is an
>  > overhead of 2 pages. This is quite visible, especially on large clusters
>  > where the number of QPs can reach several thousand.
>  > 
>  >   This patch creates a new function mlx4_alloc_page() for use by
>  > mlx4_alloc_qp_buf() that does an mmap() instead of a posix_memalign() when
>  > the page size is 64K.
> 
> makes sense I guess.  It would be nice if glibc were smart enough to
> know that mmap(MAP_ANONYMOUS) is going to give something page-aligned
> anyway,

  If you mean in the posix_memalign() path, then yes it'd be really nice.

> but it seems that malloc overhead (required to make the memory
> from posix_memalign() work with free()) is going to cost at least one
> extra page, which as you point out is pretty bad with 64KB pages.  (Of
> course 64KB pages are a disaster for any workload that deals with small
> objects of any kind, but that's another story)

  Yep, agreed.

> 
> However I wonder why we want to make this optimization only for 64KB
> pages.  It seems the code would be simpler if we just had our own
> page-aligned allocator using mmap(MAP_ANONYMOUS) and just used it
> unconditionally everywhere.  Or is it not actually better even on
> sane-sized (ie 4KB) page systems?  It seems we still have the malloc
> overhead which is going to cost us another page?

  Well, not really, because if we stay below MMAP_THRESHOLD, as we do
with 4K pages, the only overhead is malloc's chaining structure. The
extra space used to align the buffer is released before posix_memalign()
returns, but that does increase fragmentation of malloc's chunks.
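
  To make the 64K case concrete, here is a small sketch of the arithmetic
from the patch description. It is not part of the patch: the helper names
are made up for illustration, and the malloc overhead is passed in as a
parameter because its exact value depends on the glibc build (16 bytes
below is only an assumption).

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align (align must be non-zero). */
static size_t round_up(size_t n, size_t align)
{
	assert(align != 0);
	return (n + align - 1) / align * align;
}

/*
 * Hypothetical helper: size of the mmap()ed chunk behind a page-aligned
 * request once we are above MMAP_THRESHOLD, following the formula
 *
 *   (requested size rounded to the page size) + (page size)
 *       + (malloc overhead)
 *
 * rounded to the page size. malloc_overhead is an assumed value supplied
 * by the caller, not something queried from glibc.
 */
static size_t mmapped_chunk_size(size_t req, size_t page_size,
				 size_t malloc_overhead)
{
	return round_up(round_up(req, page_size) + page_size + malloc_overhead,
			page_size);
}
```

  With req = page_size = 65536 and an assumed overhead of 16 bytes this
gives 196608 bytes, i.e. 3 pages for a 1 page buffer.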

  Also, for 4K pages, mmap() systematically results in a syscall whereas
posix_memalign() does not necessarily, but as we're not on a fast path
I'm not sure which would be best. I don't mind converting all QP buffer
allocations to mmap(), but I'd like to hear what people think.
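
  For reference, an unconditional page-aligned allocator on top of
mmap(MAP_ANONYMOUS) could look roughly like the sketch below. The function
names are hypothetical and this is not the code from the patch; the caller
has to keep the mapped length around for munmap().

```c
#define _DEFAULT_SOURCE	/* MAP_ANONYMOUS under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map at least `size` bytes of zeroed, page-aligned
 * memory. The length actually mapped (size rounded up to the page size)
 * is stored in *mapped_len. Returns NULL on failure.
 */
static void *sketch_alloc_pages(size_t size, size_t *mapped_len)
{
	long page_size = sysconf(_SC_PAGESIZE);
	size_t ps, len;
	void *p;

	assert(mapped_len != NULL);
	if (page_size <= 0 || size == 0)
		return NULL;

	ps = (size_t)page_size;
	len = (size + ps - 1) / ps * ps;

	/* Anonymous mappings are page-aligned, so no extra page is needed. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	*mapped_len = len;
	return p;
}

/* Hypothetical counterpart: unmap a buffer from sketch_alloc_pages(). */
static int sketch_free_pages(void *p, size_t mapped_len)
{
	return munmap(p, mapped_len);
}
```

  This costs one mmap() and one munmap() syscall per buffer, which is the
trade-off against posix_memalign() mentioned above.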

  Thanks Roland,

  Sebastien.