[openib-general] Question about pinning memory

Pete Wyckoff pw at osc.edu
Sun Jul 24 10:03:35 PDT 2005


jsquyres at open-mpi.org wrote on Fri, 22 Jul 2005 19:04 -0400:
> Otherwise, the scenario in question #3 is a real problem.  There are a 
> few possibilities for fixing it, but all are problematic (override 
> sbrk() via including ptmalloc2 in the distribution, using LD_PRELOAD to 
> override sbrk(), etc.).  Any other suggestions would be welcome...

I did some thinking about this issue a while back and came up with a
cooperative kernel/user implementation to track Linux VM activity using
existing vm_area_struct->vm_ops function hooks (i.e. no kernel patch
required).  The MPI library essentially makes a system call before
reusing a cached memory registration to verify it is still valid, and
the kernel module keeps track of what happens to cached mappings as the
VM system is exercised via sbrk, mmap, fork, etc.  It works for any sort
of memory activity, including arbitrary mmap() of memory or files, since
it plugs in at the basic VM interface level.
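
To make the validate-before-reuse flow concrete, here is a minimal
user-space sketch in C. It is not the code from the paper: the kernel
module is simulated by an in-process table, and every name in it
(track_region, vm_range_gone, region_still_valid, get_registration,
registration_count) is hypothetical. In the design described above,
region_still_valid would correspond to the system call made before reuse,
and vm_range_gone to whatever the kernel module does when the VM hooks
report that a range has changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGS 16

/* Hypothetical stand-in for the kernel module's record of one cached mapping. */
struct tracked_region {
    uintptr_t start;
    size_t    len;
    bool      in_use;
    bool      valid;   /* cleared when the underlying VM range changes */
};

static struct tracked_region table[MAX_REGS];
static int registrations;  /* counts simulated (expensive) registrations */

int registration_count(void) { return registrations; }

/* Simulated registration: record the region and mark it valid. */
int track_region(uintptr_t start, size_t len)
{
    assert(len > 0);
    for (int i = 0; i < MAX_REGS; i++) {
        if (!table[i].in_use) {
            table[i] = (struct tracked_region){ start, len, true, true };
            registrations++;
            return i;
        }
    }
    return -1;  /* table full */
}

/* Simulated VM event (e.g. munmap or heap shrink): invalidate overlaps. */
void vm_range_gone(uintptr_t start, size_t len)
{
    for (int i = 0; i < MAX_REGS; i++) {
        struct tracked_region *r = &table[i];
        if (r->in_use && start < r->start + r->len && r->start < start + len)
            r->valid = false;
    }
}

/* Simulated "is this cached registration still valid?" query. */
bool region_still_valid(int id)
{
    return id >= 0 && id < MAX_REGS && table[id].in_use && table[id].valid;
}

/* Library side: reuse a cached registration only after validating it. */
int get_registration(uintptr_t start, size_t len)
{
    for (int i = 0; i < MAX_REGS; i++) {
        struct tracked_region *r = &table[i];
        if (!r->in_use || start < r->start || start + len > r->start + r->len)
            continue;            /* entry does not cover the request */
        if (region_still_valid(i))
            return i;            /* cache hit, no new registration */
        r->in_use = false;       /* stale entry: drop it and re-register */
    }
    return track_region(start, len);
}
```

A caller would invoke get_registration() for every buffer it is about to
send from: repeated calls on an untouched buffer return the same id
without re-registering, while a call after vm_range_gone() has hit that
buffer falls through to a fresh registration.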

You can read all about it in this paper presented at CCGrid '05:

    http://www.osc.edu/~pw/papers/wyckoff-memreg-ccgrid05.pdf

As far as where this component should live, I agree it is not exactly
OpenIB's problem, but there are many consumers that will want to do
memory registration caching to avoid the roughly 100 us overhead for
each reg or dereg.  The problem stems from message passing APIs that
want to cache mappings of arbitrary memory regions handed down from the
application.  And all the libraries must use the same cache too:
consider an app that uses MPI and the user-space parallel file system
PVFS over IB.  The registration cache used by these libraries should be
shared for correctness and for performance.  To further complicate
matters, if you were using two different NIC types, both of which
required memory registration, in the same application, the single
application cache should handle both devices.  Thus it's not just an IB
problem, nor is it just an MPI or other library problem.
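
One way to picture a single cache shared by several libraries and
several NIC types is an entry keyed by address range that carries one
handle per device. The C sketch below is only an illustration under
that assumption: the device numbering, handle values, and all function
names (shared_cache_get, shared_cache_invalidate, device_register,
device_registration_count) are hypothetical, and device_register merely
stands in for a real per-NIC registration call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DEVICES 2    /* e.g. two different NIC types (assumption) */
#define MAX_ENTRIES 16
#define NO_HANDLE   0u

/* One cache entry per memory range, shared by every library in the app. */
struct shared_entry {
    uintptr_t start;
    size_t    len;
    int       in_use;
    uint32_t  handle[MAX_DEVICES];  /* NO_HANDLE until that device registers */
};

static struct shared_entry cache[MAX_ENTRIES];
static uint32_t next_handle = 1;
static int device_registrations;

int device_registration_count(void) { return device_registrations; }

/* Hypothetical stand-in for the expensive per-device registration. */
static uint32_t device_register(int dev, uintptr_t start, size_t len)
{
    (void)dev; (void)start; (void)len;
    device_registrations++;
    return next_handle++;
}

/* Any library (MPI, a file system client, ...) asks the same cache. */
uint32_t shared_cache_get(int dev, uintptr_t start, size_t len)
{
    assert(dev >= 0 && dev < MAX_DEVICES);
    struct shared_entry *e = NULL, *free_slot = NULL;
    for (int i = 0; i < MAX_ENTRIES; i++) {
        if (cache[i].in_use && cache[i].start == start && cache[i].len == len) {
            e = &cache[i];
            break;
        }
        if (!cache[i].in_use && !free_slot)
            free_slot = &cache[i];
    }
    if (!e) {
        if (!free_slot)
            return NO_HANDLE;  /* cache full */
        e = free_slot;
        *e = (struct shared_entry){ .start = start, .len = len, .in_use = 1 };
    }
    if (e->handle[dev] == NO_HANDLE)   /* register lazily, once per device */
        e->handle[dev] = device_register(dev, start, len);
    return e->handle[dev];
}

/* One invalidation drops the handles for all devices at once. */
void shared_cache_invalidate(uintptr_t start, size_t len)
{
    for (int i = 0; i < MAX_ENTRIES; i++) {
        struct shared_entry *e = &cache[i];
        if (e->in_use && start < e->start + e->len && e->start < start + len)
            memset(e, 0, sizeof *e);
    }
}
```

With this layout, two libraries asking for the same buffer on the same
device share one registration, a second device adds its own handle to
the same entry, and a single invalidation covers both devices.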

I can distribute the code if anyone is curious, but it needs some work
to become production quality.

		-- Pete



