[ofa-general] Re: New proposal for memory management

Wed Apr 29 05:15:57 PDT 2009

On Apr 29, 2009, at 12:03 AM, Jason Gunthorpe wrote:

> I've often wondered, wouldn't it just be fine for MPI if the entire
> process address space is kept pinned, registered and consistent with
> the HCA? The process would opt in to this behavior during MPI
> startup. Similar in spirit to the all physical memory registration the
> kernel can do.
>

An interesting idea.  As I understand your idea, you essentially have  
to pre-allocate memory to all MPI processes, registering all available  
RAM.  After thinking about this a little bit, I think there are still  
a few problems, though:

- How much memory do you give to each MPI process?  (phys_ram -  
OS_overhead) / num_mpi_processes?  What if each MPI process is not  
created equal -- some need more RAM than others?  Does each MPI  
process need to know at the beginning of time the max memory that it  
might need in the future?  That could be quite difficult to know -- it  
seems like a large new restriction to impose on users.

- As we head towards "manycore", the above problem will get [much]  
worse, because I think we'll be heading back to the days of running  
multiple different MPI jobs on a single machine.  These jobs will have  
no a priori knowledge of each other; if the 2nd MPI job launched on a  
machine needs more than (phys_ram - OS_overhead) / num_processors, how  
is that coordinated with the 1st MPI job that is already running on  
the same machine?

- What about any other (non-MPI) process that needs to run?  If all  
memory after the OS is registered / unswappable / allocated to MPI  
processes, then how do random processes get any memory to run?  (e.g.,  
shell scripts, daemons, etc.)  If you simply leave X space
unregistered specifically for such non-MPI processes, how do you decide
the value of X?

- The preallocation/registration of memory must happen pre-main(),
because the first MPI function that is invoked (MPI_Init()) may not
be called until well after main() has started, and potentially after
some calls to malloc() (etc.).  For example, the following is a valid
MPI program:

int main(...) {
   int *a = malloc(...);   /* allocated before MPI is initialized */
   MPI_Init(...);
   MPI_Send(a, ...);       /* sends a buffer that predates MPI_Init() */
   ...
}
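
To illustrate the pre-main() point: one way a library could get
control before main() is a constructor function.  The sketch below
assumes GCC or Clang (the constructor attribute is a compiler
extension, not ISO C), and the names early_pin_hook(),
early_hook_has_run() and require_early_hook() are hypothetical.  It
only shows the ordering; the actual pinning/registration step is
deliberately left as a comment, and this is not a claim about how any
MPI implementation does it.

```c
#include <assert.h>

/* Set by the hook below; stays 0 until the hook has run. */
static int early_hook_ran = 0;

/*
 * Hypothetical pre-main() hook.  __attribute__((constructor)) is a
 * GCC/Clang extension that makes this function run before main().
 * A real implementation would pin/register memory here; that step
 * is intentionally omitted from this sketch.
 */
__attribute__((constructor))
static void early_pin_hook(void)
{
    early_hook_ran = 1;
}

/* Returns 1 if the pre-main() hook has already run, 0 otherwise. */
int early_hook_has_run(void)
{
    return early_hook_ran;
}

/* Aborts (unless NDEBUG is defined) if the hook did not run first. */
void require_early_hook(void)
{
    assert(early_hook_ran && "pre-main hook did not run");
}
```

An application could call require_early_hook() as the first statement
in main(), before its first malloc(), to check that the hook really
ran ahead of user code.  This only addresses the ordering problem; it
does nothing for the question of how much memory each process gets.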

Re-reading your brief text, I'm wondering if I missed the zen of what
you're trying to suggest...?  If I'm off the mark, can you explain  
more?  Thanks.

-- 
Jeff Squyres
Cisco Systems