[ofa-general] New proposal for memory management

Jeff Squyres jsquyres at cisco.com
Fri May 1 05:48:39 PDT 2009


On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:

> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.

Absolutely not.  As Brian stated, we have cited some real-world  
problems that we cannot fix (and we have tried many, many different  
workarounds over the past few years to fix them).

It sounds like your main objection to fixing them is "it's too much  
work."  :-(

> There are applications at Sandia and at Los Alamos, both of which  
> cause problems for our linker tricks.  This leads to such  
> things as (proven) silent data corruption.
>

There are other apps that have also been reported over the years.  C++  
apps with their own allocators are especially problematic.  Abaqus had  
to change their memory allocation model several years ago to be able  
to work around these issues.  These hooking tricks also break  
valgrind, purify, and other memory-checking debuggers.

> Have you tried these applications with any MPI other than OpenMPI ?   
> i.e., does this corruption happen with Intel MPI and other MPIs as  
> well?
>

We have been trying to say that this is a general problem for which  
there is currently no guaranteed fix.  There's always a way to break  
the MPI workarounds for verbs' broken memory management model, because  
there's no way to guarantee that the memory allocation hooks are  
actually invoked.

There are two main reasons to fix these issues:

1. Business: to attract network programmers to verbs (and therefore to  
attract applications and therefore increase market share), it has to  
be simpler and within reach of today's commodity sockets-level  
programmers.  Forcing them to have registration caches and to do  
memory allocation hooking significantly raises the bar.  To date, this  
has been shunned by all network programmers except HPC and a handful  
of storage protocols.

2. Technical: if OFED says "to get good performance with verbs, you  
have to do malloc/mmap/etc. hooks and have a registration cache," this  
unnecessarily *significantly* raises the education and code complexity  
barrier to entry for verbs programmers.  It's also unscalable -- if  
this is something you *have* to do for good performance, why doesn't  
the network stack do it?  It seems weird that you would effectively  
force all ULPs/MPIs/applications to implement the same functionality.   
The memory allocation hooking model also fails if more than one verbs-
based middleware is used in the same application (because only one per  
process will be able to use the memory hooks).

Here's a story that encompasses both reasons:

We had Open MPI *not* use the registration cache by default for a long  
time because of the danger it posed to applications.  Users could  
activate the registration cache with a simple command line parameter.   
But nobody would do that -- they wanted to run with top performance  
right out of the box (which is not unreasonable).  It also led to  
OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look,  
Open MPI's performance is bad!  Our MPI's performance is GREAT!"  Open  
MPI therefore was forced to change its defaults in the 1.3 series to  
activate the [dangerous] memory registration cache by default.

You mentioned that doing this stuff is a choice; the choice that MPIs/ 
ULPs/applications therefore have is:

- don't use registration caches/memory allocation hooking, have  
terrible performance
- use registration caches/memory allocation hooking, have good  
performance

Which is no choice at all.  If customers pay top dollar for these  
networks, they want to see benchmarks run out of the box that show  
that they're getting every flop/byte-per-second that they can.  The  
fact that the programming model is needlessly complicated (and  
dangerous) to get that performance is something that the MPIs have  
tolerated because we had to for competition's sake.

This is not something that non-HPC customers will accept.

> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.
>

Note that Jason G. said in this thread: "Notifiers are going to be  
very troublesome, every time any sort of synchronous to user space  
notifier has been proposed or implemented in the kernel it has been a  
disaster."

-- 
Jeff Squyres
Cisco Systems



