[ofa-general] New proposal for memory management
Jeff Squyres
jsquyres at cisco.com
Fri May 1 05:48:39 PDT 2009
On Apr 30, 2009, at 6:01 PM, Woodruff, Robert J wrote:
> To me, all this sounds like a lot of whining....
> Why can't the OS fix all my problems.
Absolutely not. As Brian stated, we have cited some real-world
problems that we cannot fix (and we have tried many, many different
workarounds over the past few years to fix them).
It sounds like your main objection to fixing them is "it's too much
work." :-(
> There's an application at Sandia and at Los Alamos, both of
> which cause problems for our linker tricks. This leads to such
> things as (proven) silent data corruption.
>
There are other apps that have also been reported over the years. C++
apps with their own allocators are especially problematic. Abaqus had
to change their memory allocation model several years ago to be able
to work around these issues. These memory models also break valgrind,
purify, and other memory-checking debuggers.
> Have you tried these applications with any MPI other than OpenMPI ?
> i.e., does this corruption happen with Intel MPI and other MPIs as
> well?
>
We have been trying to say that this is a general problem that there
currently is no guaranteed fix for. There's always a way to break the
MPI workarounds for verbs' broken memory management model because
there's no way to guarantee the memory allocation hooks.
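To make the failure mode concrete, here is a toy sketch (plain Python, no real verbs calls) of how a registration cache goes stale when a free/munmap hook is missed. Every name in it (ToyOS, RegistrationCache, run) is hypothetical and exists only for illustration; real registrations involve pinned pages and HCA translation tables, which this model reduces to a single "physical page id" per address.

```python
import itertools


class ToyOS:
    """Toy virtual memory: each mapping of an address gets a fresh physical page."""

    def __init__(self):
        self._next_page = itertools.count(100)
        self.page_table = {}  # virtual address -> physical page id

    def mmap(self, addr):
        self.page_table[addr] = next(self._next_page)
        return addr

    def munmap(self, addr):
        del self.page_table[addr]


class RegistrationCache:
    """Remembers which physical page an address had when it was registered."""

    def __init__(self, os_):
        self.os = os_
        self.entries = {}  # virtual address -> physical page id at registration

    def lookup_or_register(self, addr):
        if addr not in self.entries:  # miss: do the (expensive) registration
            self.entries[addr] = self.os.page_table[addr]
        return self.entries[addr]  # hit: reused without any validity check

    def invalidate(self, addr):
        # What a working free/munmap hook would call.
        self.entries.pop(addr, None)


def run(hook_fires):
    """Return True if the NIC would target the page the application now owns."""
    os_ = ToyOS()
    cache = RegistrationCache(os_)
    addr = os_.mmap(0x1000)
    cache.lookup_or_register(addr)  # first send registers the buffer
    os_.munmap(addr)                # application frees the buffer
    if hook_fires:
        cache.invalidate(addr)      # hook saw the free and dropped the entry
    os_.mmap(0x1000)                # same virtual address, new physical page
    nic_page = cache.lookup_or_register(addr)
    return nic_page == os_.page_table[addr]
```

In this model, run(True) reports that the NIC and the application agree on the page, while run(False) reports a mismatch with no error raised anywhere, which is the "silent" part of the corruption described above.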
There are two main reasons to fix these issues:
1. Business: to attract network programmers to verbs (and therefore to
attract applications and therefore increase market share), it has to
be simpler and within reach of today's commodity sockets-level
programmers. Forcing them to have registration caches and to do
memory allocation hooking significantly raises the bar. To date, this
has been shunned by all network programmers except HPC and a handful
of storage protocols.
2. Technical: if OFED says "to get good performance with verbs, you
have to do malloc/mmap/etc. hooks and have a registration cache," this
unnecessarily and *significantly* raises the education and code complexity
barrier to entry for verbs programmers. It's also unscalable -- if
this is something you *have* to do for good performance, why doesn't
the network stack do it? It seems weird that you would effectively
force all ULPs/MPIs/applications to implement the same functionality.
The memory allocation hooking model also fails if more than one verbs-
based middleware is used in the same application (because only one
will be able to use the memory hooks per process).
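The multiple-middleware failure can be sketched the same way. This toy assumes the hook is a single per-process slot, so whichever library installs its hook last silently replaces the earlier one. Process, Middleware, and the "mpi"/"storage" names are hypothetical; a cooperative library could chain to the previous hook, but nothing in this model (or in the situation described above) forces it to.

```python
class Process:
    """Toy process with exactly one free-hook slot."""

    def __init__(self):
        self.free_hook = None

    def free(self, addr):
        if self.free_hook is not None:
            self.free_hook(addr)


class Middleware:
    """Toy verbs-based library that keeps its own registration cache."""

    def __init__(self, name, process):
        self.name = name
        self.cached = set()               # addresses with cached registrations
        process.free_hook = self.on_free  # overwrites any earlier hook

    def on_free(self, addr):
        self.cached.discard(addr)


def demo():
    proc = Process()
    mpi = Middleware("mpi", proc)
    storage = Middleware("storage", proc)  # installed second, wins the slot
    mpi.cached.add(0x2000)
    storage.cached.add(0x2000)
    proc.free(0x2000)                      # only storage's hook is called
    return (0x2000 in mpi.cached, 0x2000 in storage.cached)
```

Here demo() leaves the first library holding a cache entry for memory that has already been freed, while the second library's cache is correctly cleaned up.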
Here's a story that encompasses both reasons:
We had Open MPI *not* use the registration cache by default for a long
time because of the danger it posed to applications. Users could
activate the registration cache with a simple command line parameter.
But nobody would do that -- they wanted to run with top performance
right out of the box (which is not unreasonable). It also led to
OMPI's competitors -- ahem, *YOU* at Sonoma 2009 (!) -- citing "look,
Open MPI's performance is bad! Our MPI's performance is GREAT!" Open
MPI therefore was forced to change its defaults in the 1.3 series to
activate the [dangerous] memory registration cache by default.
You mentioned that doing this stuff is a choice; the choice that MPIs/
ULPs/applications therefore have is:
- don't use registration caches/memory allocation hooking, have
terrible performance
- use registration caches/memory allocation hooking, have good
performance
Which is no choice at all. If customers pay top dollar for these
networks, they want to see benchmarks run out of the box that show
that they're getting every flop/byte-per-second that they can. The
fact that the programming model is needlessly complicated (and
dangerous) to get that performance is something that the MPIs have
tolerated because we had to for competition's sake.
This is not something that non-HPC customers will accept.
> Of the solutions that have been presented so far,
> I think the kernel notifier approach would be a better solution.
>
Note that Jason G. said in this thread: "Notifiers are going to be
very troublesome, every time any sort of synchronous to user space
notifier has been proposed or implemented in the kernel it has been a
disaster."
--
Jeff Squyres
Cisco Systems