[openib-general] Re: Question about pinning memory

Pete Wyckoff pw at osc.edu
Mon Jul 25 16:26:24 PDT 2005


mst at mellanox.co.il wrote on Tue, 26 Jul 2005 00:11 +0300:
> Quoting r. Pete Wyckoff <pw at osc.edu>:
> > I'm using copy_to_user() to shadow the number of available VM events
> > in the queue to an int32 in the user application.
> 
> So, do you keep all VM events in this queue, forever?
> Can a userspace application make this queue grow very large?

No, of course not.  Each event is currently 8 bytes:  the HCA handle and
the memory handle, as supplied at registration time.  The size of the
queue is configurable, and it only needs to be big enough to hold the
forced deregistrations caused by VM activity like sbrk(<0) and munmap()
between MPI (or other library) calls.  (If you add the signaling
mechanism proposed by Gleb, perhaps with a separate thread, it might be
possible to drain the queue immediately as it approaches fullness.)
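
For concreteness, here is a rough sketch of the event record and the
shadowed counter on the kernel side.  The struct and function names are
mine, just to make the 8-byte layout and the copy_to_user() shadowing
concrete; the real module may look different:

    /* Sketch only; names are illustrative, not the module's API. */
    #include <linux/types.h>
    #include <linux/uaccess.h>
    #include <linux/errno.h>

    /* One queued invalidation event, 8 bytes:  the HCA handle and the
     * memory registration handle supplied at registration time. */
    struct dreg_event {
        u32 hca_handle;
        u32 mem_handle;
    };

    struct dreg_queue {
        struct dreg_event *events;  /* ring of configurable depth */
        u32 depth;
        s32 num_events;             /* current fill level */
    };

    /* Shadow the fill level into an int32 in the user application, so
     * the library can check it without extra work. */
    static int dreg_shadow_count(struct dreg_queue *q, s32 __user *ucount)
    {
        if (copy_to_user(ucount, &q->num_events, sizeof(*ucount)))
            return -EFAULT;
        return 0;
    }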

If there are more invalidations than will fit in the queue, the library
gets an error on the next read() from the module's character device.  At
that point the library should deregister all of its dynamic buffers,
wipe its registration cache, re-open the device (perhaps with a larger
queue depth), and continue with a now-empty registration cache.
Strictly speaking, it must preserve the registrations that are still in
use and re-inform the kernel module about them, but all of this is
knowable.
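
In case it helps, here is roughly what that recovery path could look
like in the library.  The device name and the reg_cache_* helpers are
placeholders of my own, not anything that exists; this just lays out
the steps in order:

    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical registration-cache helpers, named here only for
     * illustration; the real library's interfaces will differ. */
    void reg_cache_invalidate(uint32_t hca_handle, uint32_t mem_handle);
    void reg_cache_deregister_idle(void);     /* drop cached, unused entries */
    void reg_cache_reannounce_in_use(int fd); /* re-report in-use buffers */

    struct dreg_event { uint32_t hca_handle, mem_handle; };
    static int dreg_fd;

    static void recover_from_overflow(void)
    {
        /* Wipe the cache, keeping only what is still in use, then start
         * over with a fresh (perhaps deeper) event queue. */
        reg_cache_deregister_idle();
        close(dreg_fd);
        dreg_fd = open("/dev/dreg", O_RDONLY);  /* device name is a guess */
        reg_cache_reannounce_in_use(dreg_fd);
    }

    static void drain_events(void)
    {
        struct dreg_event ev;
        ssize_t n;

        /* Consume queued invalidations; on a read() error the kernel
         * queue overflowed (as described above), so rebuild the cache. */
        while ((n = read(dreg_fd, &ev, sizeof(ev))) == sizeof(ev))
            reg_cache_invalidate(ev.hca_handle, ev.mem_handle);
        if (n < 0)
            recover_from_overflow();
    }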

Hopefully this never happens, but abusive benchmarks might trigger an
overflow:

    buf = mmap(NULL, 1000 * 1024, ...);
    for (i = 0; i < 1000; i++)
        MPI_Send(buf + i * 1024, 1024, ...);
    munmap(buf, 1000 * 1024);

Each MPI_Send causes the library to register a separate 1 kB chunk.  The
munmap() will force the invalidation of all those cached registrations,
putting 1000 events in the queue at once.  We hope people don't write
apps like this, but if they do, at least the overall mechanism fails in
a way that doesn't generate any user-observable fault.  You just get the
same lousy throughput as if there were no registration cache at all.

		-- Pete


