[openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK

Gleb Natapov glebn at voltaire.com
Wed Feb 15 06:25:29 PST 2006


On Wed, Feb 15, 2006 at 12:14:48PM +0200, Michael S. Tsirkin wrote:
> Quoting r. Gleb Natapov <glebn at voltaire.com>:
> > > Clarification: as I see it, longer term we want to add a flag to make
> > > get_user_pages trigger an immediate page copy on fork (rather than
> > > copy_ptes).
> >
> > Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)?
> 
> This should hopefully solve more than just the reg_mr issue, and not
> specific to infiniband. See e.g. here: http://lkml.org/lkml/2005/12/12/30
> So no, this will have to be a per-page flag: set by get_user_pages when
> passed some new option, and cleared by put_page when the page ref count
> drops to page map count.
> 
Yes, this is a very serious issue. I wonder why aio users don't complain all
over lkml. (Or do aio buffers have to be aligned?)

> BTW, I don't know when I will get around to working on it, so any help
> would be appreciated.
Do you think a new page flag is a viable solution, given the holy war
against new (and old) page flags? Besides, fork will have to go from pte to
struct page to check the flag for each mapped page in the process!

> 
> > > > Remember that we should gracefully handle overlapping registrations.
> >
> > > Right, and madvise doesn't do any refcounting. That's one reason not to
> > > include it in reg_mr.
> >
> > I beg to differ. I think this is exactly the reason to include it in
> > reg_mr. Otherwise each application should reinvent refcounting logic. It
> > is much better to do it right once instead of doing it wrong many times.
> 
> Talking about applications developed directly for infiniband again?
Is this a banned subject? Or is it not recommended for application
programmers to work directly with verbs?


> But why do you think they always use overlapping regions?
> 
I don't know. They should not have to care about these mundane details.

> > > Another is that madvise only works for full pages.
> >
> > Everything in VM works only for full pages. Unix doesn't try to hide this
> > from the user.
> 
> ibv_reg_mr works fine for sub-page regions. Doesn't it?
> 
Not really. It only gives the impression that it works: it does not return an
error and silently aligns the address and length for you. Same with mmap(): you
can pass a non-aligned length and it will not fail.
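
To make the rounding concrete, here is a minimal C sketch of the arithmetic.
cover_pages() and struct page_range are made-up names for illustration, not
part of libibverbs or libc, and this is not the actual ibv_reg_mr code. It
assumes the page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the whole-page range covering a request. */
struct page_range {
    uintptr_t start;  /* request start rounded down to a page boundary */
    size_t    length; /* multiple of page_size covering the whole request */
};

/*
 * Expand [addr, addr + len) to whole pages.
 * page_size must be a power of two, e.g. the value of sysconf(_SC_PAGESIZE).
 */
static struct page_range cover_pages(uintptr_t addr, size_t len,
                                     size_t page_size)
{
    struct page_range r;
    uintptr_t mask = (uintptr_t)page_size - 1;
    uintptr_t end;

    assert(page_size != 0 && (page_size & mask) == 0);

    r.start = addr & ~mask;              /* round start down */
    end = (addr + len + mask) & ~mask;   /* round end up */
    r.length = (size_t)(end - r.start);
    return r;
}
```

So a 32-byte buffer that straddles a page boundary ends up covering two full
pages, whether the caller asked for that or not.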

> > > Applications should be aware of these limitations, and I think the easiest
> > > way to achieve this is by asking them to use madvise directly.
> >
> > The problem is not in madvise but in the refcounting that each application
> > must maintain.
> 
> I don't really see a good way around this.
Why not do it only once, in the library that each RDMA application will have to use?
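
A rough sketch of what such library-side refcounting could look like, in C.
All names here (fork_tracker, tracker_register, tracker_unregister, advise)
are hypothetical and not an existing libibverbs interface. The fixed-size
table is a toy, and advise() only counts calls where a real library would
call madvise with MADV_DONTFORK or MADV_DOFORK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PAGES 64 /* toy limit; a real library would use a tree or hash */

struct fork_tracker {
    size_t    page_size;        /* must be a power of two */
    uintptr_t page[MAX_PAGES];  /* page start addresses */
    unsigned  refs[MAX_PAGES];  /* registrations per page, 0 = free slot */
    unsigned  dontfork_calls;   /* stand-in bookkeeping, see advise() */
    unsigned  dofork_calls;
};

/*
 * Stand-in for the real call. A library would do roughly
 *   madvise((void *)page, t->page_size,
 *           dontfork ? MADV_DONTFORK : MADV_DOFORK);
 * here; the sketch only counts calls so it runs without real mappings.
 */
static void advise(struct fork_tracker *t, uintptr_t page, int dontfork)
{
    (void)page;
    if (dontfork)
        t->dontfork_calls++;
    else
        t->dofork_calls++;
}

/* Return the slot tracking `page`, optionally claiming a free one. */
static int find_slot(struct fork_tracker *t, uintptr_t page, int create)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_PAGES; i++) {
        if (t->refs[i] != 0 && t->page[i] == page)
            return i;
        if (t->refs[i] == 0 && free_slot < 0)
            free_slot = i;
    }
    if (!create || free_slot < 0)
        return -1;
    t->page[free_slot] = page;
    return free_slot;
}

/* Take one reference on every page touched by [addr, addr + len). */
static int tracker_register(struct fork_tracker *t, uintptr_t addr, size_t len)
{
    uintptr_t mask = (uintptr_t)t->page_size - 1;
    uintptr_t end = (addr + len + mask) & ~mask;

    assert(t->page_size != 0 && (t->page_size & mask) == 0);
    for (uintptr_t p = addr & ~mask; p < end; p += t->page_size) {
        int i = find_slot(t, p, 1);
        if (i < 0)
            return -1; /* table full; no rollback in this sketch */
        if (t->refs[i]++ == 0)
            advise(t, p, 1); /* first user of this page */
    }
    return 0;
}

/* Drop one reference per page; the last one re-enables copying on fork. */
static int tracker_unregister(struct fork_tracker *t, uintptr_t addr, size_t len)
{
    uintptr_t mask = (uintptr_t)t->page_size - 1;
    uintptr_t end = (addr + len + mask) & ~mask;

    for (uintptr_t p = addr & ~mask; p < end; p += t->page_size) {
        int i = find_slot(t, p, 0);
        if (i < 0)
            return -1; /* range was never registered */
        if (--t->refs[i] == 0)
            advise(t, p, 0); /* last user of this page is gone */
    }
    return 0;
}
```

With this, two overlapping registrations that share a page mark it once, and
the page is only unmarked after both of them are gone. The application never
sees the counters.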

--
			Gleb.


