[openib-general] Re: Re: madvise MADV_DONTFORK/MADV_DOFORK

Michael S. Tsirkin mst at mellanox.co.il
Wed Feb 15 02:14:48 PST 2006


Quoting r. Gleb Natapov <glebn at voltaire.com>:
> > Clarification: as I see it, longer term we want to add a flag to make
> > get_user_pages trigger an immediate page copy on fork (rather than
> > copy_ptes).
>
> Can you elaborate? Do you mean one more VMA flag (VM_COPYONFORK)?

This should hopefully solve more than just the reg_mr issue, and it is not
specific to InfiniBand. See e.g. here: http://lkml.org/lkml/2005/12/12/30
So no, this will have to be a per-page flag: set by get_user_pages when
passed some new option, and cleared by put_page when the page ref count
drops to the page map count.

BTW, I don't know when I will get around to working on it, so any help
would be appreciated.

> > In this setup, MADV_DONTFORK will be used to speed up fork for an
> > application that has locked a big portion of its address space. With this in
> > mind:
> > 
> > Quoting r. Gleb Natapov <glebn at voltaire.com>:
> > > > > Should the call to madvise be part of the reg_mr call?
> > > > 
> > > > Probably no - MPI should have to do it.
> >
> > uDAPL as well, I guess.
> > 
> > > Then each userspace app will have to reinvent the wheel.
> >
> > I thought applications used MPI?
>
> I hope you don't think that infiniband is good only for HPC :) More and more
> organisations want to develop applications directly for infiniband without
> a middle layer. Not all of them want to understand deep VM magic to do so.

See my comment above. Once pages locked by get_user_pages are copied on fork,
madvise becomes just an optimization to speed up fork. So life as usual:
you need to get Linux-specific to get some speedup.

> > > Remember that we should gracefully handle overlapping registrations.
>
> > Right, and madvise doesn't do any refcounting. That's one reason not to
> > include it in reg_mr. 
>
> I beg to differ. I think this is exactly the reason to include it in
> reg_mr. Otherwise each application should reinvent refcounting logic. It
> is much better to do it right once instead of doing it wrong many times.

Talking about applications developed directly for infiniband again?
But why do you think they always use overlapping regions?

> > Another is that madvise only works for full pages.
>
> Everything in VM works only for full pages. Unix doesn't try to hide this
> from the user.

ibv_reg_mr works fine for sub-page regions. Doesn't it?
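
To illustrate the full-page limitation: a minimal sketch, assuming Linux with
MADV_DONTFORK available, of how an application might widen a sub-page buffer
to whole pages before calling madvise. The names page_range, page_span and
mark_dontfork are hypothetical helpers invented for this example; they are not
part of libibverbs or any existing library.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper type: a page-aligned start and a whole-page length. */
struct page_range {
    void  *start;
    size_t length;
};

/* Expand [addr, addr + len) to the smallest enclosing run of whole pages. */
struct page_range page_span(const void *addr, size_t len, size_t page_size)
{
    /* The mask arithmetic below requires a power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t mask  = ~((uintptr_t)page_size - 1);
    uintptr_t first = (uintptr_t)addr & mask;
    uintptr_t last  = ((uintptr_t)addr + len + page_size - 1) & mask;

    struct page_range r = { (void *)first, (size_t)(last - first) };
    return r;
}

/*
 * Ask the kernel not to map the pages covering the buffer into children
 * created by fork. Returns 0 on success, -1 on error (errno set by madvise).
 * Note that everything else living on those pages is hidden from the child
 * as well, which is the limitation discussed above.
 */
int mark_dontfork(const void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0)
        return -1;

    struct page_range r = page_span(addr, len, (size_t)ps);
    return madvise(r.start, r.length, MADV_DONTFORK);
}
```

An application would call mark_dontfork(buf, len) before registering buf, and
would issue the matching MADV_DOFORK on the same page range once nothing on
those pages is registered any more.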

> > Applications should be aware of these limitations, and I think the easiest
> > way to achieve this is by asking them to use madvise directly.
>
> The problem is not in madvise but in the refcounting that each application
> must maintain.

I don't really see a good way around this.
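
For reference, the kind of per-page refcounting under discussion could look
roughly like the sketch below. It assumes Linux with MADV_DONTFORK and
MADV_DOFORK, a single-threaded caller, and a small fixed-size table; the names
range_get, range_put and page_refcount are hypothetical and exist only in this
example. Only the first reference to a page issues MADV_DONTFORK, and only the
last release issues MADV_DOFORK, so overlapping registrations do not undo each
other.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 256   /* arbitrary table size for this sketch */

struct page_ref {
    uintptr_t page;     /* page-aligned address; valid while count > 0 */
    unsigned  count;    /* live registrations touching this page */
};

static struct page_ref refs[MAX_PAGES];

/* Find the live slot for a page; with create set, claim a free slot. */
static struct page_ref *lookup(uintptr_t page, int create)
{
    struct page_ref *free_slot = NULL;

    for (size_t i = 0; i < MAX_PAGES; i++) {
        if (refs[i].count && refs[i].page == page)
            return &refs[i];
        if (!refs[i].count && !free_slot)
            free_slot = &refs[i];
    }
    if (create && free_slot)
        free_slot->page = page;
    return create ? free_slot : NULL;
}

/* Number of live registrations covering the page that contains addr. */
unsigned page_refcount(const void *addr)
{
    uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
    struct page_ref *r = lookup((uintptr_t)addr & ~(ps - 1), 0);
    return r ? r->count : 0;
}

/* Take a reference on every page touched by [addr, addr + len). */
int range_get(const void *addr, size_t len)
{
    assert(len > 0);
    uintptr_t ps  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t end = (uintptr_t)addr + len;

    for (uintptr_t p = (uintptr_t)addr & ~(ps - 1); p < end; p += ps) {
        struct page_ref *r = lookup(p, 1);
        if (!r)
            return -1;  /* table full; real code would also roll back */
        if (r->count == 0 && madvise((void *)p, ps, MADV_DONTFORK) != 0)
            return -1;  /* slot stays free because count is still 0 */
        r->count++;
    }
    return 0;
}

/* Drop the references taken by a matching range_get call. */
int range_put(const void *addr, size_t len)
{
    assert(len > 0);
    uintptr_t ps  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t end = (uintptr_t)addr + len;

    for (uintptr_t p = (uintptr_t)addr & ~(ps - 1); p < end; p += ps) {
        struct page_ref *r = lookup(p, 0);
        if (!r)
            return -1;  /* unbalanced put */
        if (--r->count == 0 && madvise((void *)p, ps, MADV_DOFORK) != 0)
            return -1;
    }
    return 0;
}
```

Each registration would be bracketed by range_get and range_put on the same
address and length. A real implementation would need locking, error rollback
and a proper lookup structure instead of a linear scan.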

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


