[ofa-general] Re: [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli
andrea at qumranet.com
Thu Apr 24 08:39:43 PDT 2008
On Thu, Apr 24, 2008 at 04:51:12AM -0500, Robin Holt wrote:
> It seems to me the work done by mmu_notifier_mm_destroy should really
> be done inside the mm_lock()/mm_unlock area of mmu_unregister and
There's no mm_lock/unlock in mmu_notifier_unregister anymore. That's
the whole point of using SRCU: it makes unregister reliable and quick.
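For reference, the read side SRCU buys here looks roughly like this
(a sketch; the `list' field name and the ops-pointer check are
assumptions based on the code quoted further down):

	/* SRCU read side: cheap, sleepable, never blocks unregister */
	static void sketch_invalidate_range_start(struct mm_struct *mm,
						  unsigned long start,
						  unsigned long end)
	{
		struct mmu_notifier *mn;
		struct hlist_node *n;
		int idx;

		idx = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
		hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list,
					 hlist)
			if (mn->ops->invalidate_range_start)
				mn->ops->invalidate_range_start(mn, mm,
								start, end);
		srcu_read_unlock(&mm->mmu_notifier_mm->srcu, idx);
	}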
> mm_notifier_release when we have removed the last entry. That would
> give the user's job the same performance after they are done using the
> special device that they had prior to its use.
That's not feasible. Otherwise mmu_notifier_mm could go away at any
time, both under _release from exit_mmap and under _unregister.
exit_mmap holds an implicit mm_count reference, so freeing
mmu_notifier_mm after the last mmdrop makes it safe.
mmu_notifier_unregister holds the mm_count too, because mm_count was
pinned by mmu_notifier_register. That solves the issue of
mmu_notifier_mm going away from under mmu_notifier_unregister and
_release, and that's why it can only be freed after mm_count == 0.
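A minimal sketch of that lifetime rule (the exact hook placement in
__mmdrop is my assumption of where mmu_notifier_mm_destroy ends up):

	/* teardown only runs once mm_count has reached zero, so neither
	 * _release nor _unregister can still be using mmu_notifier_mm */
	void __mmdrop(struct mm_struct *mm)
	{
		BUG_ON(mm == &init_mm);
		mm_free_pgd(mm);
		destroy_context(mm);
		mmu_notifier_mm_destroy(mm);	/* frees mm->mmu_notifier_mm */
		free_mm(mm);
	}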
There's at least one small issue I noticed so far: while _release
doesn't need to care about _register, _unregister definitely needs to
care about _register. I'll have to take the mmap_sem in addition to,
or in place of, the unregister_lock. The srcu_read_lock can also
likely be moved to just before releasing the unregister_lock, but
that's just a minor optimization to make the code more strict.
> On Thu, Apr 24, 2008 at 08:49:40AM +0200, Andrea Arcangeli wrote:
> ...
> > diff --git a/mm/memory.c b/mm/memory.c
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> ...
> > @@ -603,25 +605,39 @@
> > * readonly mappings. The tradeoff is that copy_page_range is more
> > * efficient than faulting.
> > */
> > + ret = 0;
> > if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_PFNMAP|VM_INSERTPAGE))) {
> > if (!vma->anon_vma)
> > - return 0;
> > + goto out;
> > }
> >
> > - if (is_vm_hugetlb_page(vma))
> > - return copy_hugetlb_page_range(dst_mm, src_mm, vma);
> > + if (unlikely(is_vm_hugetlb_page(vma))) {
> > + ret = copy_hugetlb_page_range(dst_mm, src_mm, vma);
> > + goto out;
> > + }
> >
> > + if (is_cow_mapping(vma->vm_flags))
> > + mmu_notifier_invalidate_range_start(src_mm, addr, end);
> > +
> > + ret = 0;
>
> I don't think this is needed.
It's not needed, right, but I thought it was cleaner if they all used
"ret" after I had to change the code at the end of the function.
Anyway, I'll delete it to make the patch shorter and change only the
minimum, agreed.
> ...
> > +/* avoid memory allocations for mm_unlock to prevent deadlock */
> > +void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data)
> > +{
> > + if (mm->map_count) {
> > + if (data->nr_anon_vma_locks)
> > + mm_unlock_vfree(data->anon_vma_locks,
> > + data->nr_anon_vma_locks);
> > + if (data->i_mmap_locks)
>
> I think you really want data->nr_i_mmap_locks.
Indeed. It never happens that there are zero vmas with file-backed
mappings, which is why this couldn't be triggered in practice, thanks!
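For clarity, the corrected branch tests the count rather than the
pointer, mirroring the anon_vma case just above it in the quoted hunk:

	/* test the number of locks taken, not the array pointer */
	if (data->nr_i_mmap_locks)
		mm_unlock_vfree(data->i_mmap_locks,
				data->nr_i_mmap_locks);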
> The second paragraph of this comment seems extraneous.
ok removed.
> > + /*
> > + * Wait ->release if mmu_notifier_unregister run list_del_rcu.
> > + * srcu can't go away from under us because one mm_count is
> > + * hold by exit_mmap.
> > + */
>
> These two sentences don't make any sense to me.
Well, that was a short explanation of why the mmu_notifier_mm
structure can only be freed after the last mmdrop, which is what you
asked about at the top. I'll try to rephrase.
> > +void mmu_notifier_unregister(struct mmu_notifier *mn, struct mm_struct *mm)
> > +{
> > + int before_release = 0, srcu;
> > +
> > + BUG_ON(atomic_read(&mm->mm_count) <= 0);
> > +
> > + srcu = srcu_read_lock(&mm->mmu_notifier_mm->srcu);
> > + spin_lock(&mm->mmu_notifier_mm->unregister_lock);
> > + if (!hlist_unhashed(&mn->hlist)) {
> > + hlist_del_rcu(&mn->hlist);
> > + before_release = 1;
> > + }
> > + spin_unlock(&mm->mmu_notifier_mm->unregister_lock);
> > + if (before_release)
> > + /*
> > + * exit_mmap will block in mmu_notifier_release to
> > + * guarantee ->release is called before freeing the
> > + * pages.
> > + */
> > + mn->ops->release(mn, mm);
>
> I am not certain about the need to do the release callout when the driver
> has already told this subsystem it is done. For XPMEM, this callout
> would immediately return. I would expect it to be the same for GRU.
The point is that you don't want to run it twice. And without this,
you would have to serialize against ->release yourself in the
driver. It's much more convenient if you know that ->release will be
called exactly once, and before mmu_notifier_unregister returns. It
could be called by _release even after you're already inside
_unregister: _release may reach the spinlock before _unregister does,
and you won't notice the difference. Solving this race in the driver
looked too complex; I'd rather solve it once inside the mmu notifier
code to be sure. Missing a release event is fatal because all sptes
have to be dropped before _release returns. The requirement is the
same for _unregister: all sptes have to be dropped before it
returns. ->release should be able to sleep as long as it wants, even
with only 1/N applied. exit_mmap can sleep too, no problem. You can't
call mmu_notifier_unregister from inside ->release, first of all
because the memory holding the 'ret' instruction must still be
allocated in order to return to the mmu notifier code.
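In other words, a driver's ->release needs no once-only bookkeeping
of its own. A hypothetical sketch (my_xpmem_ctx and my_drop_all_sptes
are illustrative names, not from the patch):

	/* driver callback: guaranteed to run exactly once, either from
	 * exit_mmap via _release or from mmu_notifier_unregister */
	static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
	{
		struct my_xpmem_ctx *ctx =
			container_of(mn, struct my_xpmem_ctx, mn);

		/* all sptes must be dropped before returning */
		my_drop_all_sptes(ctx);
	}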