[ewg] Issues with fmr_pool

Olaf Kirch olaf.kirch at oracle.com
Wed Jan 16 09:34:50 PST 2008


Hi all,

I've been debugging a memory corruption in the RDS zerocopy code for
the past several days. Basically, when we tear down a socket and destroy
any existing MRs, RDMA writes that are in progress continue well after
we've freed the MR and flushed the fmr_pool.

After chasing several schools of red herrings I think I understand the
problem. I believe there are two bugs in the fmr_pool code.

The first bug is this:

The fmr_pool has a per-pool cleanup thread, which gets woken in two cases:
one, when there are too many FMRs on the dirty_list, and two, when the
user explicitly asks for a flush.

Now, ib_flush_fmr_pool synchronizes with the cleanup thread using two
atomic counters. One is a request serial number, which gets bumped
by ib_flush_fmr_pool; the other is the flush serial number, which
gets incremented whenever the cleanup thread actually flushes something.
When the two are equal, we've flushed everything, and the cleanup thread
can go back to sleep.
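For reference, here is that logic condensed from my reading of
drivers/infiniband/core/fmr_pool.c (locking, the flush_function callback
and the sleep/wakeup details are trimmed, so treat it as a sketch rather
than the literal source):

    /* ib_flush_fmr_pool(): record a flush request, then wait until
     * the cleanup thread's flush serial number catches up with it. */
    int ib_flush_fmr_pool(struct ib_fmr_pool *pool)
    {
            int serial;

            atomic_inc(&pool->req_ser);
            serial = atomic_read(&pool->req_ser);
            wake_up_process(pool->thread);

            if (wait_event_interruptible(pool->force_wait,
                        atomic_read(&pool->flush_ser) - serial >= 0))
                    return -EINTR;
            return 0;
    }

    /* Main loop of the cleanup thread: flush when the dirty_list is
     * too long *or* when a flush was requested, and bump flush_ser
     * in both cases. */
    static int ib_fmr_cleanup_thread(void *pool_ptr)
    {
            struct ib_fmr_pool *pool = pool_ptr;

            do {
                    if (pool->dirty_len >= pool->dirty_watermark ||
                        atomic_read(&pool->flush_ser) -
                        atomic_read(&pool->req_ser) < 0) {
                            ib_fmr_batch_release(pool);
                            atomic_inc(&pool->flush_ser);
                            wake_up_interruptible(&pool->force_wait);
                    }
                    /* ... sleep until woken again ... */
            } while (!kthread_should_stop());

            return 0;
    }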

Now the bad thing is, the two can get out of sync. When there are
too many FMRs on the dirty_list, the cleanup thread will perform a
flush as well, and bump the flush serial number. The next time
someone calls ib_flush_fmr_pool, the request serial number is incremented
and *is now equal* to the flush serial number - and nothing is flushed
at all.
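To make this concrete, assume both counters start at zero:

  1. dirty_len crosses dirty_watermark; the cleanup thread flushes and
     bumps the flush serial number to 1 (the request serial is still 0).
  2. Someone calls ib_flush_fmr_pool, which bumps the request serial
     number to 1.
  3. The wait condition (flush serial - request serial >= 0) is already
     true, so the caller returns immediately, even though none of its
     dirty FMRs have been invalidated.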

The second bug (or maybe it's just a misunderstanding on my part) has
far worse consequences.

When we release an FMR using ib_fmr_pool_unmap, it will do one of two
things. If the FMR's remap_count is less than max_remaps, it will
be added to the free_list right away; otherwise it will be added to
the dirty_list.
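That decision looks roughly like this (again condensed from fmr_pool.c):

    int ib_fmr_pool_unmap(struct ib_pool_fmr *fmr)
    {
            struct ib_fmr_pool *pool = fmr->pool;
            unsigned long flags;

            spin_lock_irqsave(&pool->pool_lock, flags);

            --fmr->ref_count;
            if (!fmr->ref_count) {
                    if (fmr->remap_count < pool->max_remaps) {
                            /* Still remappable: straight back onto
                             * the free_list, mapping intact. */
                            list_add_tail(&fmr->list, &pool->free_list);
                    } else {
                            /* Out of remaps: queue for unmapping. */
                            list_add_tail(&fmr->list, &pool->dirty_list);
                            ++pool->dirty_len;
                            wake_up_process(pool->thread);
                    }
            }

            spin_unlock_irqrestore(&pool->pool_lock, flags);
            return 0;
    }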

Now when the user calls ib_flush_fmr_pool, it will only inspect the
dirty_list, but leave the free_list alone. So while we *think* we
have invalidated all previously freed FMRs, most of them stay
active because they're not inspected *at all*. So ib_flush_fmr_pool does
nothing 31 out of 32 times (32 is the default max_remaps value).
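You can see this in ib_fmr_batch_release, which the cleanup thread calls
to do the actual flushing (condensed, error handling dropped):

    static void ib_fmr_batch_release(struct ib_fmr_pool *pool)
    {
            struct ib_pool_fmr *fmr;
            LIST_HEAD(unmap_list);
            LIST_HEAD(fmr_list);

            spin_lock_irq(&pool->pool_lock);

            /* Only the dirty_list is ever walked; FMRs parked on the
             * free_list keep their mappings alive. */
            list_for_each_entry(fmr, &pool->dirty_list, list) {
                    hlist_del_init(&fmr->cache_node);
                    fmr->remap_count = 0;
                    list_add_tail(&fmr->fmr->list, &fmr_list);
            }

            list_splice(&pool->dirty_list, &unmap_list);
            INIT_LIST_HEAD(&pool->dirty_list);
            pool->dirty_len = 0;

            spin_unlock_irq(&pool->pool_lock);

            ib_unmap_fmr(&fmr_list);

            spin_lock_irq(&pool->pool_lock);
            list_splice(&unmap_list, &pool->free_list);
            spin_unlock_irq(&pool->pool_lock);
    }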

I will post two patches for these issues in follow-up emails. In general,
however, I wonder if the fmr_pool interface is really optimal. The major
concern I have is that the whole page pinning, mapping and unmapping
business is the caller's responsibility, but we don't know when the
underlying MR really goes away. So in order to be on the safe side,
the caller has to keep any pages mapped and pinned until the next
call to ib_flush_fmr_pool. IMHO it would be very useful if there were a
callback function that lets you know that a particular MR was
zapped. I guess something like this could be engineered using the
flush_function, but that's really a very spartan interface, and requires
you to keep your deceased MRs on yet another list for later disposal.
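Purely as a strawman - none of this exists in the current API - I'm
thinking of something along these lines, added to the pool creation
parameters:

    /* Hypothetical extension to struct ib_fmr_pool_param: notify the
     * pool's owner once each individual FMR mapping has really been
     * invalidated by ib_unmap_fmr(), so the pinned pages backing just
     * that mapping can be released at that point. */
    struct ib_fmr_pool_param {
            /* ... existing fields (max_pages_per_fmr, dirty_watermark,
             * flush_function, flush_arg, ...) ... */

            /* Called once per FMR after its mapping is truly gone. */
            void (*unmap_function)(struct ib_pool_fmr *fmr, void *arg);
            void *unmap_arg;
    };

A consumer like RDS could then unpin the pages belonging to that one
mapping from the callback, instead of keeping everything pinned until
the next pool-wide flush.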

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


