[ofa-general] Re: [RFC PATCH] rds: enable rdma on iWARP

Mon Jul 28 09:13:22 PDT 2008

On Monday 28 July 2008 17:29:20 Jon Mason wrote:
> The bulk of this patch is removing the pre-existing posting of the invalidate
> logic and adding it prior to the fastreg send posting.  The previous logic
> assumed that posting an invalidate to a dummy qp would successfully invalidate
> the entry.  Unfortunately, the invalidate must be posted on the same qp as the
> fastreg and the pre-existing logic does not have a way to get the qp the fastreg
> is posted on.

Then I don't see how this is going to work, ever. When the Oracle IPC
creates an MR, we do not know yet with which peer it wants to use it.
And in fact it may want to use the same MR with several peers... I'm
not sure about that detail but I think that's the case.

First off, there's a semantic snag here which I wanted to avoid by
*not* pairing the inval with the remap. Essentially, if the application
calls FREE_MR with the invalidate flag set, it actually expects *all*
previously freed MRs to be invalidated. In the FMR world, this
amounts to unmapping all FMRs on the dirty list, and batch destroying
them - this means we clean up the host side data structures, then issue
a SYNC_TPT, and we're done. Not fast, but if you get good batching it
doesn't slow you down too much.

Now when you pair remaps and invalidates, you will get MRs that are
in the process of being remapped, but the LOCAL_INV hasn't completed
yet - so you need to add a lot of tracking for these. I tried,
and it became very ugly very fast. That's why I used separate code
paths for remap and invalidate - and as far as I understand there's
no problem with that. The r_key's remap counter gets incremented
every time you map something, so you essentially get a different
r_key each time. Am I correct that with this approach, you can have
an RKEY(b, v) made up of a base stag b and a version counter v,
so that you can map
	MAP RKEY(b, 0) -> some memory
	MAP RKEY(b, 1) -> some other memory
	MAP RKEY(b, 2) -> yet some other memory
	INVAL RKEY(b, 0)
	INVAL RKEY(b, 1)
and so on? Or does the HCA driver keep pointers to the caller's
data structures around somewhere so that repeated MAP requests
without intervening INVAL would lead to corruption?
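
To make the RKEY(b, v) notation concrete, here is a minimal sketch in C.
It assumes the version counter v occupies the low 8 bits of a 32-bit key
and the base stag b the upper 24 bits; that layout, and the names
rkey_make and rkey_remap, are assumptions for illustration and not taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: low 8 bits = version counter v, upper 24 bits = base stag b. */
#define RKEY_VERSION_MASK 0xffu

/* Build RKEY(b, v) from a base stag and a version counter. */
static uint32_t rkey_make(uint32_t base, uint8_t version)
{
	return (base & ~RKEY_VERSION_MASK) | version;
}

/* Produce the key for the next MAP: same base, version + 1 (wraps at 256). */
static uint32_t rkey_remap(uint32_t rkey)
{
	uint8_t next = (uint8_t)((rkey & RKEY_VERSION_MASK) + 1);

	assert((rkey_make(rkey, next) & ~RKEY_VERSION_MASK) ==
	       (rkey & ~RKEY_VERSION_MASK));
	return rkey_make(rkey, next);
}
```

Under this assumption each MAP hands out a key that differs from the
previous one only in v, and the counter wraps after 256 remaps of the
same base.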

If that is the case, I would leave the approach of a separate map
and inval in place, because free+invalidate becomes rather simple
with this: you just post all the inval requests and wait for them
to complete.
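
The "post everything, then wait" idea can be sketched roughly as below.
This is a model only: batch_post_inval stands in for posting a LOCAL_INV
work request and batch_complete_one for handling its completion. All
names here are hypothetical, and no real verbs calls are made.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct inval_batch {
	unsigned int outstanding;  /* inval requests posted but not yet completed */
};

/* Hypothetical stand-in for posting one LOCAL_INV work request. */
static void batch_post_inval(struct inval_batch *b, uint32_t rkey)
{
	(void)rkey;  /* a real implementation would put the key in the request */
	b->outstanding++;
}

/* Called once per inval completion. */
static void batch_complete_one(struct inval_batch *b)
{
	assert(b->outstanding > 0);
	b->outstanding--;
}

/* FREE_MR with the invalidate flag may return once this is true. */
static int batch_done(const struct inval_batch *b)
{
	return b->outstanding == 0;
}

/* Post an inval for every key on the dirty list. */
static void batch_post_all(struct inval_batch *b, const uint32_t *dirty, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		batch_post_inval(b, dirty[i]);
}
```

The only state needed is a count of outstanding requests, which is what
keeps this path simple compared to tracking MRs that are mid-remap.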

My original approach was rather simplistic, in that I wanted to
post the INVAL request to just a single dummy QP. If that doesn't
work (which I think is a deficiency of the interface) then we
need to record the original rds_conn somewhere with the mapping,
so that we know which QP to post it to.
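
Recording the connection with the mapping might look something like the
sketch below. The struct and function names are made up for illustration
and are not the actual RDS data structures. Note the -1 case: a mapping
that was never used with any peer has no QP to post the inval to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical; they are not the RDS structures. */
struct sketch_conn {
	int qp_num;                /* QP currently backing this connection */
};

struct sketch_mapping {
	uint32_t rkey;
	struct sketch_conn *conn;  /* connection the fastreg was posted on, or NULL */
};

/* Remember the connection at the time the fastreg is posted. */
static void mapping_bind(struct sketch_mapping *m, struct sketch_conn *c)
{
	assert(m != NULL && c != NULL);
	m->conn = c;
}

/* Return the QP to post the inval on, or -1 if the mapping was never bound. */
static int mapping_inval_qp(const struct sketch_mapping *m)
{
	return m->conn ? m->conn->qp_num : -1;
}
```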

However, I still have doubts all of this will work very well.
If you have to pipeline R_Key invalidations to a variety of QPs,
you may face QPs that are heavily contended - actually so much
that you may not even be able to get a single inval request onto
the queue because the application keeps hogging the pipe with SENDs
or other transactions. IOW a single SEND intensive application can
starve another app calling FREE+invalidate almost indefinitely.

Second, what do you do if a QP errors out? Does that render all
R_Keys issued previously on that QP invalid? That sounds like a
real knockout problem to me. Imagine 1000 processes doing RDMA,
all of them busily obtaining r_keys for some mapping, and asking
the remote to rdma to/from that memory. Now thanks to an application
bug, *one* transfer refers to an r_key that's bogus. Now your
connection goes down with remote access error. Big deal, RDS
will just nix all outstanding RDMA transfers, reconnect and create
a new QP.

Now, if it is actually as you say and memory registrations are
bound to the QP they were created on, this means all previously
created mappings have been invalidated at once. What happens next?
Some applications will obtain a fresh mapping and retry their
RDMA. Others will have been lucky, and didn't have an RDMA in flight
that was dropped on the floor - they will initiate an RDMA with
an r_key that is suddenly no longer valid! Guess what happens - the
connection goes down again!
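
The cascade can be modelled with a toy simulation. It rests on strong
assumptions: every registration is bound to the QP generation it was
created on, apps with an RDMA in flight learn about a teardown and
re-register, and idle apps do not. The names and the model itself are
hypothetical and only meant to illustrate the argument.

```c
#include <assert.h>
#include <stddef.h>

struct sim_app {
	unsigned int key_gen;  /* QP generation the app's r_key was created on */
	int in_flight;         /* nonzero if the app has an RDMA outstanding */
};

struct sim_conn {
	unsigned int gen;        /* bumped on every reconnect, i.e. every new QP */
	unsigned int teardowns;  /* how often the connection went down */
};

/* Connection goes down: new QP, and only in-flight apps re-register. */
static void sim_teardown(struct sim_conn *c, struct sim_app *apps, size_t n)
{
	size_t i;

	c->gen++;
	c->teardowns++;
	for (i = 0; i < n; i++)
		if (apps[i].in_flight)
			apps[i].key_gen = c->gen;
}

/* App i starts an RDMA; a stale r_key takes the connection down again.
 * Returns 1 if the key was still valid, 0 otherwise. */
static int sim_start_rdma(struct sim_conn *c, struct sim_app *apps,
			  size_t n, size_t i)
{
	int valid;

	assert(i < n);
	valid = (apps[i].key_gen == c->gen);
	apps[i].in_flight = 1;
	if (!valid)
		sim_teardown(c, apps, n);
	return valid;
}
```

In this model, one initial error followed by k idle apps each using
their old r_key once gives 1 + k teardowns before things settle.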

This looks a lot like a network Chernobyl to me.
Or, if you will, a design flaw. A mapping obtained on a
given QP should be usable with other QPs bound to the same
device, and you should be able to invalidate it on any QP
bound to the same device.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax