[ofa-general] Re: [RFC PATCH] rds: enable rdma on iWARP

Steve Wise swise at opengridcomputing.com
Mon Jul 28 10:56:00 PDT 2008


Olaf Kirch wrote:
> On Monday 28 July 2008 17:29:20 Jon Mason wrote:
>   
>> The bulk of this patch removes the pre-existing invalidate posting logic
>> and adds it prior to the fastreg send posting.  The previous logic
>> assumed that posting an invalidate to a dummy qp would successfully invalidate
>> the entry.  Unfortunately, the invalidate must be posted on the same qp as the
>> fastreg, and the pre-existing logic does not have a way to get the qp the
>> fastreg is posted on.
>>     
>
>   

This isn't quite correct.  The invalidate must be posted on a connected
qp in the same pd, but it doesn't have to be the same qp as the
fastreg.  However, if you use different qps, then you must ensure
you're done using the mr before invalidating it.
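
A rough sketch of what that could look like with the fastreg verbs
headed into 2.6.27 (IB_WR_LOCAL_INV and the ex.invalidate_rkey field);
the function name is made up and this is not code from the RDS patch.
The qp passed in can be any connected QP in the same PD as the MR:

/*
 * Illustration only: post a LOCAL_INV for rkey on a connected QP in
 * the same PD as the fastreg MR.  The caller must guarantee the
 * mapping is no longer in use before posting this.
 */
static int post_local_inv(struct ib_qp *qp, u32 rkey)
{
	struct ib_send_wr wr, *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.opcode = IB_WR_LOCAL_INV;
	wr.send_flags = IB_SEND_SIGNALED;
	wr.ex.invalidate_rkey = rkey;

	return ib_post_send(qp, &wr, &bad_wr);
}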


> Then I don't see how this is going to work, ever. When the Oracle IPC
> creates an MR, we do not know yet with which peer it wants to use it.
> And in fact it may want to use the same MR with several peers... I'm
> not sure about that detail but I think that's the case.
>
> First off, there's a semantic snag here which I wanted to avoid by
> *not* pairing the inval with the remap. Essentially, if the application
> calls FREE_MR with the invalidate flag set, it actually expects *all*
> previously freed MRs to be invalidated. In the FMR world, this
> amounts to unmapping all FMRs on the dirty list, and batch destroying
> them - this means we clean up the host side data structures, then issue
> a SYNC_TPT, and we're done. Not fast, but if you get good batching it
> doesn't slow you down too much.
>
> Now when you pair remaps and invalidates, you will get MRs that are
> in the process of being remapped, but the LOCAL_INV hasn't completed
> yet - so you need to add a lot of tracking for these. I tried,
> and it became very ugly very fast. That's why I used separate code
> paths for remap and invalidate - and as far as I understand there's
> no problem with that. The r_key's remap counter gets incremented
> every time you map something, so you essentially get a different
> r_key each time. Am I correct that with this approach, you can have
> an RKEY(b, v) made up of a base stag b and a version counter v?
> You can map
> 	MAP RKEY(b, 0) -> some memory
> 	MAP RKEY(b, 1) -> some other memory
> 	MAP RKEY(b, 2) -> yet some other memory
> 	INVAL RKEY(b, 0)
> 	INVAL RKEY(b, 1)
>   

No, you must invalidate the MR between fastreg calls, like this:

FASTREG RKEY(b, 0) -> some memory
INVALIDATE RKEY(b, 0)
FASTREG RKEY(b, 1) -> some other memory
INVALIDATE RKEY(b, 1)
FASTREG RKEY(b, 2) -> yet some other memory

If you post them all on the same QP, then you can use fencing to keep the
pipeline full.  If you want to use different qps for the invalidates,
then you must ensure you only invalidate a mapping once you're done using it.
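
Roughly, the fenced same-QP case could look like the sketch below
(illustrative only, not from the RDS patch, and assuming the 2.6.27
fastreg verbs: ib_alloc_fast_reg_mr(), IB_WR_FAST_REG_MR,
IB_WR_LOCAL_INV and ib_update_fast_reg_key()).  The invalidate and the
fastreg are chained, and the fence keeps the fastreg from starting
before the invalidate completes, so nothing has to poll for a
completion in between:

/*
 * Illustration only: steady-state remap of a fastreg MR on one QP
 * (the very first fastreg has nothing to invalidate yet).
 */
static int remap_fastreg_mr(struct ib_qp *qp, struct ib_mr *mr,
			    struct ib_fast_reg_page_list *pl,
			    int npages, u64 iova, u32 len, u8 key)
{
	struct ib_send_wr inv_wr, freg_wr, *bad_wr;

	memset(&inv_wr, 0, sizeof(inv_wr));
	inv_wr.opcode = IB_WR_LOCAL_INV;
	inv_wr.ex.invalidate_rkey = mr->rkey;	/* old RKEY(b, v) */
	inv_wr.next = &freg_wr;

	/* bump the 8-bit key so the new mapping gets RKEY(b, v + 1) */
	ib_update_fast_reg_key(mr, key);

	memset(&freg_wr, 0, sizeof(freg_wr));
	freg_wr.opcode = IB_WR_FAST_REG_MR;
	freg_wr.send_flags = IB_SEND_FENCE | IB_SEND_SIGNALED;
	freg_wr.wr.fast_reg.iova_start = iova;
	freg_wr.wr.fast_reg.page_list = pl;
	freg_wr.wr.fast_reg.page_list_len = npages;
	freg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
	freg_wr.wr.fast_reg.length = len;
	freg_wr.wr.fast_reg.access_flags = IB_ACCESS_LOCAL_WRITE |
					   IB_ACCESS_REMOTE_READ |
					   IB_ACCESS_REMOTE_WRITE;
	freg_wr.wr.fast_reg.rkey = mr->rkey;	/* new RKEY(b, v + 1) */

	return ib_post_send(qp, &inv_wr, &bad_wr);
}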

> and so on? Or does the HCA driver keep pointers to the caller's
> data structures around somewhere so that repeated MAP requests
> without intervening INVAL would lead to corruption?
>   

It doesn't have to do with the caller's data structs.  The simple fact
is, you cannot fast register an MR to more than one PBL at a time.  If
you think about adapter resources, there is a single MR entry for the
fast reg MR and one PBL entry for whatever the PBL is for the current
mapping.

> If that is the case, I would leave the approach of a separate map
> and inval in place, because free+invalidate becomes rather simple
> with this: you just post all the inval requests and wait for them
> to complete.
>   

Note you can just dereg the MR to invalidate the last mapping, i.e. you
don't need to post an invalidate if you are going to call ib_dereg_mr()
to destroy the fast reg MR.
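
For illustration (again not from the patch, names made up), the
teardown path then only needs:

/* No LOCAL_INV first: deregistering the fastreg MR invalidates its
 * last mapping. */
static void destroy_fastreg_mr(struct ib_mr *mr,
			       struct ib_fast_reg_page_list *pl)
{
	ib_free_fast_reg_page_list(pl);
	ib_dereg_mr(mr);
}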

> My original approach was rather simplistic, in that I wanted to
> post the INVAL request to just a single dummy QP. If that doesn't
> work (which I think is a deficiency of the interface) then we
> need to record the original rds_conn somewhere with the mapping,
> so that we know which QP to post it to.
>   

First off, any send WR posted to a QP not in RTS does nothing.  For
iWARP QPs, a QP only enters RTS once it is connected.  So a dummy
QP just won't work.  We can argue about the interface deficiencies if
you want, but the semantics are part of the IBTA and iWARP specs, so we
probably shouldn't change them much.


> However, I still have doubts all of this will work very well.
> If you have to pipeline R_Key invalidations to a variety of QPs,
> you may face QPs that are heavily contended - actually so much
> that you may not even be able to get a single inval request onto
> the queue because the application keeps hogging the pipe with SENDs
> or other transactions. IOW a single SEND intensive application can
> starve another app calling FREE+invalidate almost indefinitely.
>
>   
I would think a single SEND-intensive app could starve other apps trying
to use the same QP anyway, so you must have some sort of fairness logic, eh?

> Second, what do you do if a QP errors out? Does that render all
> R_Keys issued previously on that QP invalid? 

No.  The R_Keys aren't tied to the QP, except that if you have pending
fastreg or invalidate WRs, then that R_Key is tied to that QP until the
WRs complete.

> That sounds like a
> real knockout problem to me. Imagine 1000 processes doing RDMA,
> all of them busily obtaining r_keys for some mapping, and asking
> the remote to rdma to/from that memory. Now thanks to an application
> bug, *one* transfer refers to an r_key that's bogus. Now your
> connection goes down with remote access error. Big deal, RDS
> will just nix all outstanding RDMA transfers, reconnect and create
> a new QP.
>
> Now, if it is actually as you say and memory registrations are
> bound to the QP they were created on, this means all previously
> created mappings have been invalidated at once. What happens next?
> Some applications will obtain a fresh mapping and retry their
> RDMA. Others will have been lucky, and didn't have a RDMA in flight
> that was dropped on the floor - they will initiate a RDMA with
> a r_key that is suddenly no longer valid! Guess what happens - the
> connection goes down again!
>
> This looks a lot like network chernobyl to me.
> Or, if you will, a design flaw. A mapping obtained on a
> given QP should be usable with other QPs bound to the same
> device, and you should be able to invalidate it on any QP
> bound to the same device.
>
>   

It is.  You can.


> Olaf
>   



