[ewg] [PATCH 1/2] fmr_pool flush serials can get out of sync

Roland Dreier rdreier at cisco.com
Fri Jan 18 14:12:03 PST 2008


 > The corruption happened when the process that allocated the MRs went
 > away in the middle of the operation. We would free the MR, invalidate
 > it, and expect the in-flight RDMA to error out. RDS does not know who
 > is doing RDMA to or from an MR at any given time.

OK, I see.  Of course, this error will move your QP to the error state,
causing in-flight operations on behalf of other processes to fail and
need to be reissued after you reconnect.  It seems like a bit of a mess,
but I don't see a way around it if you want to multiplex direct access
operations to multiple different processes over the same QP.

 > When RDS performs an RDMA, the initiator will queue two work requests -
 > one for the actual RDMA, immediately followed by a normal SEND with
 > an RDS packet. When the consumer sees that RDS packet, it will
 > release the MR to which the RDMA was directed.
 > 
 > Is that a safe thing to do? I found the spec a little unclear on
 > the ordering rules. It *seems* that RDMA writes are always fencing
 > against subsequent operations, and RDMA reads will fence if we ask
 > for it. But I'm not perfectly sure whether the ordering applies
 > to the sending system only, or if IB also guarantees that the
 > RDMA will have completed when it puts the incoming message on
 > the completion queue at the consumer.

I believe this is safe.  I can't point to chapter and verse in the
spec, but operations are supposed to complete in order, so I don't
think that the receive completion can appear before earlier responder
operations have completed.
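
To make the ordering argument concrete, here is a minimal toy model in C
of the "RDMA, then SEND" pattern described in the quoted text.  It is only
a sketch: every type and function name below is hypothetical, none of it
is the verbs API, and in-order processing at the responder is built in as
an assumption rather than demonstrated.  It covers only the RDMA write
case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical types, not the verbs API. */
enum wr_opcode { WR_RDMA_WRITE, WR_SEND };

struct work_request {
    enum wr_opcode opcode;
    const char *payload;    /* RDMA write data, or the RDS packet for a SEND */
};

struct memory_region {
    char buf[32];
    bool valid;             /* false once the consumer has released the MR */
};

struct responder {
    struct memory_region mr;
    int writes_placed;      /* RDMA writes that landed in the MR */
    int writes_rejected;    /* RDMA writes that hit a released MR */
    int recv_completions;   /* SENDs surfaced to the consumer */
};

/* Handle one incoming request.  ASSUMPTION: requests on a QP are handled
 * strictly in the order in which the initiator posted them. */
static void responder_process(struct responder *r,
                              const struct work_request *wr)
{
    switch (wr->opcode) {
    case WR_RDMA_WRITE:
        if (!r->mr.valid) {
            r->writes_rejected++;   /* the RDMA errors out */
            break;
        }
        strncpy(r->mr.buf, wr->payload, sizeof(r->mr.buf) - 1);
        r->mr.buf[sizeof(r->mr.buf) - 1] = '\0';
        r->writes_placed++;
        break;
    case WR_SEND:
        /* The consumer sees the RDS packet and releases the MR. */
        r->recv_completions++;
        r->mr.valid = false;
        break;
    }
}

/* Initiator side: queue the RDMA, immediately followed by the SEND. */
static void initiator_rdma_then_send(struct responder *r, const char *data)
{
    struct work_request chain[2] = {
        { WR_RDMA_WRITE, data },
        { WR_SEND, "rds-packet" },
    };

    for (size_t i = 0; i < 2; i++)
        responder_process(r, &chain[i]);
}
```

In this model the MR is released only after the write has been placed, so
the data is intact when the consumer sees the RDS packet.  A write that
arrives after the release is counted in writes_rejected, which mirrors
the case of freeing the MR and expecting the in-flight RDMA to error out.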

 - R.



