[ewg] [PATCH 1/2] fmr_pool flush serials can get out of sync
rdreier at cisco.com
Fri Jan 18 14:12:03 PST 2008

> The corruption happened when the process that allocated the MRs went
> away in the middle of the operation. We would free the MR and invalidate
> it, and expect the in-flight RDMA to error out. RDS does not know who is
> doing RDMA to or from an MR at any given time.

OK, I see. Of course this error will move your QP to the error state
and cause other in-flight operations on behalf of other processes to
fail and need to be reissued after you reconnect. Seems like a bit of
a mess but I don't see a way around it if you want to multiplex direct
access operations to multiple different processes over the same QP.
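
To make the cost of that concrete, here is a minimal toy model in C of one QP shared by several processes. It is only a sketch under stated assumptions: every name in it (toy_req, toy_qp_drain, owner_pid, the toy_status values) is hypothetical and is not part of RDS or the verbs API. It assumes only the behaviour described above: the first request that touches an invalidated MR moves the QP to the error state, and everything queued behind it is flushed and has to be reissued after reconnecting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are illustrative, not real verbs identifiers. */
enum toy_status { TOY_SUCCESS, TOY_REM_ACCESS_ERR, TOY_FLUSH_ERR };

struct toy_req {
    int owner_pid;          /* hypothetical: process the request was posted for */
    bool mr_valid;          /* false if the target MR was invalidated in flight */
    enum toy_status status; /* filled in by toy_qp_drain() */
};

/*
 * Drain the requests queued on one shared QP, in posting order.
 * The first request that hits an invalid MR fails and puts the QP into
 * the error state; every later request is flushed regardless of owner.
 * Returns the number of flushed requests, i.e. those needing a reissue.
 */
static size_t toy_qp_drain(struct toy_req *reqs, size_t n)
{
    bool qp_error = false;
    size_t flushed = 0;

    assert(reqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (qp_error) {
            reqs[i].status = TOY_FLUSH_ERR;
            flushed++;
        } else if (!reqs[i].mr_valid) {
            reqs[i].status = TOY_REM_ACCESS_ERR;
            qp_error = true;
        } else {
            reqs[i].status = TOY_SUCCESS;
        }
    }
    return flushed;
}
```

With four queued requests where only the second one targets a freed MR, the first succeeds, the second fails, and the remaining two are flushed even though they belong to unrelated processes.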

> When RDS performs an RDMA, the initiator will queue two work requests -
> one for the actual RDMA, immediately followed by a normal SEND with
> an RDS packet. When the consumer sees that RDS packet, it will
> release the MR to which the RDMA was directed.
> Is that a safe thing to do? I found the spec a little unclear on
> the ordering rules. It *seems* that RDMA writes are always fencing
> against subsequent operations, and RDMA reads will fence if we ask
> for it. But I'm not perfectly sure whether the ordering applies
> to the sending system only, or if IB also guarantees that the
> RDMA will have completed when it puts the incoming message on
> the completion queue at the consumer.

I believe this is safe. I can't point to chapter and verse in the
spec, but operations are supposed to complete in order, so I don't
think that the receive completion can appear before earlier responder
operations have completed.
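
The ordering argument can be sketched as a toy model in C. This is an illustration, not verbs code: toy_wr, toy_mr and toy_responder_run are hypothetical names, and the model simply assumes what is argued above, namely that the responder executes operations in the order they were posted, so the receive completion for the SEND cannot be seen before the preceding RDMA write has landed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: none of these names come from RDS or the verbs API. */
enum toy_op { TOY_RDMA_WRITE, TOY_SEND };

struct toy_wr {
    enum toy_op op;
    const char *payload;    /* data for an RDMA write; unused for a SEND */
};

struct toy_mr {
    char buf[32];
    bool valid;             /* false once the consumer has released the MR */
};

/*
 * Responder side: execute work requests strictly in posting order.
 * A SEND stands for the RDS packet; on seeing it the consumer releases
 * the MR. Returns the number of RDMA writes that arrived after the MR
 * was released (0 means the release was safe).
 */
static int toy_responder_run(struct toy_mr *mr, const struct toy_wr *wrs,
                             size_t n)
{
    int late_writes = 0;

    assert(mr != NULL);
    for (size_t i = 0; i < n; i++) {
        if (wrs[i].op == TOY_RDMA_WRITE) {
            if (!mr->valid) {
                late_writes++;      /* write hit a released MR */
                continue;
            }
            strncpy(mr->buf, wrs[i].payload, sizeof(mr->buf) - 1);
            mr->buf[sizeof(mr->buf) - 1] = '\0';
        } else {
            mr->valid = false;      /* consumer releases the MR */
        }
    }
    return late_writes;
}
```

Posting the RDMA write followed by the SEND gives zero late writes in this model. Swapping the two shows the failure the in-order rule is relied on to prevent: the write arrives after the MR has been released.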