[ewg] [PATCH 1/2] fmr_pool flush serials can get out of sync
Olaf Kirch
olaf.kirch at oracle.com
Mon Jan 21 01:13:14 PST 2008
On Friday 18 January 2008 23:12, Roland Dreier wrote:
> > The corruption happened when the process that allocated the MRs went
> > away in the middle of the operation. We would free the MR and invalidate
> > - and expect the in-flight RDMA to error out. RDS does not know who is
> > doing RDMA to or from an MR at any given time.
>
> OK, I see. Of course this error will move your QP to the error state
> and cause other in-flight operations on behalf of other processes to
> fail and need to be reissued after you reconnect. Seems like a bit of
> a mess but I don't see a way around it if you want to multiplex direct
> access operations to multiple different processes over the same QP.

Yes, and that's the whole point of RDS. Sockets are unconnected and you
use sendto, else we'd drown in sockets. I will readily agree that this
approach, while it's fast and simple, does get us into a bit of a mess
sometimes :-)
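
To illustrate the unconnected model, here is a minimal sketch in which one
bound socket talks to several peers and names the destination on every
sendto() call. It uses UDP on loopback purely as a stand-in, since that runs
anywhere; a real RDS socket would, as far as I know, be created with AF_RDS
and SOCK_SEQPACKET instead. The helper names bound_socket() and
sendto_two_peers() are hypothetical and exist only for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a datagram socket bound to an ephemeral
 * loopback port and report the chosen address in *addr.
 * Returns the fd, or -1 on error. */
static int bound_socket(struct sockaddr_in *addr)
{
    socklen_t len = sizeof(*addr);
    struct timeval tv = { 1, 0 };   /* do not block forever in recv() */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = 0;             /* let the kernel pick a port */
    if (bind(fd, (struct sockaddr *)addr, sizeof(*addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)addr, &len) < 0 ||
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Send one datagram to each of two peers from a single unconnected socket.
 * Returns the number of peers that received their message, or -1 on error. */
int sendto_two_peers(void)
{
    struct sockaddr_in self, peer[2];
    const char *msg[2] = { "to peer 0", "to peer 1" };
    char buf[32];
    int delivered = 0;
    int tx = bound_socket(&self);
    int rx[2] = { bound_socket(&peer[0]), bound_socket(&peer[1]) };

    if (tx < 0 || rx[0] < 0 || rx[1] < 0) {
        delivered = -1;
    } else {
        for (int i = 0; i < 2; i++) {
            /* The destination is given per call; there is no connect(). */
            if (sendto(tx, msg[i], strlen(msg[i]), 0,
                       (struct sockaddr *)&peer[i], sizeof(peer[i])) < 0) {
                delivered = -1;
                break;
            }
            ssize_t n = recv(rx[i], buf, sizeof(buf), 0);
            if (n == (ssize_t)strlen(msg[i]) &&
                memcmp(buf, msg[i], (size_t)n) == 0)
                delivered++;
        }
    }

    if (tx >= 0)
        close(tx);
    for (int i = 0; i < 2; i++)
        if (rx[i] >= 0)
            close(rx[i]);
    return delivered;
}
```

Calling sendto_two_peers() from a small main() should report both peers
reached through one sending socket. With a connected transport, the same
traffic would need one socket per peer.
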
> > Is that a safe thing to do? I found the spec a little unclear on
> > the ordering rules. It *seems* that RDMA writes are always fencing
> > against subsequent operations, and RDMA reads will fence if we ask
> > for it. But I'm not perfectly sure whether the ordering applies
> > to the sending system only, or if IB also guarantees that the
> > RDMA will have completed when it puts the incoming message on
> > the completion queue at the consumer.
>
> I believe this is safe. I can't point to chapter and verse in the
> spec, but operations are supposed to complete in order, so I don't
> think that the receive completion can appear before earlier responder
> operations have completed.

Okay, thanks. Much appreciated,
Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
okir at lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax