[ewg] [PATCH 1/2] fmr_pool flush serials can get out of sync

Olaf Kirch olaf.kirch at oracle.com
Mon Jan 21 01:13:14 PST 2008


On Friday 18 January 2008 23:12, Roland Dreier wrote:
>  > The corruption happened when the process that allocated the MRs went
>  > away in the middle of the operation. We would free the MR and invalidate
>  > - and expect the in-flight RDMA to error out. RDS does not know who is
>  > doing RDMA to or from an MR at any given time.
> 
> OK, I see.  Of course this error will move your QP to the error state
> and cause other in-flight operations on behalf of other processes to
> fail and need to be reissued after you reconnect.  Seems like a bit of
> a mess but I don't see a way around it if you want to multiplex direct
> access operations to multiple different processes over the same QP.

Yes, and that's the whole point of RDS. Sockets are unconnected and you
address each message with sendto; otherwise we'd drown in sockets. I
will readily agree that this approach, while fast and simple, does get
us into a bit of a mess sometimes :-)
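
To illustrate the "unconnected socket plus sendto" pattern, here is a
minimal sketch. It uses UDP on loopback purely as a stand-in, since RDS
itself needs kernel support that may not be present; the helper names
(make_receiver, send_to_many, demo) are made up for this example and are
not part of RDS or of any patch in this thread.

```python
import socket


def make_receiver():
    """Bind a UDP socket to an ephemeral loopback port (stand-in for a peer)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("127.0.0.1", 0))
    s.settimeout(2.0)
    return s


def send_to_many(payloads):
    """Send each (address, data) pair from ONE unconnected socket."""
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for addr, data in payloads:
            # The destination is given per message; no connect() is needed.
            sender.sendto(data, addr)
    finally:
        sender.close()


def demo():
    """Deliver to two different peers through a single sending socket."""
    a, b = make_receiver(), make_receiver()
    try:
        send_to_many([(a.getsockname(), b"to A"), (b.getsockname(), b"to B")])
        return a.recvfrom(64)[0], b.recvfrom(64)[0]
    finally:
        a.close()
        b.close()
```

The sender holds a single socket no matter how many peers it talks to,
which is the property being traded against the shared-QP failure mode
discussed above.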

>  > Is that a safe thing to do? I found the spec a little unclear on
>  > the ordering rules. It *seems* that RDMA writes are always fencing
>  > against subsequent operations, and RDMA reads will fence if we ask
>  > for it. But I'm not perfectly sure whether the ordering applies
>  > to the sending system only, or if IB also guarantees that the
>  > RDMA will have completed when it puts the incoming message on
>  > the completion queue at the consumer.
> 
> I believe this is safe.  I can't point to chapter and verse in the
> spec, but operations are supposed to complete in order, so I don't
> think that the receive completion can appear before earlier responder
> operations have completed.

Okay, thanks. Much appreciated,
Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir at lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax


