[openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support

Michael S. Tsirkin mst at mellanox.co.il
Wed Nov 8 05:13:19 PST 2006


Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> Subject: Re: [openib-general] [RFC] [PATCH v2] rdma/ib_cm: fix APM support
> 
> Michael S. Tsirkin wrote:
> > Quoting Or Gerlitz <ogerlitz at voltaire.com>:
> >>> Protocols that rely on RC ACK for reliability guarantees (like SDP), basically
> >>> do not make it possible to address the hca failure case: you got an ACK, but
> >>> remote hca could have failed without committing data to memory. So APM failover
> >>> is a requirement for these. It could be iser does not need APM, fine.
> >> This is news to me. Does your HCA first send an ACK and only then do 
> >> the DMA transaction and, if needed, generate the CQE !?!?!?
> 
> > I can't tell either way, but why not?
> > Consider also that DMA write is a posted transaction - HCA gets no indication
> > when it was committed to memory, so it can not delay the ACK until this occurs.
> 
> OK, OK, I see now the IB spec piece below. It was me expecting somehow 
> too much from IB RC... rethinking this matter, I see now it's more 
> problematic to support this ack-following-dma-memory-write-success
> 
> 9.7.5.1.6 ACKNOWLEDGE MESSAGE SCHEDULING
> 
> For SEND or RDMA WRITE requests, an ACK may be scheduled before
> data is actually written into the responder's memory. The ACK simply 
> indicates that the data has successfully reached the fault domain of the 
> responding node. That is, the data has been received by the channel
> adapter and the channel adapter will write that data to the memory 
> system of the responding node, or the responding application will at 
> least be informed of the failure.
> 
> So anyway, what's your HCA behavior wrt this?

The behavior matches the spec. I can't give you extra guarantees.

> >> and how come APM is the solution to this crazy problem?
> 
> > If HCA failure is a crazy problem, then what is the sane problem APM does *not* solve?
> 
> You misunderstood me; the "crazy problem" was related to my 
> misconception of IB RC ACKs.
> 
> My question is: how does APM solve the problem with transactions whose 
> ACK was received but whose data was not written/committed to memory?

APM does not solve it - I am just saying the problem as formulated is not
solvable without protocol changes.
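
To make the gap concrete, here is a toy model, not real verbs code. The
Responder class, its method names, and the idea of an application-level ack
sent only after data is committed are all assumptions made for illustration;
they are not part of the IB spec or of any existing protocol. It only shows
why a transport ACK alone cannot tell the sender that data reached memory.

```python
class Responder:
    """Hypothetical responding node: an adapter buffer plus host memory."""

    def __init__(self):
        self.adapter_buffer = []  # data received by the channel adapter
        self.memory = []          # data actually committed to host memory
        self.failed = False

    def receive(self, data):
        """Accept data and return a transport-level ACK right away."""
        if self.failed:
            return False
        self.adapter_buffer.append(data)
        return True  # ACK: data reached the adapter, not yet memory

    def commit(self):
        """Write buffered data to memory and return what was committed."""
        if self.failed:
            return []
        committed, self.adapter_buffer = self.adapter_buffer, []
        self.memory.extend(committed)
        return committed  # basis for an assumed application-level ack

    def fail(self):
        """Adapter failure: anything still buffered is lost."""
        self.failed = True
        self.adapter_buffer = []


def ack_without_commit():
    """Sender sees a transport ACK, then the adapter fails before the write."""
    r = Responder()
    transport_ack = r.receive("block-1")
    r.fail()
    return transport_ack, "block-1" in r.memory


def ack_after_commit():
    """Sender waits for an (assumed) application-level ack sent after commit."""
    r = Responder()
    r.receive("block-1")
    app_ack = "block-1" in r.commit()
    r.fail()  # a later failure no longer loses the data
    return app_ack, "block-1" in r.memory
```

In the first case the sender holds an ACK for data that never reached memory;
in the second, the ack is only produced after the commit, which is the kind of
protocol change meant above.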

So all we can solve for a generic RC protocol is port/switch failure, and APM
solves this elegantly and transparently.

-- 
MST



