[ofa-general] post_recv question

Tom Tucker tom at opengridcomputing.com
Thu Feb 21 17:17:14 PST 2008


On Thu, 2008-02-21 at 15:48 -0800, Caitlin Bestler wrote:
> Good example, more detailed comments in-line.
> 
> On Thu, Feb 21, 2008 at 2:47 PM, Tom Tucker <tom at opengridcomputing.com> wrote:
> >
> >  On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote:
> >  > > OpenMPI can be configured to send credit updates over different QP. I'll
> >  >  > try to stress it next week to see what happens.
> >  >
> >  > It seems that it would be pretty hard to hit this race in practice.
> >
> >  > And I don't think mem-free Mellanox hardware has any race -- not
> >  > positive about Tavor/non-mem-free Arbel.  (On IB you need to set RNR
> >  > retries to 0 also for the missing receive to be detectable even if the
> >  > race exists)
> >
> >  Well....consider the case of two adapters on two different PCI buses.
> >  One is busy one is not. Specifically, the post_recv QP is on an HCA on a
> >  busy bus, the post_send (of the credit) is on a QP on an HCA on a
> >  dedicated bus.
> >
> >  I think we can assume that the ringing of the doorbell is synchronous,
> >  i.e. when the processor completes its write, the card knows there are
> >  RQ WQE available in host memory, but whether or not and when the WQE is
> >  fetched relative to the processor is asynchronous. The card will have to
> >  get on the bus again and read host memory. Meanwhile the processor runs
> >  off and posts a send of the credit on the other QP on a different HCA.
> >  The peer responds, with a send to the "data qp". The receiving adapter
> >  knows the WQE is there, but it may not have fetched it yet.
> >
> >  The crux of the question is whether or not the adapter MUST fetch the
> >  WQE and place the packet, or can it simply drop it. If you say it MUST,
> >  then you must have enough buffer to handle worst case delayed placement.
> >  If the post guarantee is only within the same QP or affiliated QP (SRQ),
> >  then all it must do is ensure that when processing a SQ request AND the
> >  associated RQ (SRQ) is empty, that it must fetch outstanding, unread RQ
> >  WQE prior to processing the SQ WQE. This allows for the post_recv
> >  guarantees without the HCA buffering requirements.
> >
> 
> I disagree. What is required is the adapter MUST NOT take an action based
> on a "buffer not available" diagnosis until it is certain that it has considered
> all WQEs that have been successfully posted by the consumer.
> 

Ok. So what does the HW do with the packet while it's pondering its
options? It has to put it somewhere. That's my point. You either
guarantee that any advertisement of availability can't be issued prior
to the buffer being available, or the buffer is synchronously available
prior to the advertisement of the credit. Snooping the [s]RQ while
processing SQ is a way of delaying the issuance of a credit until the
buffer (spec'd in the WQE) is actually known to the adapter. But this
only works in the context of a single HCA.
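
The snooping idea can be sketched as a toy Python model. This is not verbs
code and not a description of any real adapter: the class ToyHca, its
fields, and the rule "fetch unread RQ WQEs before processing a SQ WQE when
no fetched WQE is on hand" are all assumptions made up for illustration.

```python
from collections import deque


class ToyHca:
    """Toy model of one adapter (hypothetical, for illustration only)."""

    def __init__(self):
        self.host_rq = deque()     # RQ WQEs in host memory, not yet read by the card
        self.fetched_rq = deque()  # RQ WQEs the card has actually read
        self.sent = []             # (payload, RQ WQEs known to the card at send time)

    def post_recv(self, buf):
        # The doorbell tells the card a WQE exists; the fetch itself is deferred.
        self.host_rq.append(buf)

    def fetch_pending_rq(self):
        # The card reads every outstanding RQ WQE from host memory.
        while self.host_rq:
            self.fetched_rq.append(self.host_rq.popleft())

    def post_send(self, payload):
        # Snoop: if the card holds no fetched RQ WQE, read the unread ones first.
        if not self.fetched_rq:
            self.fetch_pending_rq()
        self.sent.append((payload, len(self.fetched_rq)))


hca = ToyHca()
hca.post_recv("buf0")
hca.post_send("credit+1")
print(hca.sent[0])  # ('credit+1', 1)
```

In this model the credit cannot leave the adapter before the buffer it
advertises is known to that same adapter. Nothing in it helps if the credit
is sent through a second ToyHca instance.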

> Further, it MUST NOT require a further action by the consumer to guarantee
> that it notices a posted WQE. 

Agreed. 

> Particularly in iWARP the application layer
> is free to implement Send/Recv credits by *any* mechanism desired (the
> only requirement is that there is one, you might recall that there were
> extensive discussions on this point regarding unsolicited messages for
> iSER). The concept that the application MUST provide SOME form of
> flow control was accepted only grudgingly. So clearly any more specific
> mechanisms were not the intent of the drafters.

Yes, but I'm not sure there's any confusion there -- I think this
discussion is about "how credits can be issued". In particular what does
it mean to issue a credit for:
- this QP,
- another QP on the same HCA, or
- another QP on a different HCA?

So far, it seems the consensus is that "all of the above" should work.
I'm just not convinced the current implementations guarantee this.
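
To make the doubt concrete, here is a toy Python model of the three cases.
It assumes a hypothetical adapter that (a) fetches unread RQ WQEs whenever
it processes any SQ WQE on a QP sharing the receive queue, and (b) drops an
incoming message if no RQ WQE has been fetched yet. Both behaviours, and
the names ToyAdapter, send_credit, and receive_eager_drop, are illustrative
assumptions, not claims about any shipping hardware.

```python
class ToyAdapter:
    """Hypothetical adapter with lazy RQ fetch (illustration only)."""

    def __init__(self):
        self.unread = 0  # RQ WQEs posted (doorbell rung) but not yet fetched
        self.ready = 0   # RQ WQEs the card has fetched

    def post_recv(self):
        self.unread += 1

    def send_credit(self):
        # Assumed snooping: processing a SQ WQE first fetches unread RQ WQEs
        # on this adapter. It has no effect on any other adapter.
        self.ready += self.unread
        self.unread = 0

    def receive_eager_drop(self):
        # Assumed policy: drop unless a WQE has already been fetched.
        if self.ready == 0:
            return "dropped"
        self.ready -= 1
        return "placed"


def run(credit_path):
    data_hca = ToyAdapter()   # adapter that owns the "data QP"
    other_hca = ToyAdapter()  # second adapter, used only to send the credit
    data_hca.post_recv()
    sender = other_hca if credit_path == "different HCA" else data_hca
    sender.send_credit()
    return data_hca.receive_eager_drop()


paths = ("same QP", "other QP, same HCA", "different HCA")
results = {path: run(path) for path in paths}
for path in paths:
    print(path, "->", results[path])
```

Under these assumptions only the third case loses the message, which is the
case where a per-adapter snoop cannot help.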

> 
> So if there are still 1000 Recv WQEs in the SRQ we can allow the adapter
> a great deal of flexibility in when the 1001st is linked into the data
> structures.
> The only real constraint is that it MUST do 1001 successful allocations
> *before* it triggers any sort of "buffer not available" error.
> 

agreed.
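
A minimal sketch of that constraint as a toy Python model: the adapter may
fetch lazily, but it has to catch up with everything posted before it
reports "buffer not available". The class LazyRq and its counters are
hypothetical names for illustration, not an API from any spec or driver.

```python
class LazyRq:
    """Toy receive queue with lazy fetch (hypothetical, illustration only)."""

    def __init__(self):
        self.posted = 0    # WQEs the consumer posted and got SUCCESS for
        self.fetched = 0   # WQEs the adapter has linked into its own structures
        self.consumed = 0  # WQEs already used for incoming messages

    def post_recv(self):
        self.posted += 1
        return "SUCCESS"

    def lazy_fetch(self, n):
        # The adapter may fetch whenever it likes, up to what was posted.
        self.fetched = min(self.posted, self.fetched + n)

    def on_message(self):
        if self.consumed == self.fetched:
            # Out of fetched WQEs: catch up with everything posted first.
            self.fetched = self.posted
        if self.consumed == self.fetched:
            return "buffer not available"
        self.consumed += 1
        return "placed"


rq = LazyRq()
for _ in range(1001):
    rq.post_recv()
rq.lazy_fetch(1000)  # the 1001st WQE is still unread when traffic starts
outcomes = [rq.on_message() for _ in range(1002)]
print(outcomes.count("placed"), outcomes[-1])  # 1001 buffer not available
```

The error shows up only on the 1002nd message, after all 1001 posted WQEs
have been used.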

> I'm not recalling the specific language immediately, but I do recall concluding
> that sub-dividing the SRQ on an RSS-like basis was *not* compliant with
> the RDMAC specs and that the left-half of the adapter could not declare
> "buffer not found" while the right-half of the adapter still had a free buffer.

agreed.

> This is of course a major pain if you are trying to team two RDMA adapters
> to form a single virtual adapter, or even two largely independent ports on
> the same physical adapter. But the intent of the specifications is very
> clear: if the consumer has posted 1000 recv WQEs and gotten "SUCCESS"
> to each of them, then the adapter MUST allocate all 1000 recv WQEs
> *before* it can fail an operation because no buffer was available.
> 

agreed.

> So there is a difference between "must be pushed to the adapter now"
> and "must be pushed to the adapter before it is too late".

yes. 


Tom







