[ofa-general] post_recv question

Caitlin Bestler caitlin.bestler at neterion.com
Thu Feb 21 15:48:33 PST 2008


Good example, more detailed comments in-line.

On Thu, Feb 21, 2008 at 2:47 PM, Tom Tucker <tom at opengridcomputing.com> wrote:
>
>  On Thu, 2008-02-21 at 12:22 -0800, Roland Dreier wrote:
>  > > OpenMPI can be configured to send credit updates over different QP. I'll
>  >  > try to stress it next week to see what happens.
>  >
>  > It seems that it would be pretty hard to hit this race in practice.
>
>  > And I don't think mem-free Mellanox hardware has any race -- not
>  > positive about Tavor/non-mem-free Arbel.  (On IB you need to set RNR
>  > retries to 0 also for the missing receive to be detectable even if the
>  > race exists)
>
>  Well... consider the case of two adapters on two different PCI buses.
>  One is busy, one is not. Specifically, the post_recv QP is on an HCA on a
>  busy bus, the post_send (of the credit) is on a QP on an HCA on a
>  dedicated bus.
>
>  I think we can assume that the ringing of the doorbell is synchronous,
>  i.e. when the processor completes its write, the card knows there are
>  RQ WQEs available in host memory, but whether and when the WQE is
>  fetched is asynchronous relative to the processor. The card will have to
>  get on the bus again and read host memory. Meanwhile the processor runs
>  off and posts a send of the credit on the other QP on a different HCA.
>  The peer responds with a send to the "data qp". The receiving adapter
>  knows the WQE is there, but it may not have fetched it yet.
>
>  The crux of the question is whether the adapter MUST fetch the
>  WQE and place the packet, or can simply drop it. If you say it MUST,
>  then you must have enough buffer to handle worst-case delayed placement.
>  If the post guarantee is only within the same QP or affiliated QP (SRQ),
>  then all it must do is ensure that when it processes an SQ request AND the
>  associated RQ (SRQ) is empty, it fetches outstanding, unread RQ
>  WQEs prior to processing the SQ WQE. This allows for the post_recv
>  guarantees without the HCA buffering requirements.
>

I disagree. What is required is that the adapter MUST NOT take an action based
on a "buffer not available" diagnosis until it is certain that it has considered
all WQEs that have been successfully posted by the consumer.
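Tom's two-adapter race, and the rule just stated, can be put together in a small timing model (illustrative Python, not the verbs API; every name here is invented for the sketch):

```python
# Model of one data QP: the doorbell write is synchronous (the adapter
# knows a WQE was posted), but the fetch from host memory is lazy.
class DataQP:
    def __init__(self):
        self.posted = 0    # successfully posted by the consumer
        self.fetched = 0   # actually read from host memory by the adapter

    def post_recv(self):
        self.posted += 1   # doorbell completes: the WQE's existence is known

    def incoming_send(self):
        # A packet may arrive before the WQE fetch completes.  The adapter
        # must first consider every successfully posted WQE; only then may
        # it diagnose "buffer not available".
        if self.fetched == 0 and self.posted > 0:
            self.fetched, self.posted = self.posted, 0   # catch-up fetch
        if self.fetched == 0:
            raise RuntimeError("buffer not available")
        self.fetched -= 1

qp = DataQP()
qp.post_recv()        # (1) recv posted on the HCA on the busy bus
credit_sent = True    # (2) credit goes out on the other, idle HCA
qp.incoming_send()    # (3) peer's send arrives before the fetch; not dropped
```

The point is step (3): dropping the packet there, or raising the error, would be acting on a "buffer not available" diagnosis before considering a WQE the consumer had already posted successfully.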

Further, it MUST NOT require any further action by the consumer to guarantee
that it notices a posted WQE. In iWARP particularly, the application layer
is free to implement Send/Recv credits by *any* mechanism it desires (the
only requirement is that there is one; you might recall that there were
extensive discussions on this point regarding unsolicited messages for
iSER). The concept that the application MUST provide SOME form of
flow control was accepted only grudgingly. So clearly any more specific
mechanism was not the intent of the drafters.

So if there are still 1000 Recv WQEs in the SRQ, we can allow the adapter
a great deal of flexibility in when the 1001st is linked into its data
structures.
The only real constraint is that it MUST complete all 1001 allocations
*before* it triggers any sort of "buffer not available" error.
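A minimal sketch of that constraint (assumed names, illustrative Python, not a real SRQ implementation):

```python
class SRQ:
    """The adapter may link posted WQEs into its structures lazily, but
    must drain everything posted before declaring a buffer shortage."""
    def __init__(self):
        self.linked = []    # WQEs already in the adapter's data structures
        self.pending = []   # successfully posted, not yet linked

    def post_recv(self, wqe):
        self.pending.append(wqe)    # consumer sees SUCCESS immediately
        return "SUCCESS"

    def allocate(self):
        if not self.linked:
            # The flexibility ends here: link every outstanding post
            # before any "buffer not available" error is possible.
            self.linked, self.pending = self.pending, []
        if not self.linked:
            raise RuntimeError("buffer not available")
        return self.linked.pop(0)

srq = SRQ()
assert all(srq.post_recv(i) == "SUCCESS" for i in range(1001))
got = [srq.allocate() for _ in range(1001)]   # all 1001 must succeed
```

When the WQEs are actually linked is the adapter's business; that all 1001 allocations succeed before any error is the consumer's guarantee.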

I'm not recalling the specific language immediately, but I do recall concluding
that sub-dividing the SRQ on an RSS-like basis was *not* compliant with
the RDMAC specs and that the left-half of the adapter could not declare
"buffer not found" while the right-half of the adapter still had a free buffer.
This is of course a major pain if you are trying to team two RDMA adapters
to form a single virtual adapter, or even two largely independent ports on
the same physical adapter. But the intent of the specifications is very
clear: if the consumer has posted 1000 recv WQEs and gotten "SUCCESS"
to each of them, then the adapter MUST allocate all 1000 recv WQEs
*before* it can fail an operation because no buffer was available.
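The compliant and non-compliant behaviors can be contrasted in a toy model (illustrative Python; the "halves" and the pool are assumptions of the sketch, not anything from the specs):

```python
class SplitSRQ:
    """Non-compliant: each half of the teamed adapter sees only its slice."""
    def __init__(self, wqes):
        half = len(wqes) // 2
        self.left, self.right = list(wqes[:half]), list(wqes[half:])

    def alloc_left(self):
        if not self.left:   # the right half may still hold free buffers!
            raise RuntimeError("buffer not found")
        return self.left.pop()

class SharedSRQ:
    """Compliant: every port allocates from the one pool the consumer posted to."""
    def __init__(self, wqes):
        self.pool = list(wqes)

    def alloc(self):
        if not self.pool:
            raise RuntimeError("buffer not found")
        return self.pool.pop()

split = SplitSRQ(range(4))
split.alloc_left(); split.alloc_left()
try:
    split.alloc_left()          # fails while 2 posted buffers sit unused
    premature_failure = False
except RuntimeError:
    premature_failure = True

shared = SharedSRQ(range(4))
for _ in range(4):
    shared.alloc()              # all 4 posted buffers are reachable
```

The split version declares "buffer not found" after only two allocations even though the consumer got SUCCESS for four posts, which is exactly the behavior the specs rule out.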

So there is a difference between "must be pushed to the adapter now"
and "must be pushed to the adapter before it is too late".


