[ofa-general] post_recv question
Steve Wise
swise at opengridcomputing.com
Thu Feb 21 11:32:46 PST 2008
Ralph Campbell wrote:
> On Thu, 2008-02-21 at 10:09 -0800, Shirley Ma wrote:
>> Hello Steve,
>>
>>> Here is the timing sequence:
>>>
>>> t0: app calls post_recv
>>> t1: post_recv code builds a hw-specific WR in the hw work queue
>>> t2: post_recv code rings a doorbell (write to adapter mem or register)
>>> t3: post_recv returns
>>> t4: <app assumes the buffer is ready>
>
> This is wrong. The HCA has control of the receive buffer
> until poll_cq() returns a CQE saying the posted buffer
> is completed (either OK or error).
> Think about it. The application can do a post_recv() and
> it could be days or nanoseconds before a packet is sent to
> that buffer. The application can't assume anything about
> the contents until the HCA says something is there.
>
> Oh, I see. You are saying the application thinks the buffer
> is available for the HCA to use.
>
>>> t5: device HW dma engine moves the WR to adapter memory
>>> t6: device FW prepares the HW RQ entry making the buffer available.
>>>
>>> Note at time t4, the application thinks it's ready, but it's really not
>>> ready until t6.
>>> This clearly is an implementation-specific issue. But I was under the
>>> assumption that all the RDMA HW behaves this way. Maybe not?
>
> Not all hardware works the same. You can't make assumptions
> beyond what the library API guarantees without building
> hardware specific dependencies into your program.
>
I'm asking this from a device driver developer's perspective. I'm not
writing an application. I'm trying to understand and define exactly
what must be guaranteed by the device/driver upon returning from
post_recv().
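
To make the question concrete, here is a toy Python model of the t0..t6
sequence quoted above. It is only a sketch, not driver or verbs code: the
class ToyRQ and its methods fw_prepare() and incoming_send() are names made
up for illustration, and the "no buffer ready" case is modelled as an RNR NAK
because that is the symptom mentioned in this thread.

```python
from collections import deque


class ToyRQ:
    """Toy model of one receive queue; not a real verbs or driver API."""

    def __init__(self):
        self.hw_wq = deque()   # WRs built in the host work queue (t1)
        self.ready = deque()   # RQ entries the adapter can actually use (t6)

    def post_recv(self, wr_id):
        # t1: build the hw-specific WR; t2: ring the doorbell; t3: return.
        # Nothing here waits for the adapter, so the WR is not yet usable.
        self.hw_wq.append(wr_id)

    def fw_prepare(self):
        # t5/t6: DMA engine and firmware move pending WRs into the HW RQ.
        while self.hw_wq:
            self.ready.append(self.hw_wq.popleft())

    def incoming_send(self):
        # A SEND arriving from the peer consumes a ready entry, or is
        # refused (modelled as an RNR NAK) if none has been made ready yet.
        if self.ready:
            return ("RECV_OK", self.ready.popleft())
        return ("RNR_NAK", None)
```

In this model, a peer SEND that lands between post_recv() returning and
fw_prepare() running is refused even though the app has already posted the
buffer. That window is what I'm asking about: is the driver required to
close it before post_recv() returns?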
> It can even change between different versions of microcode or
> kernel software for the same HCA.
>
>>> To further complicate things, this race condition is never seen _if_ the
>>> application uses the same QP to advertise (send a credit allowing the
>>> peer to SEND) the RECV buffer availability. So if the app posts a SEND
>>> after the RECV is posted and that SEND allows the peer access to the
>>> RECV buffer, then everything is ok. This is due to the fact that the
>>> FW/HW will process the SEND only after processing the RECV. If the app
>>> uses a different QP to post the SEND advertising the RECV, then the race
>>> condition exists allowing the peer to SEND into that RECV buffer before
>>> the HW makes it ready.
>
> Well, there is no guarantee that the HCA processes the post_recv()
> before the post_send() even on the same QP. Send and receive are
> unordered with respect to each other. The fact that it works is
> an HCA specific implementation artifact.
>
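The same-QP versus different-QP case can be sketched with another toy
model. It assumes the design I described, where the adapter handles each
QP's posted work items strictly in posting order. As Ralph points out, that
ordering is an implementation artifact and not something the API guarantees.
The helpers run_device() and credit_before_buffer() are hypothetical names
for this sketch only.

```python
from collections import deque


def run_device(queues, service_order):
    """Toy adapter: each QP is a FIFO of ("RECV" | "SEND", tag) work items.

    service_order lists which QP the adapter services at each step; that
    order is an assumption of this sketch, not something verbs guarantees.
    Returns the events in the order the adapter processed them.
    """
    pending = {qp: deque(items) for qp, items in queues.items()}
    events = []
    for qp in service_order:
        if pending[qp]:
            kind, tag = pending[qp].popleft()
            events.append((qp, kind, tag))
    return events


def credit_before_buffer(events):
    """True if the SEND advertising the buffer ran before the RECV."""
    kinds = [kind for _, kind, _ in events]
    return kinds.index("SEND") < kinds.index("RECV")


# Same QP: the credit SEND is queued behind the RECV, so it cannot pass it.
same_qp = run_device(
    {"qp1": [("RECV", "buf"), ("SEND", "credit")]},
    ["qp1", "qp1"],
)

# Different QPs: if the adapter happens to service qp2 first, the credit
# reaches the peer before the RECV buffer has been made ready.
two_qps = run_device(
    {"qp1": [("RECV", "buf")], "qp2": [("SEND", "credit")]},
    ["qp2", "qp1"],
)
```

Here credit_before_buffer(same_qp) is False and credit_before_buffer(two_qps)
is True, which is the race in the two-QP case.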
>>> This all assumes a specific design of rdma hw. Maybe nobody else has
>>> this issue?
>>>
>>> Maybe I'm not making sense. :)
>> I think your description here matches the RNR problem Ralph found in IPoIB-CM.
>>
>> Ralph,
>>
>> Does this make sense?
>>
>> Thanks
>> Shirley
>
> I think you are making sense. There is an indeterminate race
> between post_recv() returning to the application and when
> a packet being received by the HCA might be able to use
> that buffer. There are no ordering guarantees
> between messages sent on one QP and another so the application
> can't easily use a different QP to advertise posted buffers (credits).
> That is why the IB RC protocol does this for you in band if the RC QP
> is using a dedicated receive queue but not a shared receive queue.
>
Do you mean the IB RC protocol advertises credits as part of the
transport protocol?
> The problem with shared receive queues is that the application
> would have to pick an endpoint and tell it there is a buffer
> available for the endpoint to send to. Obviously, if you have
> two endpoints, they both can't send to the same receive buffer.
>
> ib_ipoib uses shared receive queues and doesn't try to manage
> posted buffer credits so the RNR NAK issue isn't the same
> as what Steve is trying to do.
>