[ofa-general] post_recv question

Thu Feb 21 11:10:24 PST 2008

On Thu, 2008-02-21 at 10:09 -0800, Shirley Ma wrote:
> Hello Steve,
> 
> > Here is the timing sequence:
> > 
> > t0: app calls post_recv
> > t1: post_recv code builds a hw-specific WR in the hw work queue
> > t2: post_recv code rings a doorbell (write to adapter mem or register)
> > t3: post_recv returns
> > t4: <app assumes the buffer is ready>

This is wrong. The HCA has control of the receive buffer
until poll_cq() returns a CQE saying the posted buffer
is completed (either OK or error).
Think about it. The application can do a post_recv() and
it could be days or nanoseconds before a packet is sent to
that buffer. The application can't assume anything about
the contents until the HCA says something is there.

Oh, I see. You are saying the application thinks the buffer
is available for the HCA to use.
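
The ownership rule above can be sketched in C. This is a toy model,
not libibverbs code: the struct and the model_* functions are
hypothetical names made up for illustration. In a real verbs program
the corresponding calls would be ibv_post_recv() and ibv_poll_cq().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are hypothetical, not libibverbs types. */
enum buf_owner { OWNER_APP, OWNER_HCA };

struct model_buf {
    enum buf_owner owner;   /* who may touch the memory right now */
    bool cqe_pending;       /* HCA is done, but the CQE is not yet polled */
};

/* Models post_recv(): ownership passes to the HCA immediately. */
void model_post_recv(struct model_buf *b)
{
    assert(b->owner == OWNER_APP);
    b->owner = OWNER_HCA;
    b->cqe_pending = false;
}

/* Models the HCA finishing with the buffer (OK or error) at some
 * unknown later time. The HCA still owns it until the CQE is polled. */
void model_hca_complete(struct model_buf *b)
{
    if (b->owner == OWNER_HCA)
        b->cqe_pending = true;
}

/* Models poll_cq(): returns 1 and hands the buffer back to the
 * application if a CQE is waiting, otherwise returns 0. */
int model_poll_cq(struct model_buf *b)
{
    if (b->owner != OWNER_HCA || !b->cqe_pending)
        return 0;
    b->cqe_pending = false;
    b->owner = OWNER_APP;
    return 1;
}

/* The application may look at the contents only when it owns the buffer. */
bool model_app_may_read(const struct model_buf *b)
{
    return b->owner == OWNER_APP;
}
```

In this model the buffer stays off limits after model_post_recv(), and
even after the HCA has finished with it, until model_poll_cq() returns 1.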

> > t5: device HW dma engine moves the WR to adapter memory
> > t6: device FW prepares the HW RQ entry making the buffer available.
> > 
> > Note at time t4, the application thinks it's ready, but it's really
> > not ready until t6.
> > This clearly is an implementation-specific issue.  But I was under
> > the assumption that all the RDMA HW behaves this way.  Maybe not?

Not all hardware works the same.  You can't make assumptions
beyond what the library API guarantees without building
hardware-specific dependencies into your program.
It can even change between different versions of microcode or
kernel software for the same HCA.

> > To further complicate things, this race condition is never seen _if_
> > the application uses the same QP to advertise (send a credit allowing
> > the peer to SEND) the RECV buffer availability.  So if the app posts a
> > SEND after the RECV is posted and that SEND allows the peer access to
> > the RECV buffer, then everything is ok.  This is due to the fact that
> > the FW/HW will process the SEND only after processing the RECV.  If the
> > app uses a different QP to post the SEND advertising the RECV, then the
> > race condition exists allowing the peer to SEND into that RECV buffer
> > before the HW makes it ready.

Well, there is no guarantee that the HCA processes the post_recv()
before the post_send(), even on the same QP. Send and receive are
unordered with respect to each other. The fact that it works is
an HCA-specific implementation artifact.
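
The race Steve describes can be sketched as a toy model in C. This
assumes the adapter design from his t0..t6 timeline; the struct and
the rq_* functions are hypothetical names, not part of any verbs
library, and real hardware may behave differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one receive queue. All names here are hypothetical. */
struct model_rq {
    int pending; /* posted by the app, not yet visible to the adapter */
    int ready;   /* visible to the adapter and not yet consumed */
};

/* t0..t3: build the WR, ring the doorbell, return to the caller. */
void rq_post_recv(struct model_rq *rq)
{
    rq->pending++;
}

/* t5..t6: the adapter pulls in the WRs and makes the buffers usable. */
void rq_hw_catch_up(struct model_rq *rq)
{
    assert(rq->pending >= 0 && rq->ready >= 0);
    rq->ready += rq->pending;
    rq->pending = 0;
}

/* A SEND arrives from the peer. Returns false to model an RNR NAK. */
bool rq_incoming_send(struct model_rq *rq)
{
    if (rq->ready == 0)
        return false;
    rq->ready--;
    return true;
}
```

If a credit reaches the peer after rq_post_recv() returns but before
rq_hw_catch_up() runs, the peer's SEND finds no ready buffer and
rq_incoming_send() returns false. Nothing in the model depends on
which QP carried the credit; only the timing matters.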

> > This all assumes a specific design of rdma hw.  Maybe nobody else has
> > this issue?
> > 
> > Maybe I'm not making sense. :)
> 
> I think your descriptions here match the RNR issue Ralph found in IPoIB-CM.
> 
> Ralph,
> 
> Does this make sense?
> 
> Thanks
> Shirley

I think you are making sense.  There is a window of indeterminate
length between post_recv() returning to the application and the
point at which a packet arriving at the HCA can actually use
that buffer. There are no ordering guarantees
between messages sent on one QP and another, so the application
can't easily use a different QP to advertise posted buffers (credits).
That is why the IB RC protocol does this for you in band when the RC QP
uses a dedicated receive queue, but not when it uses a shared receive queue.

The problem with shared receive queues is that the application
would have to pick an endpoint and tell it there is a buffer
available for the endpoint to send to. Obviously, if you have
two endpoints, they can't both send to the same receive buffer.
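
As a rough illustration of that bookkeeping, here is a minimal sketch
in C of an application handing out credits against a shared pool.
The struct and srq_grant() are hypothetical names for this example
only; this is not how ib_ipoib or any verbs library works.

```c
#include <assert.h>

/* Toy model: a shared receive queue whose buffers serve several peers.
 * All names here are hypothetical. */
struct model_srq {
    int posted;      /* buffers currently posted to the shared queue */
    int credits_out; /* credits already promised to peers, in total */
};

/* Grant up to `want` credits to one peer and return how many were
 * actually granted. A buffer is never promised to two peers. */
int srq_grant(struct model_srq *s, int want)
{
    int avail = s->posted - s->credits_out;
    int n;

    assert(avail >= 0);
    n = want < avail ? want : avail;
    if (n < 0)
        n = 0;
    s->credits_out += n;
    return n;
}
```

With one posted buffer, the first endpoint that asks gets the credit
and the second gets none, which is the choice the application would
be forced to make.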

ib_ipoib uses shared receive queues and doesn't try to manage
posted buffer credits, so its RNR NAK issue isn't the same
as what Steve is trying to do.