[openib-general] Immediate data question
Michael Krause
krause at cup.hp.com
Thu Feb 15 09:42:37 PST 2007
At 09:37 PM 2/14/2007, Devesh Sharma wrote:
>On 2/14/07, Michael Krause <krause at cup.hp.com> wrote:
>>At 05:37 AM 2/13/2007, Devesh Sharma wrote:
>> >On 2/12/07, Devesh Sharma <devesh28 at gmail.com> wrote:
>> >>On 2/10/07, Tang, Changqing <changquing.tang at hp.com> wrote:
>> >> > > >
>> >> > > >Not for the receiver, but the sender will be severely slowed down by
>> >> > > >having to wait for the RNR timeouts.
>> >> > >
>> >> > > RNR = Receiver Not Ready, so by definition the data flow isn't
>> >> > > going to progress until the receiver is ready to receive data.
>> >> > > If a receive QP enters RNR for an RC, then it is likely not
>> >> > > progressing as desired. RNR was initially put in place to enable
>> >> > > a receiver to create back pressure on the sender without causing
>> >> > > a fatal error condition. It should rarely be entered and
>> >> > > therefore should have negligible impact on overall performance;
>> >> > > however, when an RNR occurs, no forward progress will occur, so
>> >> > > performance is essentially zero.
>> >> >
>> >> > Mike:
>> >> > I still do not quite understand this issue. I have two
>> >> > situations that have RNR triggered.
>> >> >
>> >> > 1. Process A and process B are connected with a QP. A first posts a
>> >> > send to B, and B does not post a receive. Then A and B do
>> >> > RDMA_WRITEs to each other for a long time; A and B just check memory
>> >> > for the RDMA_WRITE messages. Finally B will post a receive. Does the
>> >> > first pending send in A block all the later RDMA_WRITEs?
>> >>According to the IBTA spec, the HCA will process WR entries in the
>> >>strict order in which they are posted, so the send will block all WRs
>> >>posted after it. Even if the HCA has multiple processing elements, I
>> >>think the posting order will still be maintained by the HCA.
>> >> > If not, since RNR is triggered periodically till B posts a receive,
>> >> > does it affect the RDMA_WRITE performance between A and B?
>> >> >
>> >> > 2. Extend the above to three processes: A connects to B, and B
>> >> > connects to C, so B has two QPs but one CQ. A posts a send to B, and
>> >> > B does not post a receive,
>> >Post ordering across QPs is not guaranteed, hence the presence of the
>> >same CQ or different CQs will not affect anything.
>> >> > rather B and C are doing long-running RDMA_WRITEs or send/recv. But B
>> >If the RDMA WRITE is _on_ B, there is no effect on performance. If the
>> >RDMA WRITE is _on_ C, it _may_ affect the performance, since the load is
>> >on the same HCA. In the case of Send/Recv it again _may_ affect the
>> >performance, for the same reason.
>I am sorry, I had missed that in both cases the same DMA channel is in use.
>>
>>Seems orthogonal. Any time h/w is shared, multiple flows will have an
>>impact on one another. That is why we have the different arbitration
>>mechanisms to enable one to control that impact.
>Please, can you explain it more clearly?
Most I/O devices are shared by multiple applications / kernel
subsystems. Hence, the device acts as a serialization point for what goes
on the wire / link. Sharing means resource contention, and in order to add
any structure to that contention, a number of technologies provide
arbitration options. In the case of IB, the arbitration is confined to VL
arbitration, where a given data flow is assigned to a VL and that VL is
serviced at some particular rate. A number of years ago I wrote up how one
might also provide QP arbitration (not part of the IBTA specifications),
and I understand some implementations have incorporated that or a
variation of those mechanisms into their products.
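
To make the VL arbitration point concrete: an application does not pick a
VL directly; it picks a service level (SL) when connecting the QP, and the
subnet manager's SL-to-VL mapping plus the VL arbitration tables determine
the rate at which that flow is serviced. Below is a minimal libibverbs
sketch of the RTR transition where the SL is chosen; the SL value and the
connection parameters (remote_lid, remote_qpn, remote_psn) are
hypothetical placeholders that real code would exchange out of band.

/*
 * Sketch: assign an RC QP's traffic to service level 2 at the
 * INIT->RTR transition.  The SM's SL-to-VL mapping and VL arbitration
 * tables then decide how the flow is serviced.  remote_lid, remote_qpn
 * and remote_psn are hypothetical values exchanged out of band.
 */
#include <infiniband/verbs.h>

static int move_to_rtr(struct ibv_qp *qp, uint16_t remote_lid,
                       uint32_t remote_qpn, uint32_t remote_psn)
{
        struct ibv_qp_attr attr = {
                .qp_state           = IBV_QPS_RTR,
                .path_mtu           = IBV_MTU_1024,
                .dest_qp_num        = remote_qpn,
                .rq_psn             = remote_psn,
                .max_dest_rd_atomic = 1,
                .min_rnr_timer      = 12,          /* ~0.64 ms RNR timer */
                .ah_attr = {
                        .dlid     = remote_lid,
                        .sl       = 2,             /* SL 2 -> some VL */
                        .port_num = 1,
                },
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC |
                             IBV_QP_MIN_RNR_TIMER);
}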
In addition to IB link contention, there is also PCI link / bus
contention. For PCIe, given that most designs did not want to spend
resources on multiple VCs, there really isn't any standard arbitration
mechanism. However, many devices, especially a device like an HCA or an
RNIC, already have the concept of separate resource domains, e.g. QPs,
and they provide a mechanism to control how a QP's DMA requests or
interrupt requests are scheduled onto the PCIe link.
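
Coming back to the strict-ordering point raised earlier in the thread:
WRs posted to a single QP's send queue execute in posted order, so a send
that is being RNR NAKed holds up any RDMA_WRITE posted behind it on that
QP, while WRs on B's other QP are unaffected. A rough sketch, assuming an
already-connected RC QP and a registered MR; buf, len, mr, remote_addr
and rkey are hypothetical:

/*
 * Sketch: two WRs posted to one QP execute in posted order.  If the
 * SEND keeps getting RNR NAKed because the peer has no receive
 * posted, the RDMA_WRITE behind it on this QP will not execute.
 * buf, len, mr, remote_addr and rkey are assumed to exist.
 */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_send_then_write(struct ibv_qp *qp, void *buf, uint32_t len,
                                struct ibv_mr *mr, uint64_t remote_addr,
                                uint32_t rkey)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t) buf,
                .length = len,
                .lkey   = mr->lkey,
        };
        struct ibv_send_wr send_wr, write_wr, *bad_wr;

        memset(&send_wr, 0, sizeof(send_wr));
        send_wr.wr_id      = 1;
        send_wr.opcode     = IBV_WR_SEND;   /* needs a receive at the peer */
        send_wr.sg_list    = &sge;
        send_wr.num_sge    = 1;
        send_wr.send_flags = IBV_SEND_SIGNALED;

        memset(&write_wr, 0, sizeof(write_wr));
        write_wr.wr_id               = 2;
        write_wr.opcode              = IBV_WR_RDMA_WRITE; /* no receive needed */
        write_wr.sg_list             = &sge;
        write_wr.num_sge             = 1;
        write_wr.send_flags          = IBV_SEND_SIGNALED;
        write_wr.wr.rdma.remote_addr = remote_addr;
        write_wr.wr.rdma.rkey        = rkey;

        /* The write is linked after the send, so it is posted after it. */
        send_wr.next = &write_wr;
        return ibv_post_send(qp, &send_wr, &bad_wr);
}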
>> >> > must send RNR NAKs periodically to A, right? So does the pending
>> >> > message from A affect B's overall performance between B and C?
>> >But the RNR NAK state does not last very long... possibly you will not
>> >even be able to observe this performance hit. The moment the RNR retry
>> >counter expires, the connection will be broken!
>>
>>Keep in mind the timeout can be infinite. RNR NAKs are not expected to be
>>frequent, so their performance impact was considered reasonable.
>Thanks, I missed that.
It is a subtlety within the specification that is easy to miss.
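
For reference, the knobs involved are the QP attributes min_rnr_timer
(set at the RTR transition; how long the requester backs off after an RNR
NAK) and rnr_retry (set at the RTS transition; how many times it retries,
where the encoding 7 means retry forever, i.e. the infinite case). A
minimal sketch of the RTS transition; the timeout and retry values are
illustrative placeholders, not recommendations:

/*
 * Sketch: RTS transition with rnr_retry = 7, the encoding for
 * "retry forever" after RNR NAKs, i.e. the infinite case above.
 * The other values are illustrative placeholders.
 */
#include <infiniband/verbs.h>

static int move_to_rts(struct ibv_qp *qp, uint32_t my_psn)
{
        struct ibv_qp_attr attr = {
                .qp_state      = IBV_QPS_RTS,
                .timeout       = 14,   /* local ACK timeout, ~67 ms */
                .retry_cnt     = 7,    /* transport-error retries */
                .rnr_retry     = 7,    /* 7 == infinite RNR retries */
                .sq_psn        = my_psn,
                .max_rd_atomic = 1,
        };

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_TIMEOUT |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC);
}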
Mike