[ofa-general] Re: [RFH] IPoIB retransmission when sending multiple WR's to device

Krishna Kumar2 krkumar2 at in.ibm.com
Fri Aug 3 02:54:56 PDT 2007


Hi Roland,

(More results, not sure what all this adds up to)

I did one last experiment :-)

ipoib_start_xmit_frames()
{
      while (skbs) {
            process skb & put on tx_wr, tx_ring, etc.
            if (two skbs have been processed) {
                  send them out now;
                  if (more skbs remain in the batch list)
                        return MORE to IP;
            }
      }
}

qdisc_restart()
{
      ret = dev->hard_start_xmit_frames();
      switch (ret) {
      case OK, BUSY, LOCKED, etc.:
            original code;
            break;
      case MORE:
            ret = 1; /* call the driver again */
      }
}
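
To make the control flow explicit, here is a slightly fuller sketch of that
qdisc_restart() change in real C. NETDEV_TX_MORE is a value I made up for
this experiment and hard_start_xmit_frames is my new driver hook; everything
else follows the existing qdisc_restart() shape:

/* hypothetical return code: driver still holds queued skbs */
#define NETDEV_TX_MORE  2

static int qdisc_restart(struct net_device *dev)
{
      int ret;

      /* ... original dequeue and lock handling ... */

      ret = dev->hard_start_xmit_frames(dev);
      switch (ret) {
      case NETDEV_TX_OK:
      case NETDEV_TX_BUSY:
      case NETDEV_TX_LOCKED:
            /* original handling, unchanged */
            break;
      case NETDEV_TX_MORE:
            ret = 1;    /* non-zero tells qdisc_run() to call us again */
            break;
      }
      return ret;
}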

Even with this, I am getting a large number of retransmissions. I am sending
only two skbs and then returning to IP (resulting in one send with 2 WR's),
which is almost the same as what the original code does, except that it
sends one and returns (resulting in two sends with 1 WR each). Also, if I
change the 2 to 1, the retransmissions disappear. I also tried one more
change: queueing to tx_wr in reverse order, to see if the retransmissions
stop due to some weird reverse sending, but no.
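
For reference, by "one send with 2 WR's" I mean chaining the work requests
through their next pointers and handing the whole chain to a single
ib_post_send(). A minimal sketch (wr_id, sg_list, and opcode setup omitted;
priv->qp is IPoIB's send QP):

      struct ib_send_wr wr[2], *bad_wr;

      /* ... fill in wr_id, sg_list, num_sge, opcode for both skbs ... */

      wr[0].next = &wr[1];    /* chain the second WR behind the first */
      wr[1].next = NULL;

      /* one doorbell for both WRs instead of two ib_post_send() calls */
      if (ib_post_send(priv->qp, &wr[0], &bad_wr))
            ipoib_warn(priv, "ib_post_send failed\n");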

When I keep this maximum number of skbs to send at 2, I get fewer
retransmissions (50,000 for a 2-minute run) and 220KB/s; when it is
increased to 256, I get high retransmissions (200,000 for a 2-minute run)
and better BW (235KB/s).

Does this mean that sometimes multiple WR's are not getting sent out at all,
or that sometimes only one is (and the other goes out via retransmission)?

Note: the batching API is not called in most cases; it is called only when
the queue was stopped or the tx lock could not be taken and skbs have
accumulated in the qdisc queue.
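
In other words, the dispatch looks roughly like this (skb_blist is the
per-device batch list from my patch set; these names are from my patches,
not mainline):

      /* in qdisc_restart(): skbs reach the batch API only after they
       * have accumulated behind a stopped queue or a contended lock */
      if (dev->skb_blist && skb_queue_len(dev->skb_blist) > 1)
            ret = dev->hard_start_xmit_frames(dev);   /* batch path */
      else
            ret = dev->hard_start_xmit(skb, dev);     /* original path */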

Thanks,

- KK

__________________

Hi Roland,

I did one more test to check the out-of-order theory. I changed my new API
to be:

/* Original code, unmodified */
ipoib_start_xmit()
{
      original code
}

/* New xmit, identical to the original code but without taking the lock */
ipoib_start_xmit_nolock()
{
      original code but without getting lock
}

/* Batching API */
ipoib_start_xmit_batch()
{
      get_lock()
      while (skbs in queue) {
            ret = ipoib_start_xmit_nolock()
      }
      unlock()
}
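
Spelled out a little more, the batch entry point is just this (priv->tx_lock
is IPoIB's existing tx lock; ipoib_start_xmit_nolock() is the unlocked copy
described above):

static int ipoib_start_xmit_batch(struct sk_buff_head *blist,
                                  struct net_device *dev)
{
      struct ipoib_dev_priv *priv = netdev_priv(dev);
      struct sk_buff *skb;
      unsigned long flags;
      int ret = NETDEV_TX_OK;

      spin_lock_irqsave(&priv->tx_lock, flags);
      while ((skb = __skb_dequeue(blist)) != NULL) {
            /* same body as ipoib_start_xmit() minus the locking:
             * still one WR posted per skb */
            ret = ipoib_start_xmit_nolock(skb, dev);
      }
      spin_unlock_irqrestore(&priv->tx_lock, flags);

      return ret;
}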

This is, in effect, a fast send of multiple skbs while holding the lock once
for all of them, without going back to the IP stack. The only difference is
that this creates one WR per skb instead of multiple WR's per post. I got
virtually identical retransmissions (around 240 for the full run) compared
to the original code (225), while the multiple-WR code had >100,000
retransmissions.

I agree that sending multiple WR's is still faster than this, but this is
the second-best I could do, and still there is no increase in
retransmissions. The tx_queue_len on ib0 is 8192; recv/send sizes are both
4K. The receiver shows no errors/drops in either ifconfig or netstat -s.

Is there anything that can be concluded/suspected, or something else I could
try?

The latest run of the batching API gave me 719,679 retransmissions for a
16-minute test run of 16/64 threads (iperf), which comes to about 750
retransmissions per second, more than the retransmissions during the entire
run for the regular code!

thanks,

- KK

__________________

Hi Roland,

Roland Dreier <rdreier at cisco.com> wrote on 08/02/2007 09:59:23 PM:

>  > On the same topic that I wrote about earlier, I put debugs
>  > in my code to store all skbs in bufferA when enqueuing multiple
>  > skbs, and store all skbs to bufferB just before doing post.
>  > During post, I compare the two buffers to make sure that I am
>  > not posting in the wrong order, and that never happens.
>  >
>  > But I am getting a huge amount of retransmissions anyway,
>
> Why do you think the retransmissions are related to things being sent
> out of order?  Is it possible you're just sending much faster and
> overrunning the receiver's queue of posted receives?

I cannot be sure of that. But in the regular code too, batching is done in
qdisc_run() in a different sense: it sends out packets *iteratively*. In
that case, I see only 225 retransmissions for the entire run of all tests,
while in my code I see 100,000 or more. (I think I gave a wrong number
earlier; the right comparison is roughly 200 vs 100,000.)

Is there any way to avoid the situation you are talking about? I am already
setting recv_queue_size=4096 when loading ipoib (and so for mthca too).

Thanks,

- KK



