[ofa-general] Re: [PATCH 0/10 REV5] Implement skb batching and support in IPoIB/E1000

Krishna Kumar2 krkumar2 at in.ibm.com
Fri Sep 21 02:42:47 PDT 2007


Hi Or,

Sorry about the delay; I ran into various bugs and then the system was
frozen for about two days.

Or Gerlitz <ogerlitz at voltaire.com> wrote on 09/17/2007 03:22:58 PM:

> good, please test with rev5 and let us know.

I tested with rev5 and this is what I found (different from what I
said earlier about EHCA):

The original code had almost zero retransmissions (for a particular run):
1 for EHCA and 0 for MTHCA. With batching, both had high retransmission
counts: 73680 for EHCA and 70268 for MTHCA.

It seems I was wrong when I said EHCA had no issues. So far, E1000 is the
only driver for which I get identical retransmission numbers with and
without batching.

> transmission of 4K batched packets sounds like a real problem for the
> receiver side; with a 0.5K send/recv queue size, that's 8 batches of 512
> packets each, where for each RX there is a completion (WC) to process and
> an SKB to alloc and post to the QP, whereas for the TX there's only posting
> to the QP, processing one (?) WC and freeing 512 SKBs.

The receiver and sender both have 4K WRs. I had earlier changed batching
so that IPoIB sends at most 2 skbs even if more are present in the queue,
then sends 2 more after the first two, and so on. But that too gave high
retransmission numbers.
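
Roughly, the cap amounted to something like this (a simplified sketch only,
not the actual rev5 diff; ipoib_xmit_capped(), ipoib_send_one() and
IPOIB_XMIT_BATCH_MAX are illustrative names, not identifiers from the patch):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>

    /* Hypothetical helper that posts one send WR for one skb. */
    static int ipoib_send_one(struct net_device *dev, struct sk_buff *skb);

    #define IPOIB_XMIT_BATCH_MAX    2

    /*
     * Pull at most IPOIB_XMIT_BATCH_MAX skbs off the batch list per xmit
     * call instead of draining the whole list, so the receiver is not hit
     * with one huge burst of WRs.
     */
    static int ipoib_xmit_capped(struct net_device *dev,
                                 struct sk_buff_head *blist)
    {
            struct sk_buff *skb;
            int sent = 0;

            while (sent < IPOIB_XMIT_BATCH_MAX &&
                   (skb = __skb_dequeue(blist)) != NULL) {
                    if (ipoib_send_one(dev, skb))
                            break;
                    sent++;
            }
            return sent;
    }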

> If indeed the situation is so asymmetrical, I am starting to think that
> the CPU utilization at the sender side might be much higher with
> batching than without batching; have you looked into that?

Overall it is almost the same. I had used netperf (about a month back) and
it gave almost the same numbers; I haven't tried recently. Even in the
regular code, though batching is not done, qdisc_restart() does xmit in a
tight loop. The only difference is that dev->queue_lock is dropped and
re-acquired for each skb, and dev->tx_lock is held for shorter times.
Batching avoids the former; I have no control over the latter.
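
To make the locking difference concrete, the non-batching path looks roughly
like this (a heavily simplified sketch of the ~2.6.23 behaviour, with the
error/requeue/GSO handling omitted; xmit_loop_sketch() is just an
illustrative name):

    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <net/sch_generic.h>

    static void xmit_loop_sketch(struct net_device *dev)
    {
            struct sk_buff *skb;

            spin_lock(&dev->queue_lock);
            while (!netif_queue_stopped(dev) &&
                   (skb = dev->qdisc->dequeue(dev->qdisc)) != NULL) {
                    spin_unlock(&dev->queue_lock);  /* dropped per skb */
                    netif_tx_lock(dev);             /* held around driver call only */
                    dev->hard_start_xmit(skb, dev); /* one skb per call */
                    netif_tx_unlock(dev);
                    spin_lock(&dev->queue_lock);    /* re-acquired per skb */
            }
            spin_unlock(&dev->queue_lock);
    }

With batching, a whole list of skbs is instead moved to the batch list under
a single queue_lock hold and handed to the driver in one call, which avoids
the per-skb lock/unlock cycle; how long the driver holds the tx lock is up
to the driver either way.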

> I am not with you. Looking at 2.6.22 and 2.6.23-rc5, in both of them the
> ipoib NAPI mechanism is implemented through the function ipoib_poll, which
> is the polling API for the network stack, etc. So what is the old API, and
> where does this difference exist?

I meant the NAPI from before Stephen Hemminger's conversion. He changed the
old NAPI to the newer one (where the driver doesn't get *budget, etc).
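
The difference is essentially in the poll callback prototype. As a sketch
(the two styles belong to different kernel versions and cannot coexist;
old_style_poll(), new_style_poll() and do_rx_work() are illustrative names):

    #include <linux/kernel.h>
    #include <linux/netdevice.h>

    /* Hypothetical RX-processing helper, for illustration only. */
    static int do_rx_work(struct net_device *dev, int limit);

    /*
     * Old-style NAPI: the poll callback receives the budget by pointer,
     * decrements both *budget and dev->quota itself, and returns 1 if
     * more work is pending, 0 when done.
     */
    static int old_style_poll(struct net_device *dev, int *budget)
    {
            int limit = min(*budget, dev->quota);
            int done = do_rx_work(dev, limit);

            *budget -= done;
            dev->quota -= done;

            if (done < limit) {
                    netif_rx_complete(dev);
                    return 0;       /* all done */
            }
            return 1;               /* more work pending */
    }

    /*
     * New-style NAPI (after the conversion): the poll callback gets a
     * napi_struct and a plain int budget and just returns how much work
     * it did.
     */
    static int new_style_poll(struct napi_struct *napi, int budget)
    {
            int done = do_rx_work(napi->dev, budget);

            if (done < budget)
                    netif_rx_complete(napi->dev, napi);
            return done;
    }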

> You might want to try something lighter, such as an iperf UDP test, where
> a nice criterion would be to compare bandwidth AND packet loss between
> no-batching and batching. As for the MTU, the default is indeed 2K
> (2044), but it's always good to just know the facts, namely what the MTU
> was during the test.

OK, that is a good idea. I will try it over the weekend.
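
Something along these lines, I suppose (iperf options from memory, to be
double-checked before the run):

    receiver:  iperf -s -u
    sender:    iperf -c <receiver IP> -u -l 2044 -b 900m -t 60

and then compare the reported bandwidth and the server-side packet loss for
no-batching vs batching.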

> if you have the user-space libraries installed, load ib_uverbs and run the
> command ibv_devinfo; you will see all the InfiniBand devices on your
> system and, for each, its device ID and firmware version. If not, you
> should look at
>
> /sys/class/infiniband/$device/hca_type
> and
> /sys/class/infiniband/$device/fw_ver

Neither of these files is present for ehca, though the ehca0 device itself
is. For mthca, the values are: MT23108 and 3.5.0.

Thanks,

- KK



