[openib-general] IPoIB performance.

Roland Dreier roland at topspin.com
Mon Dec 27 08:21:13 PST 2004


    Ido> 1. We can divide the single CQ into two separate completion
    Ido> queues: one for the RQ and the other for the SQ.  Then we can
    Ido> change the CQ policy affiliated with the SQ to
    Ido> IB_CQ_CONSUMER_REARM and, in the mainstream path, not arm the
    Ido> CQ.  In that case the TX CQ poll will be called from the
    Ido> send_packet method and will reap completions without any need
    Ido> for interrupts/events.  Obviously, in cases where we have to
    Ido> stop the queue (e.g. no more room available), we need to arm
    Ido> the CQ until completions arrive.  In general, this change
    Ido> reduces the interrupt rate.  It may also help when posting to
    Ido> and polling the SQ happen on two different processors
    Ido> (e.g. spinlock contention).

I guess you also need to set a timer to poll the send queue so that
you eventually get a completion for all sends, even when a packet is
sent and another one isn't sent for a while.

Also we would need to try a variety of workloads, because splitting
the send and receive CQs means that we will always have to take the
CQ lock twice to poll both send and receive completions.  For
example, "NPtcp -2" would be a useful test.
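
Just to make the first idea concrete, here is roughly the shape I'd
expect it to take.  This is only an untested sketch: ipoib_drain_tx_cq,
ipoib_handle_tx_wc, ipoib_send_packet and priv->send_cq are made-up
names for the split-CQ world, not anything in the tree today.

static void ipoib_drain_tx_cq(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct ib_wc wc;

	/* Reap finished sends without waiting for an event.  A
	 * periodic timer would still have to call this so the last
	 * few sends get reaped when the interface goes quiet. */
	while (ib_poll_cq(priv->send_cq, 1, &wc) > 0)
		ipoib_handle_tx_wc(dev, &wc);	/* frees the skb, ++tx_tail */
}

static int ipoib_send_packet(struct sk_buff *skb, struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);

	ipoib_drain_tx_cq(dev);

	/* ... build and post the send WQE exactly as today ... */

	if (priv->tx_head - priv->tx_tail == IPOIB_TX_RING_SIZE) {
		/* Ring is full, so now we really do need an event. */
		netif_stop_queue(dev);
		ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP);
		/* Poll once more in case a completion raced the rearm. */
		ipoib_drain_tx_cq(dev);
	}

	return 0;
}

I've left the tx_lock handling out of the sketch; that's exactly where
the extra locking mentioned above would show up.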

    Ido> 2. The current IPoIB driver is signaling every transmitted
    Ido> packet.  We can improve performance with selective signaling
    Ido> (e.g. every 5 to 10 packets).  Note that we did notice
    Ido> several problems when doing this.  This approach can be a
    Ido> problem in the case of, for example, ping (ICMP), which does
    Ido> not allocate new buffers before the first ones are released.
    Ido> I can think of some workarounds for this problem, such as
    Ido> sending a dummy packet every now and then (which won't go out
    Ido> of the device).  This includes changing the send policy to
    Ido> IB_WQ_SIGNAL_SELECTABLE.  When a signaled WQE completes, the
    Ido> ipoib driver has to internally complete the rest of the WQEs
    Ido> that weren't signaled.  This change mainly reduces the
    Ido> overhead required by the HW driver to poll the completion
    Ido> queue.

I've looked at this in the past as well.  As you point out, the kernel
needs the destructor of an skb being sent to be called before it can
free space in a socket buffer, so you would need to be very clever
here.  I'll be curious to see your approach.
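
For reference, the selective-signaling part would presumably look
something like the sketch below.  Again this is only an illustration:
IPOIB_SIGNAL_BATCH, tx_unsignaled and ipoib_reap_tx are invented names,
and it assumes the QP is created with sq_sig_type = IB_SIGNAL_REQ_WR so
that per-WR signaling actually takes effect.

#define IPOIB_SIGNAL_BATCH 8	/* signal one send out of every 8 */

static inline void ipoib_set_signal_flag(struct ipoib_dev_priv *priv)
{
	/* Leave most sends unsignaled and mark every Nth one. */
	priv->tx_wr.send_flags = 0;
	if (++priv->tx_unsignaled >= IPOIB_SIGNAL_BATCH) {
		priv->tx_wr.send_flags = IB_SEND_SIGNALED;
		priv->tx_unsignaled = 0;
	}
}

/* When the signaled send completes, every send posted before it is
 * done as well, so retire the unsignaled WQEs too.  This assumes
 * wr_id carries the unmasked tx_head value of the signaled send. */
static void ipoib_reap_tx(struct net_device *dev, unsigned int wr_id)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);

	while ((int) (wr_id - priv->tx_tail) >= 0) {
		struct ipoib_tx_buf *tx_req =
			&priv->tx_ring[priv->tx_tail & (IPOIB_TX_RING_SIZE - 1)];

		/* This is where the skb destructor finally runs, which
		 * is the socket buffer accounting problem above. */
		dev_kfree_skb_any(tx_req->skb);
		++priv->tx_tail;
	}
}

The dev_kfree_skb_any() there is the crux: batching the signaling
defers the skb destructor by up to IPOIB_SIGNAL_BATCH packets, which
is what hurts workloads like ping that wait for their buffers to come
back.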

    Ido> 3. I think we should call netif_wake_queue only if the queue
    Ido> is actually stopped because as far as I understand, the
    Ido> kernel can schedule another process after calling wake queue.

Thanks for pointing this out (although of course netif_wake_queue and
__netif_schedule don't actually schedule a new process -- since we're
calling netif_wake_queue from interrupt context they couldn't in any
case).  The expensive part seems to be the clearing and restoring of
interrupts in __netif_schedule.  In any case I committed the change
below, which seems to be about a 3% improvement.

    Ido> I have tried these changes and got a ~20% improvement in
    Ido> performance (2 Tavor machines with dual 3.1 GHz Xeon CPUs,
    Ido> AS3.0 U3).  If you find it interesting, I can work on a patch
    Ido> for gen2.

If you write a patch and do some benchmarking, that would be great.

Thanks,
  Roland

Index: ulp/ipoib/ipoib_ib.c
===================================================================
--- ulp/ipoib/ipoib_ib.c	(revision 1383)
+++ ulp/ipoib/ipoib_ib.c	(working copy)
@@ -249,7 +249,8 @@
 
 		spin_lock_irqsave(&priv->tx_lock, flags);
 		++priv->tx_tail;
-		if (priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2)
+		if (netif_queue_stopped(dev) &&
+		    priv->tx_head - priv->tx_tail <= IPOIB_TX_RING_SIZE / 2)
 			netif_wake_queue(dev);
 		spin_unlock_irqrestore(&priv->tx_lock, flags);
 


