[openib-general] [PATCH][3/7]ipoib performance patches -- remove tx_ring

Roland Dreier rdreier at cisco.com
Fri May 26 15:48:46 PDT 2006


    Shirley> This patch reduces one extra ring between dev xmit queue
    Shirley> and device send queue and removes tx_lock in completion
    Shirley> handler.

I don't think replacing the ring with a linked list is an
improvement.  At best it works out to be the same.

But I hadn't noticed before that you removed the tx_lock from the
send completion handler.  I suspect that this is where all the
improvement is coming from.  Unfortunately it introduces a couple of
race conditions.  For example, if CPU1 is in ipoib_send() and CPU2 is
running the send completion handler, you can have:

  CPU1                                  CPU2

  err = post_send(priv, wr_id, address->ah, qpn, addr, skb->len);
  // err is non-zero
  if (!err) {
  } else {
    if (!netif_queue_stopped(dev)) {
 
                                        // send completion handler
                                        // runs many times and drains
                                        // completion queue
                                        // queue is never stopped so
                                        // it never gets woken again
                                        if (netif_queue_stopped(dev))
                                          netif_wake_queue(dev);


     netif_stop_queue(dev);

                                        // no more sends are posted so
                                        // another completion never
                                        // occurs and the queue stays
                                        // stopped forever.

There are analogous races that cause spurious wakeups too.

However I think this is a good idea, and I think we can get around the
races by working around them in a way that doesn't hurt performance
(eg by having the netdev tx watchdog restart the queue in the scenario
above).  I cooked up a quick patch (below) but I didn't see any
performance improvement from this in my quick tests -- NPtcp peaks at
~3400 Mbit/sec between my test systems with 2.6.17-rc5 both with and
without this patch.

Then I tried to apply your patches 1/7 and 3/7 to see if they were any
different on my setup, but the patches didn't apply.  3/7 had no
attachment, and it's hopeless to try to apply the patches you send
inline.  So can you resend up-to-date versions of the patches that
give you a 10% improvement?  (BTW it would be nice if you could figure
out a way to fix your mail client to post patches inline without
mangling them, or at least attach them with a mime type of text/plain
or something)

Also, if you're interested, you could try the patch below and see how
it does on your tests.

 - R.

Here's my tx_lock removal test patch (it would need better comments
about all the races with tx_tail and some more careful review before
we actually apply it):

--- infiniband/ulp/ipoib/ipoib_main.c	(revision 7507)
+++ infiniband/ulp/ipoib/ipoib_main.c	(working copy)
@@ -634,6 +634,14 @@ static int ipoib_start_xmit(struct sk_bu
 		return NETDEV_TX_BUSY;
 	}
 
+	/*
+	 * Because tx_lock is not held when updating tx_tail in the
+	 * send completion handler, we may receive a spurious wakeup
+	 * that starts our queue when there really isn't space yet.
+	 */
+	if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size))
+		return NETDEV_TX_BUSY;
+
 	if (skb->dst && skb->dst->neighbour) {
 		if (unlikely(!*to_ipoib_neigh(skb->dst->neighbour))) {
 			ipoib_path_lookup(skb, dev);
@@ -703,6 +711,21 @@ static struct net_device_stats *ipoib_ge
 static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
+	int lost_wakeup = 0;
+
+	spin_lock_irqsave(&priv->tx_lock, flags);
+	if (netif_queue_stopped(dev) &&
+	    priv->tx_head - priv->tx_tail < ipoib_sendq_size) {
+		ipoib_dbg(priv, "lost wakeup, head %u, tail %u\n",
+			  priv->tx_head, priv->tx_tail);
+		lost_wakeup = 1;
+		netif_wake_queue(dev);
+	}
+	spin_unlock_irqrestore(&priv->tx_lock, flags);
+
+	if (lost_wakeup)
+		return;
 
 	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));
--- infiniband/ulp/ipoib/ipoib_ib.c	(revision 7511)
+++ infiniband/ulp/ipoib/ipoib_ib.c	(working copy)
@@ -244,7 +244,6 @@ static void ipoib_ib_handle_wc(struct ne
 
 	} else {
 		struct ipoib_tx_buf *tx_req;
-		unsigned long flags;
 
 		if (wr_id >= ipoib_sendq_size) {
 			ipoib_warn(priv, "completion event with wrid %d (> %d)\n",
@@ -266,12 +265,17 @@ static void ipoib_ib_handle_wc(struct ne
 
 		dev_kfree_skb_any(tx_req->skb);
 
-		spin_lock_irqsave(&priv->tx_lock, flags);
 		++priv->tx_tail;
+
+		/*
+		 * Since we don't hold tx_lock here, this may lead to
+		 * both lost wakeups (which we deal with in our
+		 * watchdog) and spurious wakeups (which we deal with
+		 * by handling TX ring overflows in the xmit function).
+		 */
 		if (netif_queue_stopped(dev) &&
 		    priv->tx_head - priv->tx_tail <= ipoib_sendq_size >> 1)
 			netif_wake_queue(dev);
-		spin_unlock_irqrestore(&priv->tx_lock, flags);
 
 		if (wc->status != IB_WC_SUCCESS &&
 		    wc->status != IB_WC_WR_FLUSH_ERR)


