[ewg] [PATCH v2] IB/ipoib: copy small SKBs in CM mode

Eli Cohen eli at dev.mellanox.co.il
Tue May 27 02:05:48 PDT 2008


From db74e3fc04ef41da02d65c056b78275365891b3d Mon Sep 17 00:00:00 2001
From: Eli Cohen <eli at mellanox.co.il>
Date: Thu, 22 May 2008 16:28:59 +0300
Subject: [PATCH] IB/ipoib: copy small SKBs in CM mode

The connected mode of ipoib has a large overhead in the receive flow for
managing SKBs: it allocates a new SKB with as much data as was used in the
currently received SKB and moves the unused fragments from the old SKB to
the new one. This involves a loop over all the remaining fragments and
incurs CPU overhead.

For small SKBs, this patch instead allocates an SKB just large enough to
contain the received data and copies the data into it from the received SKB.
The newly allocated SKB is passed up the stack and the old SKB is reposted.
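
(As an aside, this is essentially a "copybreak" scheme for the CM receive
path. The fragment below is only an illustrative userspace sketch of that
decision, not driver code: rx_buf, handle_rx() and COPY_THRESHOLD are
made-up names, and plain malloc()/memcpy() stand in for the SKB allocation
and the DMA sync handling that the real patch uses.)

    #include <stdlib.h>
    #include <string.h>

    #define COPY_THRESHOLD 256	/* plays the role of SKB_TSHOLD */

    struct rx_buf {
    	unsigned char *data;
    	size_t len;
    };

    /*
     * Return the buffer that should be handed up the stack; *repost is
     * set when the original large buffer can go straight back onto the
     * receive ring.
     */
    struct rx_buf *handle_rx(struct rx_buf *big, size_t byte_len, int *repost)
    {
    	if (byte_len < COPY_THRESHOLD) {
    		struct rx_buf *small = malloc(sizeof(*small));

    		if (small && (small->data = malloc(byte_len))) {
    			/* copy the payload into a right-sized buffer */
    			memcpy(small->data, big->data, byte_len);
    			small->len = byte_len;
    			*repost = 1;	/* big buffer stays posted */
    			return small;
    		}
    		free(small);
    	}

    	/* Large packet (or allocation failure): hand the big buffer up
    	 * and let the caller allocate a replacement for the ring. */
    	*repost = 0;
    	big->len = byte_len;
    	return big;
    }

In the driver itself the threshold is SKB_TSHOLD, and the copy is bracketed
by ib_dma_sync_single_for_cpu()/ib_dma_sync_single_for_device() so the CPU
sees coherent data before copying (see the patch below).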

When running netperf with small UDP messages, without this patch I get:

UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
14.4.3.178 (14.4.3.178) port 0 AF_INET
Socket  Message  Elapsed      Messages
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

114688     128   10.00     5142034      0     526.31
114688           10.00     1130489            115.71

With this patch I get both send and receive at ~315 Mbit/sec.
The reason is as follows: with the patch, the CPU overhead of handling RX
packets is dramatically reduced, so we no longer see RNR NACK messages from
the receiver, which cause the connection to be closed and reopened. Without
the patch, the receiver cannot handle the packets fast enough, so there is
less time to post new buffers, hence the mentioned RNR NACKs. What happens
is that the application *thinks* it posted a certain number of packets for
transmission, but these packets are flushed and never actually transmitted.
Since the connection gets opened and closed many times, each time netperf
gets the CPU time that would otherwise have been used by IPoIB to actually
transmit the packets.
This can be verified by looking at the port counters, the output of
ifconfig and the output of netperf (this is for the case without the patch):

tx packets
==========
port counter:   1,543,996
ifconfig:       1,581,426
netperf:        5,142,034

rx packets
==========
netperf:        1,130,489

Signed-off-by: Eli Cohen <eli at mellanox.co.il>
---
Changes since V1:
1. Wrapped the call to skb_copy_from_linear_data() with calls to
   ib_dma_sync_single_for_cpu() and ib_dma_sync_single_for_device().
2. Ensured SKB_TSHOLD is not defined too large.

 drivers/infiniband/ulp/ipoib/ipoib.h      |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   |   19 +++++++++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   10 ++++++++++
 3 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index ca126fc..e39bf36 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -97,6 +97,7 @@ enum {
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
 
 	MAX_SEND_CQE		  = 16,
+	SKB_TSHOLD		  = 256,
 };
 
 #define	IPOIB_OP_RECV   (1ul << 31)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 97e67d3..7be0a43 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -525,6 +525,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	u64 mapping[IPOIB_CM_RX_SG];
 	int frags;
 	int has_srq;
+	struct sk_buff *small_skb;
 
 	ipoib_dbg_data(priv, "cm recv completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -579,6 +580,23 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		}
 	}
 
+	if (wc->byte_len < SKB_TSHOLD) {
+		int dlen = wc->byte_len;
+
+		small_skb = dev_alloc_skb(dlen + 12);
+		if (small_skb) {
+			skb_reserve(small_skb, 12);
+			ib_dma_sync_single_for_cpu(priv->ca, rx_ring[wr_id].mapping[0],
+						   dlen, DMA_FROM_DEVICE);
+			skb_copy_from_linear_data(skb, small_skb->data, dlen);
+			ib_dma_sync_single_for_device(priv->ca, rx_ring[wr_id].mapping[0],
+						      dlen, DMA_FROM_DEVICE);
+			skb_put(small_skb, dlen);
+			skb = small_skb;
+			goto copied;
+		}
+	}
+
 	frags = PAGE_ALIGN(wc->byte_len - min(wc->byte_len,
 					      (unsigned)IPOIB_CM_HEAD_SIZE)) / PAGE_SIZE;
 
@@ -601,6 +619,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	skb_put_frags(skb, IPOIB_CM_HEAD_SIZE, wc->byte_len, newskb);
 
+copied:
 	skb->protocol = ((struct ipoib_header *) skb->data)->proto;
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 2442090..ec6e7c5 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1304,6 +1304,16 @@ static int __init ipoib_init_module(void)
 	ipoib_max_conn_qp = min(ipoib_max_conn_qp, IPOIB_CM_MAX_CONN_QP);
 #endif
 
+	/*
+	 * we rely on this condition when copying small skbs and we
+	 * pass ownership of the first fragment only.
+	 */
+	if (SKB_TSHOLD > IPOIB_CM_HEAD_SIZE) {
+		printk(KERN_WARNING "%s: SKB_TSHOLD(%d) must not be larger than %d\n",
+		       THIS_MODULE->name, SKB_TSHOLD, IPOIB_CM_HEAD_SIZE);
+		return -EINVAL;
+	}
+
 	ret = ipoib_register_debugfs();
 	if (ret)
 		return ret;
-- 
1.5.5.1