[openib-general] ip over ib throughput
Grant Grundler
iod00d at hp.com
Wed Jan 12 17:34:06 PST 2005
On Tue, Jan 04, 2005 at 01:10:15PM -0800, Roland Dreier wrote:
> Josh> I'm seeing about 364 MB/s between 2 PCIe Xeon 3.2GHz boxes
> Josh> using netperf-2.3pl1.
>
> Are you using MSI-X? To use it, set CONFIG_PCI_MSI=y when you build
> your kernel and either "modprobe ib_mthca msi_x=1"...
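For reference, a minimal sketch of verifying that MSI-X really is in effect
(the kernel config path and the interrupt line names in /proc/interrupts are
assumptions; adjust for your kernel/distro):

  # confirm the running kernel was built with MSI support
  grep CONFIG_PCI_MSI /boot/config-$(uname -r)

  # load the HCA driver with MSI-X enabled, per the note above
  modprobe ib_mthca msi_x=1

  # the MSI-X vectors should then show up in /proc/interrupts;
  # the exact line names are driver dependent
  grep -i mthca /proc/interrupts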
Good news: Topspin firmware 3.3.2 can run netperf w/MSI-X on ia64 too.
Bad news: I'm getting weak performance numbers on the ZX1 boxes
(~1580 Mbps == ~200 MB/s). This is with MSI-X enabled on both systems.
The RX2600 is sending TCP_STREAM packets to the RX4640 via a Topspin
12-port switch. The RX2600 has a "Low Profile" (Cougarcub) HCA and the
RX4640 has a "Cougar", installed in "dual rope" slots.
/opt/netperf/netperf -l 60 -H 10.0.1.81 -t TCP_STREAM -i 5,2 -I 99,5 -- -m 8192 -s 262144 -S 262144
TCP STREAM TEST to 10.0.1.81 : +/-2.5% @ 99% conf.
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262142 262142   8192    60.00    1588.33
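For reference, the other end is just running a stock netserver, and my reading
of the options above is sketched below (netserver's default port is 12865; the
per-flag comments are my interpretation of the netperf 2.3 option set):

  # on 10.0.1.81: start the netperf server side (listens on 12865 by default)
  netserver

  # client invocation, annotated:
  #   -l 60          60 second test
  #   -H 10.0.1.81   target host
  #   -t TCP_STREAM  bulk-transfer test
  #   -i 5,2         at most 5, at least 2 iterations for the confidence interval
  #   -I 99,5        99% confidence, 5% interval width (hence the +/-2.5% banner)
  #   -m 8192        8KB send message size
  #   -s/-S 262144   256KB local/remote socket buffer requests
  /opt/netperf/netperf -l 60 -H 10.0.1.81 -t TCP_STREAM -i 5,2 -I 99,5 \
      -- -m 8192 -s 262144 -S 262144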
q-syscollect on netperf client (RX2600, dual 1.5GHz):
ionize:~/.q# q-view kernel-cpu1.info#0 | less
Flat profile of CPU_CYCLES in kernel-cpu1.hist#0:
Each histogram sample counts as 1.00034m seconds
% time self cumul calls self/call tot/call name
25.09 14.98 14.98 80.7k 186u 186u default_idle
9.73 5.81 20.79 35.9M 162n 162n _spin_unlock_irqrestore
5.63 3.36 24.15 27.8M 121n 136n ipt_do_table
4.27 2.55 26.70 15.0M 170n 170n do_csum
3.49 2.08 28.78 6.95M 300n 300n __copy_user
2.66 1.59 30.37 14.3M 111n 673n nf_iterate
2.63 1.57 31.94 5.82M 270n 729n tcp_transmit_skb
2.59 1.54 33.49 68.5M 22.5n 33.2n local_bh_enable
2.33 1.39 34.88 6.79M 205n - tcp_packet
1.83 1.09 35.97 355k 3.08u 32.4u tcp_sendmsg
1.57 0.94 36.91 2.32M 405n 2.11u ipoib_ib_completion
1.48 0.88 37.79 5.92M 149n 162n ip_queue_xmit
1.46 0.87 38.67 2.46M 354n 2.41u mthca_eq_int
1.20 0.72 39.39 6.93M 104n 376n ip_conntrack_in
1.17 0.70 40.08 7.52M 92.6n 92.6n time_interpolator_get_offset
...
And on the "netserver" (RX4640, 4x 1.3GHz) side:
Flat profile of CPU_CYCLES in kernel-cpu3.hist#0:
Each histogram sample counts as 551.305u seconds
% time self cumul calls self/call tot/call name
34.69 18.97 18.97 16.6M 1.15u 1.15u do_csum
7.58 4.15 23.12 19.4M 213n 213n _spin_unlock_irqrestore
6.67 3.65 26.76 61.4k 59.4u 59.4u default_idle
5.33 2.91 29.68 22.3M 131n 149n ipt_do_table
3.02 1.65 31.33 1.93M 856n 8.35u ipoib_ib_completion
2.73 1.49 32.82 6.45M 231n 231n __copy_user
2.61 1.43 34.25 11.2M 128n 1.32u nf_iterate
2.30 1.26 35.51 5.55M 227n - tcp_packet
2.06 1.12 36.63 51.3M 21.9n 25.4n local_bh_enable
1.97 1.08 37.71 5.51M 195n 273n tcp_v4_rcv
1.43 0.78 38.49 1.77M 443n 9.63u mthca_eq_int
1.35 0.74 39.23 5.28M 139n 1.93u netif_receive_skb
1.19 0.65 39.88 5.60M 116n 1.59u ip_conntrack_in
1.14 0.62 40.50 5.53M 113n 2.92u tcp_rcv_established
1.03 0.56 41.06 5.31M 106n 135n ip_route_input
1.02 0.56 41.62 5.24M 107n 1.80u ip_rcv
0.91 0.50 42.12 5.43M 91.6n 369n ip_local_deliver_finish
0.90 0.49 42.61 5.51M 89.7n 89.7n netif_rx
0.89 0.49 43.10 1.93M 253n 9.13u handle_IRQ_event
0.85 0.46 43.56 33.7M 13.8n 13.8n _read_lock_bh
...
_spin_unlock_irqrestore is a clue that we are spending time in interrupt
handlers and that time isn't getting measured.
top was reporting "netserver" consuming ~80% of one CPU
and netperf consuming ~60% of one CPU. The other CPUs were idle
on both boxes. Something else is slowing things down... I know
these boxes are capable of 800-900 MB/s on the PCI bus.
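One cheap way to size that un-measured interrupt-handler time would be to watch
the interrupt rate during a run. A rough sketch (the "mthca" match in
/proc/interrupts is a guess at the MSI-X line names; vmstat only reports an
aggregate per-second count):

  # snapshot the HCA interrupt counters before and after a 60 second run
  grep -i mthca /proc/interrupts > irq.before
  sleep 60
  grep -i mthca /proc/interrupts > irq.after
  diff irq.before irq.after

  # or just watch the system-wide interrupt ("in") and context-switch ("cs")
  # columns while netperf runs
  vmstat 1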
hth,
grant