[openib-general] IPoIB/ia64 cacheline misses
Grant Grundler
iod00d at hp.com
Sat Jan 22 00:18:18 PST 2005
Here is an example of using pfmon 3.1 to sample data address events.
In short, nothing here jumps out at me and screams "big opportunity
to optimize". But I don't fully understand the data and instruction
flow either, so maybe someone else sees more opportunity.
E.g., I'm wondering whether netfilter is a significant
performance hit.
This first set of data (loads that took more than 64 cycles) is
essentially all cache misses. No surprise that a spinlock seems to be
the worst offender.
ionize:~# fgrep MSI /proc/interrupts ; pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr --smpl-module=dear-hist-itanium2 -e data_ear_cache_lat64 --long-smpl-periods=500 --smpl-periods-random=0xfff:10 --system-wide -k -- /opt/netperf/netperf -t TCP_STREAM -l 50 -H 10.0.1.81 -- -s 262144 -S 262144 -m 4064 ; fgrep MSI /proc/interrupts
64: 5324909 0 PCI-MSI-X ib_mthca (comp)
65: 2 0 PCI-MSI-X ib_mthca (async)
66: 2203 0 PCI-MSI-X ib_mthca (cmd)
added event set 0
only kernel symbols are resolved in system-wide mode
TCP STREAM TEST to 10.0.1.81
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262142 262142   4064    50.00    1561.24
# total_samples 710
# instruction addr view
# sorted by count
# showing per distinct value
# %L2 : percentage of L1 misses that hit L2
# %L3 : percentage of L1 misses that hit L3
# %RAM : percentage of L1 misses that hit memory
# L2 : 5 cycles load latency
# L3 : 14 cycles load latency
# sampling period: 500
# #count %self %cum %L2 %L3 %RAM instruction addr
178 25.07% 25.07% 0.00% 0.00% 100.00% 0xa0000001006b3061 _spin_lock+0x21<kernel>
136 19.15% 44.23% 0.00% 0.00% 100.00% 0xa00000020014d120 __initcall_end+0xff91b468<kernel>
86 12.11% 56.34% 0.00% 0.00% 100.00% 0xa000000100593a21 alloc_skb+0x101<kernel>
85 11.97% 68.31% 0.00% 0.00% 100.00% 0xa00000020014d790 __initcall_end+0xff91bad8<kernel>
85 11.97% 80.28% 0.00% 0.00% 100.00% 0xa000000200149c11 __initcall_end+0xff917f59<kernel>
51 7.18% 87.46% 0.00% 0.00% 100.00% 0xa0000001006b3291 _spin_lock_irqsave+0x31<kernel>
32 4.51% 91.97% 0.00% 0.00% 100.00% 0xa0000001002de7e0 memmove+0x40<kernel>
25 3.52% 95.49% 0.00% 0.00% 100.00% 0xa000000100594550 skb_clone+0x490<kernel>
9 1.27% 96.76% 0.00% 0.00% 100.00% 0xa000000200176970 ipoib_add_port[ib_ipoib]+0x1730<kernel>
7 0.99% 97.75% 0.00% 0.00% 100.00% 0xa00000010008c101 profile_hit+0x1<kernel>
4 0.56% 98.31% 0.00% 0.00% 100.00% 0xa00000010007f471 find_busiest_group+0x5d1<kernel>
3 0.42% 98.73% 0.00% 0.00% 100.00% 0xa00000010007f470 find_busiest_group+0x5d0<kernel>
1 0.14% 98.87% 0.00% 0.00% 100.00% 0xa0000001005e7ff0 tcp_sendmsg+0x1b50<kernel>
1 0.14% 99.01% 0.00% 0.00% 100.00% 0xa00000020014d860 __initcall_end+0xff91bba8<kernel>
1 0.14% 99.15% 0.00% 0.00% 100.00% 0xa00000010054d541 ohci_hub_status_data+0x2a1<kernel>
1 0.14% 99.30% 0.00% 0.00% 100.00% 0xa0000001002e37f0 do_csum+0x70<kernel>
1 0.14% 99.44% 0.00% 0.00% 100.00% 0xa0000001006b0cc0 schedule+0xbe0<kernel>
1 0.14% 99.58% 0.00% 0.00% 100.00% 0xa00000010000a2e1 load_switch_stack+0xc1<kernel>
1 0.14% 99.72% 0.00% 0.00% 100.00% 0xa0000001000a17e0 time_interpolator_update+0x140<kernel>
1 0.14% 99.86% 0.00% 0.00% 100.00% 0xa0000001006b01d0 schedule+0xf0<kernel>
1 0.14% 100.00% 0.00% 0.00% 100.00% 0xa00000010000a2a1 load_switch_stack+0x81<kernel>
# level 0 : counts=0 avg_cycles=0.0ms 0.00%
# level 1 : counts=0 avg_cycles=0.0ms 0.00%
# level 2 : counts=710 avg_cycles=0.0ms 100.00%
approx cost: 0.0s
64: 7187185 0 PCI-MSI-X ib_mthca (comp)
65: 2 0 PCI-MSI-X ib_mthca (async)
66: 2203 0 PCI-MSI-X ib_mthca (cmd)
Corresponding symbols for the __initcall_end+0xff9... entries above
(the first two addresses fall inside mthca_poll_cq, the third inside
mthca_eq_int):

0xa00000020014d120
0xa00000020014d790
    a00000020014cfe0 t mthca_poll_cq    [ib_mthca]
    a00000020014dd60 t mthca_arm_cq     [ib_mthca]
    a00000020014de20 t mthca_init_cq    [ib_mthca]

0xa000000200149c11
    a000000200149b80 t mthca_eq_int     [ib_mthca]
    a00000020014a660 t mthca_interrupt  [ib_mthca]
I'm not sure if prefetching is feasible for the above (a rough sketch
of what I mean is below). And I'm assuming the consumer/producer for
one of the IPoIB queues is why the spinlock is bouncing. So there's
really not that much of interest to look at from a CPU perspective. :^(
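For what it's worth, here is a minimal sketch of the sort of prefetch
I mean, assuming the completion queue is a simple ring of fixed-size
entries. The names (hypothetical_cq, hypothetical_cqe, etc.) are
invented for illustration and are not the actual mthca/ipoib code.

/*
 * Minimal sketch only: issue a software prefetch for the next
 * completion queue entry while the current one is being processed,
 * so the (likely) miss on the DMA'd CQE overlaps with useful work.
 * None of these names come from the real mthca driver.
 */
#include <linux/prefetch.h>

struct hypothetical_cq {
	char		*cqe_buf;	/* base of the CQE ring */
	unsigned int	cqe_size;	/* bytes per CQE */
	unsigned int	cqe_mask;	/* ring size - 1 (power of two) */
	unsigned int	cons_index;	/* consumer index */
};

static inline void *hypothetical_cqe(struct hypothetical_cq *cq,
				     unsigned int idx)
{
	return cq->cqe_buf + (idx & cq->cqe_mask) * cq->cqe_size;
}

static void hypothetical_poll(struct hypothetical_cq *cq, int budget)
{
	while (budget-- > 0) {
		void *cqe = hypothetical_cqe(cq, cq->cons_index);

		/* Start pulling the *next* CQE into cache before we
		 * touch the current one. */
		prefetch(hypothetical_cqe(cq, cq->cons_index + 1));

		/* ... check the ownership bit and process 'cqe' here ... */
		(void)cqe;

		cq->cons_index++;
	}
}

Whether this buys anything depends on how much work is done per CQE
relative to the memory latency; I haven't measured it.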
To help put the first data set into perspective, the next table shows
samples for all loads that took more than 4 cycles. This covers
essentially all levels of cache usage. The table is ordered by the
number of samples for a given IP, not by how many cycles that IP spent
loading data, so later entries in the list may actually have consumed
more cycles (e.g., a handful of loads that go all the way to memory can
cost more than many 5-cycle L2 hits). I've truncated the list but can
make the full list available.
ionize:~# fgrep MSI /proc/interrupts ; pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr --smpl-module=dear-hist-itanium2 -e data_ear_cache_lat4 --long-smpl-periods=100 --smpl-periods-random=0xfff:10 --system-wide -k -- /opt/netperf/netperf -t TCP_STREAM -l 10 -H 10.0.1.81 -- -s 262144 -S 262144 -m 4064 > lat4.out ; fgrep MSI /proc/interrupts
------------ lat4.out --------------
TCP STREAM TEST to 10.0.1.81
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262142 262142   4064    10.00    1562.12
added event set 0
# total_samples 2048
# instruction addr view
# sorted by count
# showing per distinct value
# %L2 : percentage of L1 misses that hit L2
# %L3 : percentage of L1 misses that hit L3
# %RAM : percentage of L1 misses that hit memory
# L2 : 5 cycles load latency
# L3 : 14 cycles load latency
# sampling period: 100
# #count %self %cum %L2 %L3 %RAM instruction addr
161 7.86% 7.86% 0.00% 99.38% 0.62% 0xa0000001002e38e1 do_csum+0x161<kernel>
137 6.69% 14.55% 57.66% 42.34% 0.00% 0xa0000001002e38e0 do_csum+0x160<kernel>
46 2.25% 16.80% 8.70% 91.30% 0.00% 0xa000000100602930 tcp_transmit_skb+0xb10<kernel>
42 2.05% 18.85% 0.00% 100.00% 0.00% 0xa0000001005e7550 tcp_sendmsg+0x10b0<kernel>
40 1.95% 20.80% 0.00% 95.00% 5.00% 0xa00000010000aa70 rse_clear_invalid+0x60<kernel>
36 1.76% 22.56% 0.00% 22.22% 77.78% 0xa0000001006b3580 _read_lock_bh+0x20<kernel>
35 1.71% 24.27% 22.86% 71.43% 5.71% 0xa0000001002e2540 __copy_user+0x100<kernel>
33 1.61% 25.88% 0.00% 84.85% 15.15% 0xa0000001002e2561 __copy_user+0x121<kernel>
32 1.56% 27.44% 0.00% 90.62% 9.38% 0xa0000001002e2541 __copy_user+0x101<kernel>
31 1.51% 28.96% 3.23% 96.77% 0.00% 0xa0000001005e7560 tcp_sendmsg+0x10c0<kernel>
28 1.37% 30.32% 67.86% 32.14% 0.00% 0xa0000001005bbd51 nf_iterate+0x131<kernel>
25 1.22% 31.54% 0.00% 32.00% 68.00% 0xa0000001006b3061 _spin_lock+0x21<kernel>
23 1.12% 32.67% 4.35% 91.30% 4.35% 0xa0000001002e2560 __copy_user+0x120<kernel>
19 0.93% 33.59% 0.00% 100.00% 0.00% 0xa0000001006b3950 _read_unlock_bh+0x10<kernel>
18 0.88% 34.47% 0.00% 16.67% 83.33% 0xa0000001006b3291 _spin_lock_irqsave+0x31<kernel>
18 0.88% 35.35% 61.11% 38.89% 0.00% 0xa000000100655891 ipt_do_table+0x191<kernel>
17 0.83% 36.18% 100.00% 0.00% 0.00% 0xa0000001005bbcb1 nf_iterate+0x91<kernel>
16 0.78% 36.96% 75.00% 25.00% 0.00% 0xa0000001006557f0 ipt_do_table+0xf0<kernel>
...
# level 0 : counts=669 avg_cycles=0.0ms 32.67%
# level 1 : counts=1230 avg_cycles=0.0ms 60.06%
# level 2 : counts=149 avg_cycles=0.0ms 7.28%
grant