[openib-general] IPoIB/ia64 cacheline misses
Grant Grundler
iod00d at hp.com
Sat Jan 22 00:18:18 PST 2005
Here is an example of using pfmon 3.1 to sample data address events.
In short, nothing here jumps out at me and screams "big opportunity
to optimize". But I don't fully understand the data and instruction
flow either, so maybe someone else sees more opportunity.
E.g., I'm wondering whether netfilter is a significant
performance hit.
This first set of data (loads that took more than 64 cycles) is
essentially all cache misses. No surprise that a spinlock seems to be
the worst offender.
ionize:~# fgrep MSI /proc/interrupts ; pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr --smpl-module=dear-hist-itanium2 -e data_ear_cache_lat64 --long-smpl-periods=500 --smpl-periods-random=0xfff:10 --system-wide -k -- /opt/netperf/netperf -t TCP_STREAM -l 50 -H 10.0.1.81 -- -s 262144 -S 262144 -m 4064 ; fgrep MSI /proc/interrupts
64: 5324909 0 PCI-MSI-X ib_mthca (comp)
65: 2 0 PCI-MSI-X ib_mthca (async)
66: 2203 0 PCI-MSI-X ib_mthca (cmd)
added event set 0
only kernel symbols are resolved in system-wide mode
TCP STREAM TEST to 10.0.1.81
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262142 262142   4064    50.00    1561.24
# total_samples 710
# instruction addr view
# sorted by count
# showing per distinct value
# %L2 : percentage of L1 misses that hit L2
# %L3 : percentage of L1 misses that hit L3
# %RAM : percentage of L1 misses that hit memory
# L2 : 5 cycles load latency
# L3 : 14 cycles load latency
# sampling period: 500
# #count %self %cum %L2 %L3 %RAM instruction addr
178 25.07% 25.07% 0.00% 0.00% 100.00% 0xa0000001006b3061 _spin_lock+0x21<kernel>
136 19.15% 44.23% 0.00% 0.00% 100.00% 0xa00000020014d120 __initcall_end+0xff91b468<kernel>
86 12.11% 56.34% 0.00% 0.00% 100.00% 0xa000000100593a21 alloc_skb+0x101<kernel>
85 11.97% 68.31% 0.00% 0.00% 100.00% 0xa00000020014d790 __initcall_end+0xff91bad8<kernel>
85 11.97% 80.28% 0.00% 0.00% 100.00% 0xa000000200149c11 __initcall_end+0xff917f59<kernel>
51 7.18% 87.46% 0.00% 0.00% 100.00% 0xa0000001006b3291 _spin_lock_irqsave+0x31<kernel>
32 4.51% 91.97% 0.00% 0.00% 100.00% 0xa0000001002de7e0 memmove+0x40<kernel>
25 3.52% 95.49% 0.00% 0.00% 100.00% 0xa000000100594550 skb_clone+0x490<kernel>
9 1.27% 96.76% 0.00% 0.00% 100.00% 0xa000000200176970 ipoib_add_port[ib_ipoib]+0x1730<kernel>
7 0.99% 97.75% 0.00% 0.00% 100.00% 0xa00000010008c101 profile_hit+0x1<kernel>
4 0.56% 98.31% 0.00% 0.00% 100.00% 0xa00000010007f471 find_busiest_group+0x5d1<kernel>
3 0.42% 98.73% 0.00% 0.00% 100.00% 0xa00000010007f470 find_busiest_group+0x5d0<kernel>
1 0.14% 98.87% 0.00% 0.00% 100.00% 0xa0000001005e7ff0 tcp_sendmsg+0x1b50<kernel>
1 0.14% 99.01% 0.00% 0.00% 100.00% 0xa00000020014d860 __initcall_end+0xff91bba8<kernel>
1 0.14% 99.15% 0.00% 0.00% 100.00% 0xa00000010054d541 ohci_hub_status_data+0x2a1<kernel>
1 0.14% 99.30% 0.00% 0.00% 100.00% 0xa0000001002e37f0 do_csum+0x70<kernel>
1 0.14% 99.44% 0.00% 0.00% 100.00% 0xa0000001006b0cc0 schedule+0xbe0<kernel>
1 0.14% 99.58% 0.00% 0.00% 100.00% 0xa00000010000a2e1 load_switch_stack+0xc1<kernel>
1 0.14% 99.72% 0.00% 0.00% 100.00% 0xa0000001000a17e0 time_interpolator_update+0x140<kernel>
1 0.14% 99.86% 0.00% 0.00% 100.00% 0xa0000001006b01d0 schedule+0xf0<kernel>
1 0.14% 100.00% 0.00% 0.00% 100.00% 0xa00000010000a2a1 load_switch_stack+0x81<kernel>
# level 0 : counts=0 avg_cycles=0.0ms 0.00%
# level 1 : counts=0 avg_cycles=0.0ms 0.00%
# level 2 : counts=710 avg_cycles=0.0ms 100.00%
approx cost: 0.0s
64: 7187185 0 PCI-MSI-X ib_mthca (comp)
65: 2 0 PCI-MSI-X ib_mthca (async)
66: 2203 0 PCI-MSI-X ib_mthca (cmd)
Corresponding symbols for the __initcall_end+0xff9... entries above
(the first two addresses fall inside mthca_poll_cq, the third inside
mthca_eq_int):

0xa00000020014d120
0xa00000020014d790
    a00000020014cfe0 t mthca_poll_cq    [ib_mthca]
    a00000020014dd60 t mthca_arm_cq     [ib_mthca]
    a00000020014de20 t mthca_init_cq    [ib_mthca]

0xa000000200149c11
    a000000200149b80 t mthca_eq_int     [ib_mthca]
    a00000020014a660 t mthca_interrupt  [ib_mthca]
I'm not sure if prefetching is feasible for the above (a rough sketch
of what I mean is below). And I'm assuming the consumer/producer for
one of the IPoIB queues is why the spinlock is bouncing. So there's
really not that much of interest to look at from a CPU perspective. :^(
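For what it's worth, here is a minimal sketch of the sort of prefetch
I mean, assuming the completion queue is a simple ring of fixed-size
entries. The names (hypothetical_cq, hypothetical_cqe, etc.) are
invented for illustration and are not the actual mthca/ipoib code.

/*
 * Minimal sketch only: issue a software prefetch for the next
 * completion queue entry while the current one is being processed,
 * so the (likely) miss on the DMA'd CQE overlaps with useful work.
 * None of these names come from the real mthca driver.
 */
#include <linux/prefetch.h>

struct hypothetical_cq {
	char		*cqe_buf;	/* base of the CQE ring */
	unsigned int	cqe_size;	/* bytes per CQE */
	unsigned int	cqe_mask;	/* ring size - 1 (power of two) */
	unsigned int	cons_index;	/* consumer index */
};

static inline void *hypothetical_cqe(struct hypothetical_cq *cq,
				     unsigned int idx)
{
	return cq->cqe_buf + (idx & cq->cqe_mask) * cq->cqe_size;
}

static void hypothetical_poll(struct hypothetical_cq *cq, int budget)
{
	while (budget-- > 0) {
		void *cqe = hypothetical_cqe(cq, cq->cons_index);

		/* Start pulling the *next* CQE into cache before we
		 * touch the current one. */
		prefetch(hypothetical_cqe(cq, cq->cons_index + 1));

		/* ... check the ownership bit and process 'cqe' here ... */
		(void)cqe;

		cq->cons_index++;
	}
}

Whether this buys anything depends on how much work is done per CQE
relative to the memory latency; I haven't measured it.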
To help put the first data set into perspective, the next table shows
samples for all loads that took more than 4 cycles. This covers
essentially all levels of cache usage. The table is ordered by the
number of samples for a given IP, not by how many cycles that IP spent
loading data, so later entries in the list may actually have consumed
more cycles (e.g., a handful of loads that go all the way to memory can
cost more than many 5-cycle L2 hits). I've truncated the list but can
make the full list available.
ionize:~# fgrep MSI /proc/interrupts ; pfmon31 --us-c --cpu-list=0 --pin-command --resolve-addr --smpl-module=dear-hist-itanium2 -e data_ear_cache_lat4 --long-smpl-periods=100 --smpl-periods-random=0xfff:10 --system-wide -k -- /opt/netperf/netperf -t TCP_STREAM -l 10 -H 10.0.1.81 -- -s 262144 -S 262144 -m 4064 > lat4.out ; fgrep MSI /proc/interrupts
------------ lat4.out --------------
TCP STREAM TEST to 10.0.1.81
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262142 262142   4064    10.00    1562.12
added event set 0
# total_samples 2048
# instruction addr view
# sorted by count
# showing per distinct value
# %L2 : percentage of L1 misses that hit L2
# %L3 : percentage of L1 misses that hit L3
# %RAM : percentage of L1 misses that hit memory
# L2 : 5 cycles load latency
# L3 : 14 cycles load latency
# sampling period: 100
# #count %self %cum %L2 %L3 %RAM instruction addr
161 7.86% 7.86% 0.00% 99.38% 0.62% 0xa0000001002e38e1 do_csum+0x161<kernel>
137 6.69% 14.55% 57.66% 42.34% 0.00% 0xa0000001002e38e0 do_csum+0x160<kernel>
46 2.25% 16.80% 8.70% 91.30% 0.00% 0xa000000100602930 tcp_transmit_skb+0xb10<kernel>
42 2.05% 18.85% 0.00% 100.00% 0.00% 0xa0000001005e7550 tcp_sendmsg+0x10b0<kernel>
40 1.95% 20.80% 0.00% 95.00% 5.00% 0xa00000010000aa70 rse_clear_invalid+0x60<kernel>
36 1.76% 22.56% 0.00% 22.22% 77.78% 0xa0000001006b3580 _read_lock_bh+0x20<kernel>
35 1.71% 24.27% 22.86% 71.43% 5.71% 0xa0000001002e2540 __copy_user+0x100<kernel>
33 1.61% 25.88% 0.00% 84.85% 15.15% 0xa0000001002e2561 __copy_user+0x121<kernel>
32 1.56% 27.44% 0.00% 90.62% 9.38% 0xa0000001002e2541 __copy_user+0x101<kernel>
31 1.51% 28.96% 3.23% 96.77% 0.00% 0xa0000001005e7560 tcp_sendmsg+0x10c0<kernel>
28 1.37% 30.32% 67.86% 32.14% 0.00% 0xa0000001005bbd51 nf_iterate+0x131<kernel>
25 1.22% 31.54% 0.00% 32.00% 68.00% 0xa0000001006b3061 _spin_lock+0x21<kernel>
23 1.12% 32.67% 4.35% 91.30% 4.35% 0xa0000001002e2560 __copy_user+0x120<kernel>
19 0.93% 33.59% 0.00% 100.00% 0.00% 0xa0000001006b3950 _read_unlock_bh+0x10<kernel>
18 0.88% 34.47% 0.00% 16.67% 83.33% 0xa0000001006b3291 _spin_lock_irqsave+0x31<kernel>
18 0.88% 35.35% 61.11% 38.89% 0.00% 0xa000000100655891 ipt_do_table+0x191<kernel>
17 0.83% 36.18% 100.00% 0.00% 0.00% 0xa0000001005bbcb1 nf_iterate+0x91<kernel>
16 0.78% 36.96% 75.00% 25.00% 0.00% 0xa0000001006557f0 ipt_do_table+0xf0<kernel>
...
# level 0 : counts=669 avg_cycles=0.0ms 32.67%
# level 1 : counts=1230 avg_cycles=0.0ms 60.06%
# level 2 : counts=149 avg_cycles=0.0ms 7.28%
grant