<div dir="ltr"><div><div><div><div><div><div><div><div>Hi,<br><br>We have been able to reproduce what we see in our environment using iperf with many parallel threads. We see the TX drop counters increasing for the bonded interface and we get a very occasional "ib0: dev_queue_xmit failed to requeue packet" in dmesg.<br>

We seem to be able to squash both of these issues by changing the IPoIB mode from connected to datagram. When running in connected mode we can see the TX drop counters increase considerably, but they stop completely when we switch to datagram mode. On changing back to connected, the counters start increasing again.
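
In case it matters, we are flipping the mode with the per-interface sysfs knob (this assumes the stock ib_ipoib sysfs interface on RHEL 6.4, and a 2044-byte datagram MTU to match our 2048 active_mtu):

    # check the current mode on each slave
    cat /sys/class/net/ib0/mode
    # switch to datagram on the fly, then drop the MTU to the UD limit
    echo datagram > /sys/class/net/ib0/mode
    ip link set ib0 mtu 2044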

I have a few questions around this:

1) I presume it is possible for us to run both connected and datagram mode side by side? If we were to configure one of our publishing hosts to use datagram mode (in anticipation of the TX drops stopping), would the rest of the TCP subscribers running in connected mode continue to see the publishes? If the issue is resolved there, we would intend to change the mode to datagram across our estate, but we would like to evaluate the change on a single host first. The tests I have been doing with iperf suggest this is the case, although iperf is continually creating new sockets.

2) We were using connected mode primarily for the larger MTU, the expectation being that the subscribing hosts would be under less load from fewer packets, checksumming overhead and so on. Are there any other gotchas when running in datagram mode? Or, put another way, what benefits does IPoIB-CM provide and when should it be used? We're currently running a rather large TCP matrix.

3) When using IPoIB-CM, does anyone know of any limitations around the number of active TCP sockets or, say, publishing threads? When we made our change to use the IPoIB interfaces instead of Ethernet, we pretty much immediately started to see page allocation failures, along with the TX drops and the failed-to-requeue-packet errors. To me this suggests some kind of contention issue when using connected mode, but this is probably wildly off the mark...

Any feedback appreciated.

Cheers,
-Andrew

On Mon, Aug 5, 2013 at 11:28 AM, Andrew McKinney <am@sativa.org.uk> wrote:
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div>I'm also seeing tx drops on bond0:<br><br><br>bond0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 <br>
inet addr:192.168.100.10 Bcast:192.168.100.255 Mask:255.255.255.0<br>
inet6 addr: fe80::202:c903:57:2765/64 Scope:Link<br> UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1<br> RX packets:458926490 errors:0 dropped:1 overruns:0 frame:0<br> TX packets:547157428 errors:0 dropped:30978 overruns:0 carrier:0<br>
collisions:0 txqueuelen:0 <br> RX bytes:18392139044 (17.1 GiB) TX bytes:436943339339 (406.9 GiB)<br><br></div>I don't seem to be able to configure any ring buffers on the ib interfaces using ethtool - is there any other way of doing this?<br>
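
The closest thing I have found so far is the ib_ipoib module's queue-size parameters; a sketch of what I mean, assuming the send_queue_size/recv_queue_size options that modinfo ib_ipoib lists, and that reloading the module is acceptable (the values below are illustrative):

    # current values
    cat /sys/module/ib_ipoib/parameters/send_queue_size
    cat /sys/module/ib_ipoib/parameters/recv_queue_size
    # persist larger rings and reload the module
    echo "options ib_ipoib send_queue_size=512 recv_queue_size=512" > /etc/modprobe.d/ipoib.conf
    modprobe -r ib_ipoib && modprobe ib_ipoib

I haven't yet confirmed whether this has any effect on the TX drops.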

Thanks,
-Andrew

On Mon, Aug 5, 2013 at 10:42 AM, Andrew McKinney <am@sativa.org.uk> wrote:
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div>Hi list.<br><br>We're running a TCP middleware over IPoIB-CM (OFED-3.5-2) on Red Hat 6.4. We intend to eventually run a multicast RDMA middleware on the stack.<br>
<br></div>The hardware stack is Dell R720s (some Westmere, mostly Sandy Bridge) with bonded Mellanox MT26428 ConnectX-2 on two QLogc 12300 managed switches. We're runnign the latest firmware on the HCAs and the switches.<br>

We have been seeing the following messages in the kernel ring buffer, which also seem to coincide with page allocation errors:

ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
java: page allocation failure. order:1, mode:0x20
Pid: 24410, comm: java Tainted: P --------------- 2.6.32-279.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
 [<ffffffff81489c00>] ? tcp_rcv_established+0x290/0x800
 [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
 [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
 [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
 [<ffffffff811635bf>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
 [<ffffffff811637fb>] ? __kmalloc_node+0x7b/0x100
 [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
 [<ffffffff8143028d>] ? dev_alloc_skb+0x1d/0x40
 [<ffffffffa0673f90>] ? ipoib_cm_alloc_rx_skb+0x30/0x430 [ib_ipoib]
 [<ffffffffa067523f>] ? ipoib_cm_handle_rx_wc+0x29f/0x770 [ib_ipoib]
 [<ffffffffa018c828>] ? mlx4_ib_poll_cq+0xa8/0x890 [mlx4_ib]
 [<ffffffffa066c01d>] ? ipoib_ib_completion+0x2d/0x30 [ib_ipoib]
 [<ffffffffa066d80b>] ? ipoib_poll+0xdb/0x190 [ib_ipoib]
 [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
 [<ffffffff8143f193>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810db800>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
 [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
 [<ffffffff81505af5>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
 <EOI>

These appear to be genuine drops, as we are seeing gaps in our middleware, which then goes on to re-cap.

We've just made a change to increase the page cache from ~90M to 128M - but what is the list's feeling on the dev_queue_xmit errors? Could they be caused by the same issue - an inability to allocate pages in a timely manner, perhaps?
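
(For the curious: a minimal sketch of that kind of change, assuming vm.min_free_kbytes is the relevant knob for keeping headroom for these atomic order:1 allocations; 131072 is just the 128M figure above expressed in KB.)

    # apply at runtime
    sysctl -w vm.min_free_kbytes=131072
    # persist across reboots
    echo "vm.min_free_kbytes = 131072" >> /etc/sysctl.conf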

We're not running at anywhere near high message rates (fewer than 1000 messages per second at ~450 bytes each).

I can see a thread started in 2012 where someone triggered these dev_queue_xmit messages using netperf, and Roland had suggested that at worst one packet was being dropped. Silence after that.

Has anyone seen this behavior, or got any pointers to chase this down?

Cheers,
-Andrew

ibv_devinfo

hca_id: mlx4_1
        transport: InfiniBand (0)
        fw_ver: 2.9.1000
        node_guid: 0002:c903:0057:2250
        sys_image_guid: 0002:c903:0057:2253
        vendor_id: 0x02c9
        vendor_part_id: 26428
        hw_ver: 0xB0
        board_id: MT_0D90110009
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffe00
        max_qp: 163776
        max_qp_wr: 16351
        device_cap_flags: 0x007c9c76
        max_sge: 32
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 4194303
        max_mr: 524272
        max_pd: 32764
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 2620416
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 8192
        max_mcast_qp_attach: 248
        max_total_mcast_qp_attach: 2031616
        max_ah: 0
        max_fmr: 0
        max_srq: 65472
        max_srq_wr: 16383
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 15
        port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 2048 (4)
                sm_lid: 1
                port_lid: 9
                port_lmc: 0x00
                link_layer: InfiniBand
                max_msg_sz: 0x40000000
                port_cap_flags: 0x02510868
                max_vl_num: 4 (3)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 128
                gid_tbl_len: 128
                subnet_timeout: 17
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 10.0 Gbps (4)
                phys_state: LINK_UP (5)
                GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2251

hca_id: mlx4_0
        transport: InfiniBand (0)
        fw_ver: 2.9.1000
        node_guid: 0002:c903:0057:2764
        sys_image_guid: 0002:c903:0057:2767
        vendor_id: 0x02c9
        vendor_part_id: 26428
        hw_ver: 0xB0
        board_id: MT_0D90110009
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffe00
        max_qp: 163776
        max_qp_wr: 16351
        device_cap_flags: 0x007c9c76
        max_sge: 32
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 4194303
        max_mr: 524272
        max_pd: 32764
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 2620416
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 8192
        max_mcast_qp_attach: 248
        max_total_mcast_qp_attach: 2031616
        max_ah: 0
        max_fmr: 0
        max_srq: 65472
        max_srq_wr: 16383
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 15
        port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 2048 (4)
                sm_lid: 1
                port_lid: 10
                port_lmc: 0x00
                link_layer: InfiniBand
                max_msg_sz: 0x40000000
                port_cap_flags: 0x02510868
                max_vl_num: 4 (3)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 128
                gid_tbl_len: 128
                subnet_timeout: 17
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 10.0 Gbps (4)
                phys_state: LINK_UP (5)
                GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2765

slabtop

 Active / Total Objects (% used)    : 3436408 / 5925284 (58.0%)
 Active / Total Slabs (% used)      : 178659 / 178867 (99.9%)
 Active / Total Caches (% used)     : 117 / 193 (60.6%)
 Active / Total Size (% used)       : 422516.74K / 692339.54K (61.0%)
 Minimum / Average / Maximum Object : 0.02K / 0.12K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4461349 2084881 46% 0.10K 120577 37 482308K buffer_head
548064 547979 99% 0.02K 3806 144 15224K avtab_node
370496 368197 99% 0.03K 3308 112 13232K size-32
135534 105374 77% 0.55K 19362 7 77448K radix_tree_node
67946 51531 75% 0.07K 1282 53 5128K selinux_inode_security
57938 35717 61% 0.06K 982 59 3928K size-64
42620 42303 99% 0.19K 2131 20 8524K dentry
25132 25129 99% 1.00K 6283 4 25132K ext4_inode_cache
23600 23436 99% 0.19K 1180 20 4720K size-192
18225 18189 99% 0.14K 675 27 2700K sysfs_dir_cache
17062 15025 88% 0.20K 898 19 3592K vm_area_struct
16555 9899 59% 0.05K 215 77 860K anon_vma_chain
15456 15143 97% 0.62K 2576 6 10304K proc_inode_cache
14340 8881 61% 0.19K 717 20 2868K filp
12090 7545 62% 0.12K 403 30 1612K size-128
10770 8748 81% 0.25K 718 15 2872K skbuff_head_cache
10568 8365 79% 1.00K 2642 4 10568K size-1024
8924 5464 61% 0.04K 97 92 388K anon_vma
7038 6943 98% 0.58K 1173 6 4692K inode_cache
5192 4956 95% 2.00K 2596 2 10384K size-2048
3600 3427 95% 0.50K 450 8 1800K size-512
3498 3105 88% 0.07K 66 53 264K eventpoll_pwq
3390 3105 91% 0.12K 113 30 452K eventpoll_epi
3335 3239 97% 0.69K 667 5 2668K sock_inode_cache
2636 2612 99% 1.62K 659 4 5272K TCP
2380 1962 82% 0.11K 70 34 280K task_delay_info
2310 1951 84% 0.12K 77 30 308K pid
2136 2053 96% 0.44K 267 8 1068K ib_mad
1992 1947 97% 2.59K 664 3 5312K task_struct
1888 1506 79% 0.06K 32 59 128K tcp_bind_bucket
1785 1685 94% 0.25K 119 15 476K size-256
1743 695 39% 0.50K 249 7 996K skbuff_fclone_cache
1652 532 32% 0.06K 28 59 112K avc_node
1640 1175 71% 0.19K 82 20 328K cred_jar
1456 1264 86% 0.50K 182 8 728K task_xstate
1378 781 56% 0.07K 26 53 104K Acpi-Operand
1156 459 39% 0.11K 34 34 136K jbd2_journal_head
1050 983 93% 0.78K 210 5 840K shmem_inode_cache
1021 879 86% 4.00K 1021 1 4084K size-4096
1020 537 52% 0.19K 51 20 204K bio-0
1008 501 49% 0.02K 7 144 28K dm_target_io
920 463 50% 0.04K 10 92 40K dm_io
876 791 90% 1.00K 219 4 876K signal_cache
840 792 94% 2.06K 280 3 2240K sighand_cache
740 439 59% 0.10K 20 37 80K ext4_prealloc_space
736 658 89% 0.04K 8 92 32K Acpi-Namespace
720 283 39% 0.08K 15 48 60K blkdev_ioc
720 294 40% 0.02K 5 144 20K jbd2_journal_handle
708 131 18% 0.06K 12 59 48K fs_cache
630 429 68% 0.38K 63 10 252K ip_dst_cache
627 625 99% 8.00K 627 1 5016K size-8192
616 297 48% 0.13K 22 28 88K cfq_io_context
480 249 51% 0.23K 30 16 120K cfq_queue
370 330 89% 0.75K 74 5 296K UNIX
368 31 8% 0.04K 4 92 16K khugepaged_mm_slot
357 325 91% 0.53K 51 7 204K idr_layer_cache
341 128 37% 0.69K 31 11 248K files_cache
270 159 58% 0.12K 9 30 36K scsi_sense_cache
246 244 99% 1.81K 123 2 492K TCPv6
231 131 56% 0.34K 21 11 84K blkdev_requests
210 102 48% 1.38K 42 5 336K mm_struct
210 116 55% 0.25K 14 15 56K sgpool-8
202 14 6% 0.02K 1 202 4K jbd2_revoke_table
192 192 100% 32.12K 192 1 12288K kmem_cache
180 121 67% 0.25K 12 15 48K scsi_cmd_cache
170 113 66% 0.11K 5 34 20K inotify_inode_mark_entry
144 121 84% 0.16K 6 24 24K sigqueue
134 4 2% 0.05K 2 67 8K ext4_free_block_extents
118 26 22% 0.06K 2 59 8K fib6_nodes
112 2 1% 0.03K 1 112 4K ip_fib_alias
112 1 0% 0.03K 1 112 4K dnotify_struct
112 2 1% 0.03K 1 112 4K sd_ext_cdb