[ewg] IPoIB-CM - ib0: dev_queue_xmit failed to requeue packet
Andrew McKinney
am at sativa.org.uk
Tue Aug 6 12:08:06 PDT 2013
Hi,
We have been able to reproduce what we see in our environment using iperf
with many parallel threads. We see the TX drop counters increasing for the
bonded interface and we get a very occasional "ib0: dev_queue_xmit failed
to requeue packet" in dmesg.
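For completeness, the reproduction is roughly the following (the peer address and thread count are just what we happened to use, nothing significant):

iperf -s                                   # on one host
iperf -c <peer-ib-address> -P 64 -t 300    # on the other: 64 parallel TCP streams for 5 minutes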
We seem to be able to squash both of these issues by changing the IPoIB mode from
connected to datagram. When running in connected mode we can see the TX drop
counters increase considerably, but they stop completely when we switch to
datagram mode. On changing back to connected mode, the counters start increasing again.
I have a few questions around this:
1) I presume it is possible for us to run both connected and datagram mode
side by side on the same fabric? If we were to configure one of our publishing
hosts to use datagram mode (in anticipation of the TX drops stopping), would
the rest of the TCP subscribers running in connected mode continue to see the
publishes? If the issue is resolved there, we would intend to change to datagram
mode across our estate, but we would like to evaluate the change on a single
host first (a rough sketch of what we would change on that host is below, after
question 3). The tests I have been doing with iperf suggest this works, although
iperf is continually creating new sockets.
2) We were using connected mode primarily for the larger MTU, the expectation
being that the subscribing hosts would be under less load: fewer packets, less
checksumming overhead, etc. Are there any other gotchas with running in
datagram mode? Or, put another way, what benefits does IPoIB-CM
provide and when should it be used? We're currently running a rather large
TCP matrix.
3) When using IPoIB-CM, does anyone know of any limitations around the number
of active TCP sockets or, say, publishing threads? When we made the change
to use the IPoIB interfaces instead of Ethernet, we pretty much immediately
started to see page allocation failures, along with the TX drops and
failed-to-requeue-packet errors. To me this suggests some kind of contention issue
when using connected mode, but that is probably wildly off the mark...
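For context, the sort of per-host change I have in mind for (1) and (2), and what I have been checking for (3), is roughly the following. ib0 and the MTU value come from our setup (active IB MTU 2048); the CONNECTED_MODE ifcfg option is my reading of the Red Hat/OFED scripts rather than something I have verified:

# switch a single host to datagram mode at runtime (repeat for each bond slave)
echo datagram > /sys/class/net/ib0/mode
ip link set ib0 mtu 2044      # datagram MTU = IB MTU (2048) minus the 4-byte IPoIB header
cat /sys/class/net/ib0/mode   # confirm; connected mode would normally allow an MTU up to 65520

# to persist across reboots on RHEL 6, I believe it is CONNECTED_MODE=no and MTU=2044 in
# /etc/sysconfig/network-scripts/ifcfg-ib0 -- again, my assumption of the right knob

# for (3): the failures are order:1 (8 KB contiguous), so we have been watching fragmentation with
cat /proc/buddyinfo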
Any feedback appreciated.
Cheers,
-Andrew
On Mon, Aug 5, 2013 at 11:28 AM, Andrew McKinney <am at sativa.org.uk> wrote:
> I'm also seeing tx drops on bond0:
>
>
> bond0     Link encap:InfiniBand  HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:192.168.100.10  Bcast:192.168.100.255  Mask:255.255.255.0
>           inet6 addr: fe80::202:c903:57:2765/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:458926490 errors:0 dropped:1 overruns:0 frame:0
>           TX packets:547157428 errors:0 dropped:30978 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:18392139044 (17.1 GiB)  TX bytes:436943339339 (406.9 GiB)
>
> I don't seem to be able to configure any ring buffers on the ib interfaces
> using ethtool - is there any other way of doing this?
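> The closest thing to a ring-buffer knob I've found so far is the ib_ipoib
> module parameters rather than ethtool; the names below are from modinfo on our
> boxes, so please correct me if this isn't the right way to size the IPoIB queues:
>
> # /etc/modprobe.d/ib_ipoib.conf, then reload the module
> options ib_ipoib send_queue_size=512 recv_queue_size=512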
>
> Thanks,
> -Andrew
>
>
>
>
> On Mon, Aug 5, 2013 at 10:42 AM, Andrew McKinney <am at sativa.org.uk> wrote:
>
>> Hi list.
>>
>> We're running a TCP middleware over IPoIB-CM (OFED-3.5-2) on Red Hat 6.4.
>> We intend to eventually run a multicast RDMA middleware on the stack.
>>
>> The hardware stack is Dell R720s (some Westmere, mostly Sandy Bridge)
>> with bonded Mellanox MT26428 ConnectX-2 HCAs on two QLogic 12300 managed
>> switches. We're running the latest firmware on the HCAs and the switches.
>>
>> We have been seeing the following messages in the kernel ring, which also
>> seems to coincide with page allocation errors:
>>
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> java: page allocation failure. order:1, mode:0x20
>> Pid: 24410, comm: java Tainted: P ---------------
>> 2.6.32-279.el6.x86_64 #1
>> Call Trace:
>> <IRQ> [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
>> [<ffffffff81489c00>] ? tcp_rcv_established+0x290/0x800
>> [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
>> [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
>> [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
>> [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
>> [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
>> [<ffffffff811635bf>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
>> [<ffffffff811637fb>] ? __kmalloc_node+0x7b/0x100
>> [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
>> [<ffffffff8143028d>] ? dev_alloc_skb+0x1d/0x40
>> [<ffffffffa0673f90>] ? ipoib_cm_alloc_rx_skb+0x30/0x430 [ib_ipoib]
>> [<ffffffffa067523f>] ? ipoib_cm_handle_rx_wc+0x29f/0x770 [ib_ipoib]
>> [<ffffffffa018c828>] ? mlx4_ib_poll_cq+0xa8/0x890 [mlx4_ib]
>> [<ffffffffa066c01d>] ? ipoib_ib_completion+0x2d/0x30 [ib_ipoib]
>> [<ffffffffa066d80b>] ? ipoib_poll+0xdb/0x190 [ib_ipoib]
>> [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
>> [<ffffffff8143f193>] ? net_rx_action+0x103/0x2f0
>> [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
>> [<ffffffff810db800>] ? handle_IRQ_event+0x60/0x170
>> [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
>> [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
>> [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
>> [<ffffffff81505af5>] ? do_IRQ+0x75/0xf0
>> [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
>> <EOI>
>>
>> These appear to be genuine drops, as we are seeing gaps in our middleware,
>> which then goes on to re-cap.
>>
>> We've just made a change to increase the page cache from ~90M to 128M -
>> but what is the list's feeling on the dev_queue_xmit errors? Could they be
>> caused by the same issue? Unable to allocate pages in a timely manner,
>> perhaps?
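>> Concretely, is bumping vm.min_free_kbytes the right sort of lever for this,
>> i.e. something like the following (the value is just our ~128M figure in kilobytes)?
>>
>> sysctl -w vm.min_free_kbytes=131072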
>>
>> We're not running at anywhere near high message rates (<1000 messages per second of ~450 bytes).
>>
>> I can see a thread started in 2012 where someone had triggered these
>> dev_queue_xmit messages using netperf, and Roland had suggested that at worst
>> one packet was being dropped. Silence after that.
>>
>> Has anyone seen this behavior, or got any pointers to chase this down?
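>> In case anyone wants to compare numbers, we're watching the drops via the
>> standard netdev counters, e.g.:
>>
>> cat /sys/class/net/ib0/statistics/tx_dropped
>> cat /sys/class/net/bond0/statistics/tx_dropped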
>>
>> Cheers,
>> -Andrew
>>
>> ibv_devinfo
>>
>> hca_id: mlx4_1
>> transport: InfiniBand (0)
>> fw_ver: 2.9.1000
>> node_guid: 0002:c903:0057:2250
>> sys_image_guid: 0002:c903:0057:2253
>> vendor_id: 0x02c9
>> vendor_part_id: 26428
>> hw_ver: 0xB0
>> board_id: MT_0D90110009
>> phys_port_cnt: 1
>> max_mr_size: 0xffffffffffffffff
>> page_size_cap: 0xfffffe00
>> max_qp: 163776
>> max_qp_wr: 16351
>> device_cap_flags: 0x007c9c76
>> max_sge: 32
>> max_sge_rd: 0
>> max_cq: 65408
>> max_cqe: 4194303
>> max_mr: 524272
>> max_pd: 32764
>> max_qp_rd_atom: 16
>> max_ee_rd_atom: 0
>> max_res_rd_atom: 2620416
>> max_qp_init_rd_atom: 128
>> max_ee_init_rd_atom: 0
>> atomic_cap: ATOMIC_HCA (1)
>> max_ee: 0
>> max_rdd: 0
>> max_mw: 0
>> max_raw_ipv6_qp: 0
>> max_raw_ethy_qp: 0
>> max_mcast_grp: 8192
>> max_mcast_qp_attach: 248
>> max_total_mcast_qp_attach: 2031616
>> max_ah: 0
>> max_fmr: 0
>> max_srq: 65472
>> max_srq_wr: 16383
>> max_srq_sge: 31
>> max_pkeys: 128
>> local_ca_ack_delay: 15
>> port: 1
>> state: PORT_ACTIVE (4)
>> max_mtu: 4096 (5)
>> active_mtu: 2048 (4)
>> sm_lid: 1
>> port_lid: 9
>> port_lmc: 0x00
>> link_layer: InfiniBand
>> max_msg_sz: 0x40000000
>> port_cap_flags: 0x02510868
>> max_vl_num: 4 (3)
>> bad_pkey_cntr: 0x0
>> qkey_viol_cntr: 0x0
>> sm_sl: 0
>> pkey_tbl_len: 128
>> gid_tbl_len: 128
>> subnet_timeout: 17
>> init_type_reply: 0
>> active_width: 4X (2)
>> active_speed: 10.0 Gbps (4)
>> phys_state: LINK_UP (5)
>> GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2251
>>
>> hca_id: mlx4_0
>> transport: InfiniBand (0)
>> fw_ver: 2.9.1000
>> node_guid: 0002:c903:0057:2764
>> sys_image_guid: 0002:c903:0057:2767
>> vendor_id: 0x02c9
>> vendor_part_id: 26428
>> hw_ver: 0xB0
>> board_id: MT_0D90110009
>> phys_port_cnt: 1
>> max_mr_size: 0xffffffffffffffff
>> page_size_cap: 0xfffffe00
>> max_qp: 163776
>> max_qp_wr: 16351
>> device_cap_flags: 0x007c9c76
>> max_sge: 32
>> max_sge_rd: 0
>> max_cq: 65408
>> max_cqe: 4194303
>> max_mr: 524272
>> max_pd: 32764
>> max_qp_rd_atom: 16
>> max_ee_rd_atom: 0
>> max_res_rd_atom: 2620416
>> max_qp_init_rd_atom: 128
>> max_ee_init_rd_atom: 0
>> atomic_cap: ATOMIC_HCA (1)
>> max_ee: 0
>> max_rdd: 0
>> max_mw: 0
>> max_raw_ipv6_qp: 0
>> max_raw_ethy_qp: 0
>> max_mcast_grp: 8192
>> max_mcast_qp_attach: 248
>> max_total_mcast_qp_attach: 2031616
>> max_ah: 0
>> max_fmr: 0
>> max_srq: 65472
>> max_srq_wr: 16383
>> max_srq_sge: 31
>> max_pkeys: 128
>> local_ca_ack_delay: 15
>> port: 1
>> state: PORT_ACTIVE (4)
>> max_mtu: 4096 (5)
>> active_mtu: 2048 (4)
>> sm_lid: 1
>> port_lid: 10
>> port_lmc: 0x00
>> link_layer: InfiniBand
>> max_msg_sz: 0x40000000
>> port_cap_flags: 0x02510868
>> max_vl_num: 4 (3)
>> bad_pkey_cntr: 0x0
>> qkey_viol_cntr: 0x0
>> sm_sl: 0
>> pkey_tbl_len: 128
>> gid_tbl_len: 128
>> subnet_timeout: 17
>> init_type_reply: 0
>> active_width: 4X (2)
>> active_speed: 10.0 Gbps (4)
>> phys_state: LINK_UP (5)
>> GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2765
>>
>>
>> slabtop
>>
>> Active / Total Objects (% used) : 3436408 / 5925284 (58.0%)
>> Active / Total Slabs (% used) : 178659 / 178867 (99.9%)
>> Active / Total Caches (% used) : 117 / 193 (60.6%)
>> Active / Total Size (% used) : 422516.74K / 692339.54K (61.0%)
>> Minimum / Average / Maximum Object : 0.02K / 0.12K / 4096.00K
>>
>> OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
>> 4461349 2084881 46% 0.10K 120577 37 482308K buffer_head
>> 548064 547979 99% 0.02K 3806 144 15224K avtab_node
>> 370496 368197 99% 0.03K 3308 112 13232K size-32
>> 135534 105374 77% 0.55K 19362 7 77448K radix_tree_node
>> 67946 51531 75% 0.07K 1282 53 5128K selinux_inode_security
>> 57938 35717 61% 0.06K 982 59 3928K size-64
>> 42620 42303 99% 0.19K 2131 20 8524K dentry
>> 25132 25129 99% 1.00K 6283 4 25132K ext4_inode_cache
>> 23600 23436 99% 0.19K 1180 20 4720K size-192
>> 18225 18189 99% 0.14K 675 27 2700K sysfs_dir_cache
>> 17062 15025 88% 0.20K 898 19 3592K vm_area_struct
>> 16555 9899 59% 0.05K 215 77 860K anon_vma_chain
>> 15456 15143 97% 0.62K 2576 6 10304K proc_inode_cache
>> 14340 8881 61% 0.19K 717 20 2868K filp
>> 12090 7545 62% 0.12K 403 30 1612K size-128
>> 10770 8748 81% 0.25K 718 15 2872K skbuff_head_cache
>> 10568 8365 79% 1.00K 2642 4 10568K size-1024
>> 8924 5464 61% 0.04K 97 92 388K anon_vma
>> 7038 6943 98% 0.58K 1173 6 4692K inode_cache
>> 5192 4956 95% 2.00K 2596 2 10384K size-2048
>> 3600 3427 95% 0.50K 450 8 1800K size-512
>> 3498 3105 88% 0.07K 66 53 264K eventpoll_pwq
>> 3390 3105 91% 0.12K 113 30 452K eventpoll_epi
>> 3335 3239 97% 0.69K 667 5 2668K sock_inode_cache
>> 2636 2612 99% 1.62K 659 4 5272K TCP
>> 2380 1962 82% 0.11K 70 34 280K task_delay_info
>> 2310 1951 84% 0.12K 77 30 308K pid
>> 2136 2053 96% 0.44K 267 8 1068K ib_mad
>> 1992 1947 97% 2.59K 664 3 5312K task_struct
>> 1888 1506 79% 0.06K 32 59 128K tcp_bind_bucket
>> 1785 1685 94% 0.25K 119 15 476K size-256
>> 1743 695 39% 0.50K 249 7 996K skbuff_fclone_cache
>> 1652 532 32% 0.06K 28 59 112K avc_node
>> 1640 1175 71% 0.19K 82 20 328K cred_jar
>> 1456 1264 86% 0.50K 182 8 728K task_xstate
>> 1378 781 56% 0.07K 26 53 104K Acpi-Operand
>> 1156 459 39% 0.11K 34 34 136K jbd2_journal_head
>> 1050 983 93% 0.78K 210 5 840K shmem_inode_cache
>> 1021 879 86% 4.00K 1021 1 4084K size-4096
>> 1020 537 52% 0.19K 51 20 204K bio-0
>> 1008 501 49% 0.02K 7 144 28K dm_target_io
>> 920 463 50% 0.04K 10 92 40K dm_io
>> 876 791 90% 1.00K 219 4 876K signal_cache
>> 840 792 94% 2.06K 280 3 2240K sighand_cache
>> 740 439 59% 0.10K 20 37 80K ext4_prealloc_space
>> 736 658 89% 0.04K 8 92 32K Acpi-Namespace
>> 720 283 39% 0.08K 15 48 60K blkdev_ioc
>> 720 294 40% 0.02K 5 144 20K jbd2_journal_handle
>> 708 131 18% 0.06K 12 59 48K fs_cache
>> 630 429 68% 0.38K 63 10 252K ip_dst_cache
>> 627 625 99% 8.00K 627 1 5016K size-8192
>> 616 297 48% 0.13K 22 28 88K cfq_io_context
>> 480 249 51% 0.23K 30 16 120K cfq_queue
>> 370 330 89% 0.75K 74 5 296K UNIX
>> 368 31 8% 0.04K 4 92 16K khugepaged_mm_slot
>> 357 325 91% 0.53K 51 7 204K idr_layer_cache
>> 341 128 37% 0.69K 31 11 248K files_cache
>> 270 159 58% 0.12K 9 30 36K scsi_sense_cache
>> 246 244 99% 1.81K 123 2 492K TCPv6
>> 231 131 56% 0.34K 21 11 84K blkdev_requests
>> 210 102 48% 1.38K 42 5 336K mm_struct
>> 210 116 55% 0.25K 14 15 56K sgpool-8
>> 202 14 6% 0.02K 1 202 4K jbd2_revoke_table
>> 192 192 100% 32.12K 192 1 12288K kmem_cache
>> 180 121 67% 0.25K 12 15 48K scsi_cmd_cache
>> 170 113 66% 0.11K 5 34 20K inotify_inode_mark_entry
>> 144 121 84% 0.16K 6 24 24K sigqueue
>> 134 4 2% 0.05K 2 67 8K ext4_free_block_extents
>> 118 26 22% 0.06K 2 59 8K fib6_nodes
>> 112 2 1% 0.03K 1 112 4K ip_fib_alias
>> 112 1 0% 0.03K 1 112 4K dnotify_struct
>> 112 2 1% 0.03K 1 112 4K sd_ext_cdb
>>
>>
>