[ewg] IPoIB-CM - ib0: dev_queue_xmit failed to requeue packet

Andrew McKinney am at sativa.org.uk
Tue Aug 6 12:08:06 PDT 2013


Hi,

We have been able to reproduce what we see in our environment using iperf
with many parallel threads. We see the TX drop counters increasing for the
bonded interface and we get a very occasional "ib0: dev_queue_xmit failed
to requeue packet" in dmesg.

We seem to be able to squash both of these issues by changing the IPoIB mode
from connected to datagram. When running in connected mode we can see the TX
drop counters increase considerably; they stop completely when we switch to
datagram mode, and start increasing again as soon as we switch back to
connected.
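
For reference, the reproduction is essentially the following (the host and
thread count here are just placeholders):

  iperf -s                               # on the subscriber side
  iperf -c 192.168.100.10 -P 64 -t 60    # publisher side, many parallel streams

while watching the bond0 counters and dmesg on both ends.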

I have a few questions around this:

1) I presume it is possible for us to run a mix of connected and datagram
mode hosts?
If we were to configure one of our publishing hosts to use datagram mode
(in anticipation of the TX drops stopping), would the rest of the tcp
subscribers running in connected mode continue to see the publishes? If the
issue is resolved here, we would intend to change mode to datagram across
our estate, but we would like to evaluate the change on a single host. The
tests I have been doing with iperf would suggest this is the case, although
iperf is continually creating new sockets.
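
For reference, the mode can be flipped per host at runtime via sysfs
(assuming the stock ib_ipoib sysfs interface), which is how I intend to
trial this:

  cat /sys/class/net/ib0/mode                # reports "connected" or "datagram"
  echo datagram > /sys/class/net/ib0/mode
  ip link set ib0 mtu 2044                   # datagram caps the MTU at IB MTU - 4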

2) We were using connected mode primarily for the larger MTU, expecting the
subscribing hosts to be under less load: fewer packets, less checksumming
overhead, etc. Are there any other gotchas to running in datagram mode; or,
put another way, what benefits does IPoIB-CM
provide and when should it be used? We're currently running a rather large
tcp matrix.
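
To put numbers on the MTU point: with our active_mtu of 2048 (ibv_devinfo
below), datagram mode limits the interface MTU to 2044 (the 2K IB MTU minus
the 4-byte IPoIB header), while connected mode allows up to 65520, e.g.:

  echo datagram > /sys/class/net/ib0/mode
  ip link set ib0 mtu 2044      # 2048 - 4-byte IPoIB header

  echo connected > /sys/class/net/ib0/mode
  ip link set ib0 mtu 65520     # IPoIB-CM maximum

At our ~450-byte message sizes, the larger MTU may matter less than we
originally assumed.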

3) When using IPoIB-CM, does anyone know of any limitations around the number
of active tcp sockets or, say, publishing threads? When we made our change
to use the ipoib interfaces instead of ethernet, we pretty much immediately
started to see page allocation failures, along with the TX drops and failed
to requeue packet errors. To me this suggests some kind of contention issue
when using connected mode, but this is probably wildly off the mark...
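
On the page allocation side: the failures are order:1 atomic allocations
from softirq context (ipoib_cm_alloc_rx_skb in the trace quoted below), so
one mitigation we are experimenting with is a larger free-memory reserve via
vm.min_free_kbytes - a sketch, with a value picked somewhat arbitrarily:

  sysctl -w vm.min_free_kbytes=131072                      # 128M reserve
  echo 'vm.min_free_kbytes = 131072' >> /etc/sysctl.conf   # persist

No idea yet whether this helps with the dev_queue_xmit messages.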

Any feedback appreciated.

Cheers,
-Andrew

On Mon, Aug 5, 2013 at 11:28 AM, Andrew McKinney <am at sativa.org.uk> wrote:

> I'm also seeing tx drops on bond0:
>
>
> bond0     Link encap:InfiniBand  HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
>           inet addr:192.168.100.10  Bcast:192.168.100.255
> Mask:255.255.255.0
>           inet6 addr: fe80::202:c903:57:2765/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:458926490 errors:0 dropped:1 overruns:0 frame:0
>           TX packets:547157428 errors:0 dropped:30978 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:18392139044 (17.1 GiB)  TX bytes:436943339339 (406.9
> GiB)
>
> I don't seem to be able to configure any ring buffers on the ib interfaces
> using ethtool - is there any other way of doing this?
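>
> Partially answering myself: the IPoIB driver does not appear to expose
> ethtool ring controls at all; the nearest equivalent I have found is the
> ib_ipoib module parameters, which only take effect on a module reload.
> A sketch, with sizes picked purely for illustration:
>
>   modinfo -p ib_ipoib            # lists send_queue_size / recv_queue_size
>   # in /etc/modprobe.d/ib_ipoib.conf:
>   options ib_ipoib send_queue_size=512 recv_queue_size=512
>
> Corrections welcome if there is a runtime knob I have missed.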
>
> Thanks,
> -Andrew
>
>
>
>
> On Mon, Aug 5, 2013 at 10:42 AM, Andrew McKinney <am at sativa.org.uk> wrote:
>
>> Hi list.
>>
>> We're running a TCP middleware over IPoIB-CM (OFED-3.5-2) on Red Hat 6.4.
>> We intend to eventually run a multicast RDMA middleware on the stack.
>>
>> The hardware stack is Dell R720s (some Westmere, mostly Sandy Bridge)
>> with bonded Mellanox MT26428 ConnectX-2 on two QLogic 12300 managed
>> switches. We're running the latest firmware on the HCAs and the switches.
>>
>> We have been seeing the following messages in the kernel ring buffer,
>> which also seem to coincide with page allocation errors:
>>
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> ib0: dev_queue_xmit failed to requeue packet
>> java: page allocation failure. order:1, mode:0x20
>> Pid: 24410, comm: java Tainted: P           ---------------
>> 2.6.32-279.el6.x86_64 #1
>> Call Trace:
>>  <IRQ>  [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
>>  [<ffffffff81489c00>] ? tcp_rcv_established+0x290/0x800
>>  [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
>>  [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
>>  [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
>>  [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
>>  [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
>>  [<ffffffff811635bf>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
>>  [<ffffffff811637fb>] ? __kmalloc_node+0x7b/0x100
>>  [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
>>  [<ffffffff8143028d>] ? dev_alloc_skb+0x1d/0x40
>>  [<ffffffffa0673f90>] ? ipoib_cm_alloc_rx_skb+0x30/0x430 [ib_ipoib]
>>  [<ffffffffa067523f>] ? ipoib_cm_handle_rx_wc+0x29f/0x770 [ib_ipoib]
>>  [<ffffffffa018c828>] ? mlx4_ib_poll_cq+0xa8/0x890 [mlx4_ib]
>>  [<ffffffffa066c01d>] ? ipoib_ib_completion+0x2d/0x30 [ib_ipoib]
>>  [<ffffffffa066d80b>] ? ipoib_poll+0xdb/0x190 [ib_ipoib]
>>  [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
>>  [<ffffffff8143f193>] ? net_rx_action+0x103/0x2f0
>>  [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
>>  [<ffffffff810db800>] ? handle_IRQ_event+0x60/0x170
>>  [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
>>  [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
>>  [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
>>  [<ffffffff81505af5>] ? do_IRQ+0x75/0xf0
>>  [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
>>  <EOI>
>>
>> These appear to be genuine drops, as we are seeing gaps in our middleware,
>> which then goes on to re-cap.
>>
>> We've just made a change to increase the page cache from ~90M to 128M -
>> but what is the list's feeling on the dev_queue_xmit errors? Could they be
>> caused by the same issue - an inability to allocate pages in a timely
>> manner, perhaps?
>>
>> We're not running at anywhere near high message rates (<1000 messages/sec
>> at ~450 bytes each).
>>
>> I can see a thread started in 2012 where someone had triggered these
>> dev_queue_xmit errors using netperf, and Roland had suggested that at
>> worst one packet was being dropped. Silence after that.
>>
>> Has anyone seen this behavior, or got any pointers to chase this down?
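>>
>> One thing we have started watching is memory fragmentation, since the
>> failure is an order:1 allocation - for example:
>>
>>   watch -n1 cat /proc/buddyinfo    # per-order free page counts
>>
>> to see whether order>=1 contiguous pages are running dry under load.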
>>
>> Cheers,
>> -Andrew
>>
>> ibv_devinfo
>>
>> hca_id:    mlx4_1
>>     transport:            InfiniBand (0)
>>     fw_ver:                2.9.1000
>>     node_guid:            0002:c903:0057:2250
>>     sys_image_guid:            0002:c903:0057:2253
>>     vendor_id:            0x02c9
>>     vendor_part_id:            26428
>>     hw_ver:                0xB0
>>     board_id:            MT_0D90110009
>>     phys_port_cnt:            1
>>     max_mr_size:            0xffffffffffffffff
>>     page_size_cap:            0xfffffe00
>>     max_qp:                163776
>>     max_qp_wr:            16351
>>     device_cap_flags:        0x007c9c76
>>     max_sge:            32
>>     max_sge_rd:            0
>>     max_cq:                65408
>>     max_cqe:            4194303
>>     max_mr:                524272
>>     max_pd:                32764
>>     max_qp_rd_atom:            16
>>     max_ee_rd_atom:            0
>>     max_res_rd_atom:        2620416
>>     max_qp_init_rd_atom:        128
>>     max_ee_init_rd_atom:        0
>>     atomic_cap:            ATOMIC_HCA (1)
>>     max_ee:                0
>>     max_rdd:            0
>>     max_mw:                0
>>     max_raw_ipv6_qp:        0
>>     max_raw_ethy_qp:        0
>>     max_mcast_grp:            8192
>>     max_mcast_qp_attach:        248
>>     max_total_mcast_qp_attach:    2031616
>>     max_ah:                0
>>     max_fmr:            0
>>     max_srq:            65472
>>     max_srq_wr:            16383
>>     max_srq_sge:            31
>>     max_pkeys:            128
>>     local_ca_ack_delay:        15
>>         port:    1
>>             state:            PORT_ACTIVE (4)
>>             max_mtu:        4096 (5)
>>             active_mtu:        2048 (4)
>>             sm_lid:            1
>>             port_lid:        9
>>             port_lmc:        0x00
>>             link_layer:        InfiniBand
>>             max_msg_sz:        0x40000000
>>             port_cap_flags:        0x02510868
>>             max_vl_num:        4 (3)
>>             bad_pkey_cntr:        0x0
>>             qkey_viol_cntr:        0x0
>>             sm_sl:            0
>>             pkey_tbl_len:        128
>>             gid_tbl_len:        128
>>             subnet_timeout:        17
>>             init_type_reply:    0
>>             active_width:        4X (2)
>>             active_speed:        10.0 Gbps (4)
>>             phys_state:        LINK_UP (5)
>>             GID[  0]:        fe80:0000:0000:0000:0002:c903:0057:2251
>>
>> hca_id:    mlx4_0
>>     transport:            InfiniBand (0)
>>     fw_ver:                2.9.1000
>>     node_guid:            0002:c903:0057:2764
>>     sys_image_guid:            0002:c903:0057:2767
>>     vendor_id:            0x02c9
>>     vendor_part_id:            26428
>>     hw_ver:                0xB0
>>     board_id:            MT_0D90110009
>>     phys_port_cnt:            1
>>     max_mr_size:            0xffffffffffffffff
>>     page_size_cap:            0xfffffe00
>>     max_qp:                163776
>>     max_qp_wr:            16351
>>     device_cap_flags:        0x007c9c76
>>     max_sge:            32
>>     max_sge_rd:            0
>>     max_cq:                65408
>>     max_cqe:            4194303
>>     max_mr:                524272
>>     max_pd:                32764
>>     max_qp_rd_atom:            16
>>     max_ee_rd_atom:            0
>>     max_res_rd_atom:        2620416
>>     max_qp_init_rd_atom:        128
>>     max_ee_init_rd_atom:        0
>>     atomic_cap:            ATOMIC_HCA (1)
>>     max_ee:                0
>>     max_rdd:            0
>>     max_mw:                0
>>     max_raw_ipv6_qp:        0
>>     max_raw_ethy_qp:        0
>>     max_mcast_grp:            8192
>>     max_mcast_qp_attach:        248
>>     max_total_mcast_qp_attach:    2031616
>>     max_ah:                0
>>     max_fmr:            0
>>     max_srq:            65472
>>     max_srq_wr:            16383
>>     max_srq_sge:            31
>>     max_pkeys:            128
>>     local_ca_ack_delay:        15
>>         port:    1
>>             state:            PORT_ACTIVE (4)
>>             max_mtu:        4096 (5)
>>             active_mtu:        2048 (4)
>>             sm_lid:            1
>>             port_lid:        10
>>             port_lmc:        0x00
>>             link_layer:        InfiniBand
>>             max_msg_sz:        0x40000000
>>             port_cap_flags:        0x02510868
>>             max_vl_num:        4 (3)
>>             bad_pkey_cntr:        0x0
>>             qkey_viol_cntr:        0x0
>>             sm_sl:            0
>>             pkey_tbl_len:        128
>>             gid_tbl_len:        128
>>             subnet_timeout:        17
>>             init_type_reply:    0
>>             active_width:        4X (2)
>>             active_speed:        10.0 Gbps (4)
>>             phys_state:        LINK_UP (5)
>>             GID[  0]:        fe80:0000:0000:0000:0002:c903:0057:2765
>>
>>
>> slabtop
>>
>>  Active / Total Objects (% used)    : 3436408 / 5925284 (58.0%)
>>  Active / Total Slabs (% used)      : 178659 / 178867 (99.9%)
>>  Active / Total Caches (% used)     : 117 / 193 (60.6%)
>>  Active / Total Size (% used)       : 422516.74K / 692339.54K (61.0%)
>>  Minimum / Average / Maximum Object : 0.02K / 0.12K / 4096.00K
>>
>>   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
>> 4461349 2084881  46%    0.10K 120577       37    482308K buffer_head
>> 548064 547979  99%    0.02K   3806      144     15224K avtab_node
>> 370496 368197  99%    0.03K   3308      112     13232K size-32
>> 135534 105374  77%    0.55K  19362        7     77448K radix_tree_node
>>  67946  51531  75%    0.07K   1282       53      5128K selinux_inode_security
>>  57938  35717  61%    0.06K    982       59      3928K size-64
>>  42620  42303  99%    0.19K   2131       20      8524K dentry
>>  25132  25129  99%    1.00K   6283        4     25132K ext4_inode_cache
>>  23600  23436  99%    0.19K   1180       20      4720K size-192
>>  18225  18189  99%    0.14K    675       27      2700K sysfs_dir_cache
>>  17062  15025  88%    0.20K    898       19      3592K vm_area_struct
>>  16555   9899  59%    0.05K    215       77       860K anon_vma_chain
>>  15456  15143  97%    0.62K   2576        6     10304K proc_inode_cache
>>  14340   8881  61%    0.19K    717       20      2868K filp
>>  12090   7545  62%    0.12K    403       30      1612K size-128
>>  10770   8748  81%    0.25K    718       15      2872K skbuff_head_cache
>>  10568   8365  79%    1.00K   2642        4     10568K size-1024
>>   8924   5464  61%    0.04K     97       92       388K anon_vma
>>   7038   6943  98%    0.58K   1173        6      4692K inode_cache
>>   5192   4956  95%    2.00K   2596        2     10384K size-2048
>>   3600   3427  95%    0.50K    450        8      1800K size-512
>>   3498   3105  88%    0.07K     66       53       264K eventpoll_pwq
>>   3390   3105  91%    0.12K    113       30       452K eventpoll_epi
>>   3335   3239  97%    0.69K    667        5      2668K sock_inode_cache
>>   2636   2612  99%    1.62K    659        4      5272K TCP
>>   2380   1962  82%    0.11K     70       34       280K task_delay_info
>>   2310   1951  84%    0.12K     77       30       308K pid
>>   2136   2053  96%    0.44K    267        8      1068K ib_mad
>>   1992   1947  97%    2.59K    664        3      5312K task_struct
>>   1888   1506  79%    0.06K     32       59       128K tcp_bind_bucket
>>   1785   1685  94%    0.25K    119       15       476K size-256
>>   1743    695  39%    0.50K    249        7       996K skbuff_fclone_cache
>>   1652    532  32%    0.06K     28       59       112K avc_node
>>   1640   1175  71%    0.19K     82       20       328K cred_jar
>>   1456   1264  86%    0.50K    182        8       728K task_xstate
>>   1378    781  56%    0.07K     26       53       104K Acpi-Operand
>>   1156    459  39%    0.11K     34       34       136K jbd2_journal_head
>>   1050    983  93%    0.78K    210        5       840K shmem_inode_cache
>>   1021    879  86%    4.00K   1021        1      4084K size-4096
>>   1020    537  52%    0.19K     51       20       204K bio-0
>>   1008    501  49%    0.02K      7      144        28K dm_target_io
>>    920    463  50%    0.04K     10       92        40K dm_io
>>    876    791  90%    1.00K    219        4       876K signal_cache
>>    840    792  94%    2.06K    280        3      2240K sighand_cache
>>    740    439  59%    0.10K     20       37        80K ext4_prealloc_space
>>    736    658  89%    0.04K      8       92        32K Acpi-Namespace
>>    720    283  39%    0.08K     15       48        60K blkdev_ioc
>>    720    294  40%    0.02K      5      144        20K jbd2_journal_handle
>>    708    131  18%    0.06K     12       59        48K fs_cache
>>    630    429  68%    0.38K     63       10       252K ip_dst_cache
>>    627    625  99%    8.00K    627        1      5016K size-8192
>>    616    297  48%    0.13K     22       28        88K cfq_io_context
>>    480    249  51%    0.23K     30       16       120K cfq_queue
>>    370    330  89%    0.75K     74        5       296K UNIX
>>    368     31   8%    0.04K      4       92        16K khugepaged_mm_slot
>>    357    325  91%    0.53K     51        7       204K idr_layer_cache
>>    341    128  37%    0.69K     31       11       248K files_cache
>>    270    159  58%    0.12K      9       30        36K scsi_sense_cache
>>    246    244  99%    1.81K    123        2       492K TCPv6
>>    231    131  56%    0.34K     21       11        84K blkdev_requests
>>    210    102  48%    1.38K     42        5       336K mm_struct
>>    210    116  55%    0.25K     14       15        56K sgpool-8
>>    202     14   6%    0.02K      1      202         4K jbd2_revoke_table
>>    192    192 100%   32.12K    192        1     12288K kmem_cache
>>    180    121  67%    0.25K     12       15        48K scsi_cmd_cache
>>    170    113  66%    0.11K      5       34        20K inotify_inode_mark_entry
>>    144    121  84%    0.16K      6       24        24K sigqueue
>>    134      4   2%    0.05K      2       67         8K ext4_free_block_extents
>>    118     26  22%    0.06K      2       59         8K fib6_nodes
>>    112      2   1%    0.03K      1      112         4K ip_fib_alias
>>    112      1   0%    0.03K      1      112         4K dnotify_struct
>>    112      2   1%    0.03K      1      112         4K sd_ext_cdb
>>
>>
>