<div dir="ltr"><div><div><div><div><div><div><div><div>Hi,<br><br>We have been able to reproduce what we see in our environment using iperf with many parallel threads. We see the TX drop counters increasing for the bonded interface and we get a very occasional "ib0: dev_queue_xmit failed to requeue packet" in dmesg.<br>

We seem to be able to squash both of these issues by changing the IPoIB mode from connected to datagram. When running in connected mode we can see the TX drop counters increase considerably, but they stop completely when we switch to datagram mode. On changing back to connected, the counters start increasing again.
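
In case it matters, we are flipping the mode with the per-interface sysfs knob (this assumes the stock ib_ipoib sysfs interface on RHEL 6.4, and a 2044-byte datagram MTU to match our 2048 active_mtu):

    # check the current mode on each slave
    cat /sys/class/net/ib0/mode
    # switch to datagram on the fly, then drop the MTU to the UD limit
    echo datagram > /sys/class/net/ib0/mode
    ip link set ib0 mtu 2044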

I have a few questions around this:

1) I presume it is possible for us to run both connected and datagram mode side by side? If we were to configure one of our publishing hosts to use datagram mode (in anticipation of the TX drops stopping), would the rest of the TCP subscribers running in connected mode continue to see the publishes? If the issue is resolved there, we would intend to change the mode to datagram across our estate, but we would like to evaluate the change on a single host first. The tests I have been doing with iperf suggest this is the case, although iperf is continually creating new sockets.

2) We were using connected mode primarily for the larger MTU, the expectation being that the subscribing hosts would be under less load from fewer packets, checksumming overhead and so on. Are there any other gotchas when running in datagram mode? Or, put another way, what benefits does IPoIB-CM provide and when should it be used? We're currently running a rather large TCP matrix.

3) When using IPoIB-CM, does anyone know of any limitations around the number of active TCP sockets or, say, publishing threads? When we made our change to use the IPoIB interfaces instead of Ethernet, we pretty much immediately started to see page allocation failures, along with the TX drops and the failed-to-requeue-packet errors. To me this suggests some kind of contention issue when using connected mode, but this is probably wildly off the mark...

Any feedback appreciated.

Cheers,
-Andrew

On Mon, Aug 5, 2013 at 11:28 AM, Andrew McKinney <am@sativa.org.uk> wrote:
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div>I'm also seeing tx drops on bond0:<br><br><br>bond0 Link encap:InfiniBand HWaddr 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00 <br>
inet addr:192.168.100.10 Bcast:192.168.100.255 Mask:255.255.255.0<br>
inet6 addr: fe80::202:c903:57:2765/64 Scope:Link<br> UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1<br> RX packets:458926490 errors:0 dropped:1 overruns:0 frame:0<br> TX packets:547157428 errors:0 dropped:30978 overruns:0 carrier:0<br>
collisions:0 txqueuelen:0 <br> RX bytes:18392139044 (17.1 GiB) TX bytes:436943339339 (406.9 GiB)<br><br></div>I don't seem to be able to configure any ring buffers on the ib interfaces using ethtool - is there any other way of doing this?<br>
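
The closest thing I have found so far is the ib_ipoib module's queue-size parameters; a sketch of what I mean, assuming the send_queue_size/recv_queue_size options that modinfo ib_ipoib lists, and that reloading the module is acceptable (the values below are illustrative):

    # current values
    cat /sys/module/ib_ipoib/parameters/send_queue_size
    cat /sys/module/ib_ipoib/parameters/recv_queue_size
    # persist larger rings and reload the module
    echo "options ib_ipoib send_queue_size=512 recv_queue_size=512" > /etc/modprobe.d/ipoib.conf
    modprobe -r ib_ipoib && modprobe ib_ipoib

I haven't yet confirmed whether this has any effect on the TX drops.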

Thanks,
-Andrew

On Mon, Aug 5, 2013 at 10:42 AM, Andrew McKinney <am@sativa.org.uk> wrote:
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div><div><div><div><div><div><div><div>Hi list.<br><br>We're running a TCP middleware over IPoIB-CM (OFED-3.5-2) on Red Hat 6.4. We intend to eventually run a multicast RDMA middleware on the stack.<br>
<br></div>The hardware stack is Dell R720s (some Westmere, mostly Sandy Bridge) with bonded Mellanox MT26428 ConnectX-2 on two QLogc 12300 managed switches. We're runnign the latest firmware on the HCAs and the switches.<br>

We have been seeing the following messages in the kernel ring buffer, which also seem to coincide with page allocation errors:

ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
ib0: dev_queue_xmit failed to requeue packet
java: page allocation failure. order:1, mode:0x20
Pid: 24410, comm: java Tainted: P --------------- 2.6.32-279.el6.x86_64 #1
Call Trace:
 <IRQ>  [<ffffffff8112759f>] ? __alloc_pages_nodemask+0x77f/0x940
 [<ffffffff81489c00>] ? tcp_rcv_established+0x290/0x800
 [<ffffffff81161d62>] ? kmem_getpages+0x62/0x170
 [<ffffffff8116297a>] ? fallback_alloc+0x1ba/0x270
 [<ffffffff811623cf>] ? cache_grow+0x2cf/0x320
 [<ffffffff811626f9>] ? ____cache_alloc_node+0x99/0x160
 [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
 [<ffffffff811635bf>] ? kmem_cache_alloc_node_notrace+0x6f/0x130
 [<ffffffff811637fb>] ? __kmalloc_node+0x7b/0x100
 [<ffffffff8143014d>] ? __alloc_skb+0x6d/0x190
 [<ffffffff8143028d>] ? dev_alloc_skb+0x1d/0x40
 [<ffffffffa0673f90>] ? ipoib_cm_alloc_rx_skb+0x30/0x430 [ib_ipoib]
 [<ffffffffa067523f>] ? ipoib_cm_handle_rx_wc+0x29f/0x770 [ib_ipoib]
 [<ffffffffa018c828>] ? mlx4_ib_poll_cq+0xa8/0x890 [mlx4_ib]
 [<ffffffffa066c01d>] ? ipoib_ib_completion+0x2d/0x30 [ib_ipoib]
 [<ffffffffa066d80b>] ? ipoib_poll+0xdb/0x190 [ib_ipoib]
 [<ffffffff810600bc>] ? try_to_wake_up+0x24c/0x3e0
 [<ffffffff8143f193>] ? net_rx_action+0x103/0x2f0
 [<ffffffff81073ec1>] ? __do_softirq+0xc1/0x1e0
 [<ffffffff810db800>] ? handle_IRQ_event+0x60/0x170
 [<ffffffff8100c24c>] ? call_softirq+0x1c/0x30
 [<ffffffff8100de85>] ? do_softirq+0x65/0xa0
 [<ffffffff81073ca5>] ? irq_exit+0x85/0x90
 [<ffffffff81505af5>] ? do_IRQ+0x75/0xf0
 [<ffffffff8100ba53>] ? ret_from_intr+0x0/0x11
 <EOI>

These appear to be genuine drops, as we are seeing gaps in our middleware, which then goes on to re-cap.

We've just made a change to increase the page cache from ~90M to 128M - but what is the list's feeling on the dev_queue_xmit errors? Could they be caused by the same issue - an inability to allocate pages in a timely manner, perhaps?
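
(For the curious: a minimal sketch of that kind of change, assuming vm.min_free_kbytes is the relevant knob for keeping headroom for these atomic order:1 allocations; 131072 is just the 128M figure above expressed in KB.)

    # apply at runtime
    sysctl -w vm.min_free_kbytes=131072
    # persist across reboots
    echo "vm.min_free_kbytes = 131072" >> /etc/sysctl.conf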

We're not running at anywhere near high message rates (fewer than 1000 messages per second at ~450 bytes each).

I can see a thread started in 2012 where someone triggered these dev_queue_xmit messages using netperf, and Roland had suggested that at worst one packet was being dropped. Silence after that.

Has anyone seen this behavior, or got any pointers to chase this down?

Cheers,
-Andrew

ibv_devinfo

hca_id: mlx4_1
        transport: InfiniBand (0)
        fw_ver: 2.9.1000
        node_guid: 0002:c903:0057:2250
        sys_image_guid: 0002:c903:0057:2253
        vendor_id: 0x02c9
        vendor_part_id: 26428
        hw_ver: 0xB0
        board_id: MT_0D90110009
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffe00
        max_qp: 163776
        max_qp_wr: 16351
        device_cap_flags: 0x007c9c76
        max_sge: 32
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 4194303
        max_mr: 524272
        max_pd: 32764
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 2620416
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 8192
        max_mcast_qp_attach: 248
        max_total_mcast_qp_attach: 2031616
        max_ah: 0
        max_fmr: 0
        max_srq: 65472
        max_srq_wr: 16383
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 15
        port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 2048 (4)
                sm_lid: 1
                port_lid: 9
                port_lmc: 0x00
                link_layer: InfiniBand
                max_msg_sz: 0x40000000
                port_cap_flags: 0x02510868
                max_vl_num: 4 (3)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 128
                gid_tbl_len: 128
                subnet_timeout: 17
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 10.0 Gbps (4)
                phys_state: LINK_UP (5)
                GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2251

hca_id: mlx4_0
        transport: InfiniBand (0)
        fw_ver: 2.9.1000
        node_guid: 0002:c903:0057:2764
        sys_image_guid: 0002:c903:0057:2767
        vendor_id: 0x02c9
        vendor_part_id: 26428
        hw_ver: 0xB0
        board_id: MT_0D90110009
        phys_port_cnt: 1
        max_mr_size: 0xffffffffffffffff
        page_size_cap: 0xfffffe00
        max_qp: 163776
        max_qp_wr: 16351
        device_cap_flags: 0x007c9c76
        max_sge: 32
        max_sge_rd: 0
        max_cq: 65408
        max_cqe: 4194303
        max_mr: 524272
        max_pd: 32764
        max_qp_rd_atom: 16
        max_ee_rd_atom: 0
        max_res_rd_atom: 2620416
        max_qp_init_rd_atom: 128
        max_ee_init_rd_atom: 0
        atomic_cap: ATOMIC_HCA (1)
        max_ee: 0
        max_rdd: 0
        max_mw: 0
        max_raw_ipv6_qp: 0
        max_raw_ethy_qp: 0
        max_mcast_grp: 8192
        max_mcast_qp_attach: 248
        max_total_mcast_qp_attach: 2031616
        max_ah: 0
        max_fmr: 0
        max_srq: 65472
        max_srq_wr: 16383
        max_srq_sge: 31
        max_pkeys: 128
        local_ca_ack_delay: 15
        port: 1
                state: PORT_ACTIVE (4)
                max_mtu: 4096 (5)
                active_mtu: 2048 (4)
                sm_lid: 1
                port_lid: 10
                port_lmc: 0x00
                link_layer: InfiniBand
                max_msg_sz: 0x40000000
                port_cap_flags: 0x02510868
                max_vl_num: 4 (3)
                bad_pkey_cntr: 0x0
                qkey_viol_cntr: 0x0
                sm_sl: 0
                pkey_tbl_len: 128
                gid_tbl_len: 128
                subnet_timeout: 17
                init_type_reply: 0
                active_width: 4X (2)
                active_speed: 10.0 Gbps (4)
                phys_state: LINK_UP (5)
                GID[ 0]: fe80:0000:0000:0000:0002:c903:0057:2765

slabtop

 Active / Total Objects (% used)    : 3436408 / 5925284 (58.0%)
 Active / Total Slabs (% used)      : 178659 / 178867 (99.9%)
 Active / Total Caches (% used)     : 117 / 193 (60.6%)
 Active / Total Size (% used)       : 422516.74K / 692339.54K (61.0%)
 Minimum / Average / Maximum Object : 0.02K / 0.12K / 4096.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
4461349 2084881 46% 0.10K 120577 37 482308K buffer_head
548064 547979 99% 0.02K 3806 144 15224K avtab_node
370496 368197 99% 0.03K 3308 112 13232K size-32
135534 105374 77% 0.55K 19362 7 77448K radix_tree_node
67946 51531 75% 0.07K 1282 53 5128K selinux_inode_security
57938 35717 61% 0.06K 982 59 3928K size-64
42620 42303 99% 0.19K 2131 20 8524K dentry
25132 25129 99% 1.00K 6283 4 25132K ext4_inode_cache
23600 23436 99% 0.19K 1180 20 4720K size-192
18225 18189 99% 0.14K 675 27 2700K sysfs_dir_cache
17062 15025 88% 0.20K 898 19 3592K vm_area_struct
16555 9899 59% 0.05K 215 77 860K anon_vma_chain
15456 15143 97% 0.62K 2576 6 10304K proc_inode_cache
14340 8881 61% 0.19K 717 20 2868K filp
12090 7545 62% 0.12K 403 30 1612K size-128
10770 8748 81% 0.25K 718 15 2872K skbuff_head_cache
10568 8365 79% 1.00K 2642 4 10568K size-1024
8924 5464 61% 0.04K 97 92 388K anon_vma
7038 6943 98% 0.58K 1173 6 4692K inode_cache
5192 4956 95% 2.00K 2596 2 10384K size-2048
3600 3427 95% 0.50K 450 8 1800K size-512
3498 3105 88% 0.07K 66 53 264K eventpoll_pwq
3390 3105 91% 0.12K 113 30 452K eventpoll_epi
3335 3239 97% 0.69K 667 5 2668K sock_inode_cache
2636 2612 99% 1.62K 659 4 5272K TCP
2380 1962 82% 0.11K 70 34 280K task_delay_info
2310 1951 84% 0.12K 77 30 308K pid
2136 2053 96% 0.44K 267 8 1068K ib_mad
1992 1947 97% 2.59K 664 3 5312K task_struct
1888 1506 79% 0.06K 32 59 128K tcp_bind_bucket
1785 1685 94% 0.25K 119 15 476K size-256
1743 695 39% 0.50K 249 7 996K skbuff_fclone_cache
1652 532 32% 0.06K 28 59 112K avc_node
1640 1175 71% 0.19K 82 20 328K cred_jar
1456 1264 86% 0.50K 182 8 728K task_xstate
1378 781 56% 0.07K 26 53 104K Acpi-Operand
1156 459 39% 0.11K 34 34 136K jbd2_journal_head
1050 983 93% 0.78K 210 5 840K shmem_inode_cache
1021 879 86% 4.00K 1021 1 4084K size-4096
1020 537 52% 0.19K 51 20 204K bio-0
1008 501 49% 0.02K 7 144 28K dm_target_io
920 463 50% 0.04K 10 92 40K dm_io
876 791 90% 1.00K 219 4 876K signal_cache
840 792 94% 2.06K 280 3 2240K sighand_cache
740 439 59% 0.10K 20 37 80K ext4_prealloc_space
736 658 89% 0.04K 8 92 32K Acpi-Namespace
720 283 39% 0.08K 15 48 60K blkdev_ioc
720 294 40% 0.02K 5 144 20K jbd2_journal_handle
708 131 18% 0.06K 12 59 48K fs_cache
630 429 68% 0.38K 63 10 252K ip_dst_cache
627 625 99% 8.00K 627 1 5016K size-8192
616 297 48% 0.13K 22 28 88K cfq_io_context
480 249 51% 0.23K 30 16 120K cfq_queue
370 330 89% 0.75K 74 5 296K UNIX
368 31 8% 0.04K 4 92 16K khugepaged_mm_slot
357 325 91% 0.53K 51 7 204K idr_layer_cache
341 128 37% 0.69K 31 11 248K files_cache
270 159 58% 0.12K 9 30 36K scsi_sense_cache
246 244 99% 1.81K 123 2 492K TCPv6
231 131 56% 0.34K 21 11 84K blkdev_requests
210 102 48% 1.38K 42 5 336K mm_struct
210 116 55% 0.25K 14 15 56K sgpool-8
202 14 6% 0.02K 1 202 4K jbd2_revoke_table
192 192 100% 32.12K 192 1 12288K kmem_cache
180 121 67% 0.25K 12 15 48K scsi_cmd_cache
170 113 66% 0.11K 5 34 20K inotify_inode_mark_entry
144 121 84% 0.16K 6 24 24K sigqueue
134 4 2% 0.05K 2 67 8K ext4_free_block_extents
118 26 22% 0.06K 2 59 8K fib6_nodes
112 2 1% 0.03K 1 112 4K ip_fib_alias
112 1 0% 0.03K 1 112 4K dnotify_struct
112 2 1% 0.03K 1 112 4K sd_ext_cdb