[ofw] [IPOIB_NDIS6_CM] enhance wc linking loop performance by removing array index calculations

Thu Aug 26 16:32:41 PDT 2010

Tzachi Dar wrote:
> Please commit.

Done in svn 2877

>
> Thanks
> Tzachi
>
>> -----Original Message-----
>> From: Smith, Stan [mailto:stan.smith at intel.com]
>> Sent: Monday, August 23, 2010 8:55 PM
>> To: Tzachi Dar; Hefty, Sean; Alex Naslednikov
>> Cc: ofw at lists.openfabrics.org
>> Subject: RE: [IPOIB_NDIS6_CM] enhance wc linking loop performance by
>> removing array index calculations
>>
>> Tzachi Dar wrote:
>>> By the way, by experiments that xalex has done in the past he saw
>>> that actually the fastest way to initialize this lists was to
>>> create the list once on another place and then do a memcopy of the
>>> result every time you need the list.
>>>
>>> That said, I'm not sure that we want to go in that way. The best way
>>> to improve performance here is very likely to drop the linked lists
>>> and move to an array...
>>
>> Yes - an array interface would be the best.
>>
>> As a side note, using Sean's loop test pgm (modified to closer
>> resemble ipoib wc init code), built in the WDK (x64) environment
>> with no additional optimization switches, Sean's results are
>> replicated:
>> The wc pointer loop is slightly faster than using array indexes.
>>
>> MF:\Test-pgms\Loop-Timing>date /t & bin\amd64\lt.exe Mon 08/23/2010
>> loops 10000000 s 95622189567 e 95636499284 f 14318180
>> f1 0.99941
>> f2 0.78953
>>
>> where f1 ==
>>         for (i = 0; i < LOOPS; i++) {
>>                 struct ibv_wc wc[MAX_SEND_WC];
>>                 int i;
>>
>>                 for (i = 0; i < MAX_SEND_WC; i++) {
>>                         wc[i].next = &wc[i + 1];
>>                 }
>>                 wc[MAX_SEND_WC - 1].next = NULL;
>>         }
>>
>> f2 ==
>>         for (i = 0; i < LOOPS; i++) {
>>                 struct ibv_wc wc[MAX_SEND_WC];
>>                 struct ibv_wc *p;
>>
>>                 for (p = wc; p < &wc[MAX_SEND_WC - 1]; p++ ) {
>>                         p->next = p + 1;
>>                 }
>>                 p->next = NULL;
>>         }
>>
>> I beleve the ipoib loop performance patch is valid to commit,
>> although not critical.
>>
>> stan.
>>
>>>
>>> Thanks
>>> Tzachi
>>>
>>>> -----Original Message-----
>>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>>> Sent: Monday, August 09, 2010 11:23 PM
>>>> To: Tzachi Dar
>>>> Cc: ofw at lists.openfabrics.org
>>>> Subject: RE: [IPOIB_NDIS6_CM] enhance wc linking loop performance
>>>> by removing array index calculations
>>>>
>>>> Tzachi Dar wrote:
>>>>> Have you been actually being able to measure a difference?
>>>>>
>>>>> I believe that your code is better, but I wander if it really has
>>>>> an affect that we can measure.
>>>>
>>>> In the following code sequence modified to use pointers (with
>>>> Sean's observations)
>>>>
>>>> for( i = 0; i < MAX_SEND_WC; i++ )
>>>>      wc[i].p_next = &wc[i + 1];
>>>> wc[MAX_SEND_WC - 1].p_next = NULL;
>>>>
>>>> for( p_free=wc; p_free < &wc[MAX_SEND_WC - 1]; p_free++ )
>>>>      p_free->p_next = p_free + 1;
>>>> p_free->p_next = NULL;
>>>>
>>>> If the MS WDK compiler optimizations are really 'good', it might
>>>> optimize the loops to basically the same instruction sequence.  I
>>>> do not believe this to be the case.
>>>>
>>>> The slightly faster execution time is based on the observation that
>>>> the total number of instructions executed is reduced by skipping
>>>> the array index arithmetic by use of pointers.
>>>>
>>>> Since these loops live in the Tx & Rx speed paths, every little
>>>> bit helps.
>>>>
>>>>
>>>>>
>>>>> Thanks
>>>>> Tzachi
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>>>>> Sent: Wednesday, August 04, 2010 8:43 PM
>>>>>> To: Tzachi Dar
>>>>>> Cc: ofw at lists.openfabrics.org
>>>>>> Subject: [IPOIB_NDIS6_CM] enhance wc linking loop performance by
>>>>>> removing array index calculations
>>>>>>
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> While reading IPOIB code I noticed a minor speed enhancement in
>>>>>> CQ callback routines. When linking WC (work complete) items into
>>>>>> a list, by removing the array index calculations in favor of
>>>>>> pointer arithmetic the loop will execute slightly faster.
>>>>>>
>>>>>> Worth a commit?
>>>>>>
>>>>>> stan.
>>>>>>
>>>>>> --- A/ulp/ipoib_NDIS6_CM/kernel/ipoib_endpoint.cpp      Wed Aug
>>>>>> 04 10:30:43 2010 +++
>>>>>> B/ulp/ipoib_NDIS6_CM/kernel/ipoib_endpoint.cpp Wed Aug 04
>>>>>>                 10:28:59 2010 @@ -888,9 +888,10 @@
>>>>>> p_port->p_adapter->p_ifc->modify_qp( p_endpt-
>>>>>>> conn.h_send_qp, &mod_attr );
>>>>>>                 p_port->p_adapter->p_ifc->modify_qp( p_endpt-
>>>>>>> conn.h_recv_qp, &mod_attr );
>>>>>>
>>>>>> -               for( i = 0; i < MAX_RECV_WC; i++ )
>>>>>> -                       wc[i].p_next = &wc[i + 1];
>>>>>> -               wc[MAX_RECV_WC - 1].p_next = NULL;
>>>>>> +       for( p_free_wc=wc; p_free_wc < &wc[MAX_RECV_WC];
>>>>>> p_free_wc++ ) +               p_free_wc->p_next = p_free_wc + 1;
>>>>>> + +       (--p_free_wc)->p_next = NULL;
>>>>>>
>>>>>>                 do
>>>>>>                 {
>>>>>>
>>>>>> --- A/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp  Wed Aug 04
>>>>>> 10:29:33 2010 +++ B/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>>>>> Wed Aug 04 10:28:31 2010 @@ -1987,7 +1987,6 @@
>>>>>>         ib_wc_t                         wc[MAX_RECV_WC], *p_free,
>>>>>>         *p_wc; int32_t                         NBL_cnt, recv_cnt
>>>>>>         = 0, shortage, discarded; cl_qlist_t
>>>>>> done_list, bad_list; -       size_t                          i;
>>>>>>         ULONG                           recv_complete_flags = 0;
>>>>>>         BOOLEAN                         res;
>>>>>>
>>>>>> @@ -2017,9 +2016,11 @@
>>>>>>         cl_qlist_init( &bad_list );
>>>>>>
>>>>>>         ipoib_port_ref( p_port, ref_recv_cb );
>>>>>> -       for( i = 0; i < MAX_RECV_WC; i++ )
>>>>>> -               wc[i].p_next = &wc[i + 1];
>>>>>> -       wc[MAX_RECV_WC - 1].p_next = NULL;
>>>>>> +
>>>>>> +       for( p_free=wc; p_free < &wc[MAX_RECV_WC]; p_free++ )
>>>>>> +               p_free->p_next = p_free + 1;
>>>>>> +
>>>>>> +       (--p_free)->p_next = NULL;
>>>>>>
>>>>>>         /*
>>>>>>          * We'll be accessing the endpoint map so take a
>>>>>>         reference @@ -5769,7 +5770,6 @@ cl_qlist_t
>>>>>>         done_list; ipoib_endpt_t           *p_endpt;
>>>>>>         ip_stat_sel_t           type;
>>>>>> -       size_t                          i;
>>>>>>         NET_BUFFER                      *p_netbuffer = NULL;
>>>>>>         ipoib_send_NB_SG        *s_buf;
>>>>>>
>>>>>> @@ -5798,9 +5798,10 @@
>>>>>>
>>>>>>
>>>>>> //cl_qlist_check_validity(&p_port->send_mgr.pending_list);
>>>>>> ipoib_port_ref( p_port, ref_send_cb );
>>>>>>
>>>>>> -       for( i = 0; i < MAX_SEND_WC; i++ )
>>>>>> -               wc[i].p_next = &wc[i + 1];
>>>>>> -       wc[MAX_SEND_WC - 1].p_next = NULL;
>>>>>> +       for( p_free=wc; p_free < &wc[MAX_SEND_WC]; p_free++ )
>>>>>> +               p_free->p_next = p_free + 1;
>>>>>> +
>>>>>> +       (--p_free)->p_next = NULL;
>>>>>>
>>>>>>         do
>>>>>>         {