[ofw] [PATCH] checked ipoib_ndis6_cm shutdown race

Mon Oct 4 11:05:10 PDT 2010

Tzachi Dar wrote:
> Hi Stan,
>
> Please note that although the patch that you have created makes the
> processing of the receive buffers much faster it does not really fix
> the race.

True enough.
The actual cause of the race condition may be induced by the extra time required to tear-down existing RC connections; I realize this is future work, but nonetheless exists in my sandbox.
RC connection tear-down of QPs bound to a SRQ while waiting for the async error callback to run in order to flush the CQ is indeed a lengthy process (modeled after Linux IPoIB SRQ/QP drain code).

I am still not clear why you prefer to run the heavy-weight machinery designed for system fairness (IO work thread) in the QP error case?  Running the IO thread to flush CQ entries when the CQ callback routine knows the QP is in the error state with more CQ entries to process, seems like an invitation for additional race conditions?

With your proposal, we have the PNP port_down() callback thread which puts the QP in the error state then spins/sleeps on (p_port->(send_mgr.depth || p_port->recv_mgr.depth) meanwhile the CQ callback routine runs to flush the 1st 128 CQ entries in the error state (WC FLUSHED error), then the IO work thread is scheduled to process/flush the additional 384 CQ entries after the CQ callback routine exits; meanwhile competing with the port_down() spin/sleep thread.
I prefer to keep things simple - deal with the known facts immediately, not deferred.

That said, I defer to your IPoIB experience and will add your patch (removing mine) thus allowing the IO work thread to run in the QP error state to flush existing CQ entries.

stan.

>
> The proposed fix that I have added (doing the loop) after the modify
> qp should be fine since modify qp itself should only be called at
> passive level.
>
> By the way, I wander if the problem that you see (receive takes too
> long in the case of qp error) does not come from a different place
> altogether. If the qp is turned into the error state, than all
> buffers will be flushed with error. In the current implementation,
> they will be posted again very soon, but since the qp is in error,
> they will be flushed immediately again.
>
> In any case, we need to look at the root cause of the race and fix
> there. (I believe that my proposed fix should indeed fix it).
>
> Thanks
> Tzachi
>
>> -----Original Message-----
>> From: Smith, Stan [mailto:stan.smith at intel.com]
>> Sent: Wednesday, September 29, 2010 5:59 PM
>> To: Tzachi Dar; Alex Naslednikov
>> Cc: ofw at lists.openfabrics.org
>> Subject: RE: [PATCH] checked ipoib_ndis6_cm shutdown race
>>
>> Tzachi Dar wrote:
>>> It seems to me that although the problem that started this mail
>>> thread is real, the solution is not in the direction that the
>>> original writers of the code had in mind.
>>>
>>> The real problem here is that one can call ipoib_port_destroy, which
>>> calls __port_destroying which calls ipoib_port_down.
>>>
>>> On the code of ipoib_port_down we move the qp to error but we don't
>>> wait for all buffers to return.
>>>
>>> On the code of __ipoib_cleanup() we have the following loop that
>>> waits for the receive/send to end:
>>>
>>>       /* Wait for all sends and receives to get flushed. */
>>>       while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
>>>               cl_thread_suspend( 0 );
>>>
>>> I believe that this loop should be moved to right after the
>>> modify_qp and then we will know that the port is still alive while
>>> we are waiting for the buffers to be flushed with error.
>>>
>>> This will hopefully solve the problem without adding new locks to
>>> the code.
>>
>> Hello,
>>   I like your thinking w.r.t. no locks in speed path, although
>> suspending a possible Dispatch level thread seems dangerous for
>> creating additional race conditions. If the thread executing the
>> while() loop is not at dispatch, then this patch looks good.
>>
>> Stepping back from the original problem of the IO worker thread not
>> running or crashing, the real issue I see is detection of an error
>> case in the CQ callback routine. Error defined as QP has
>> transitioned to ERROR state for some reason (system shutdown, device
>> disable/remove, real live IB errors (cable unplugged)).
>>
>> By design the recv CQ callback handler recv_cb_internal() processes
>> the first 128 CQ entries (default CQ size is 512) and then if more
>> recv CQ entries are available to process, invokes the IO worker
>> thread to finish the CQ processing thus providing processor thread
>> scheduling fairness under heavy IPoIB traffic.
>>
>> My point is the recv CV callback handler should continue processing
>> CQ entries until empty for the error case and not invoke the IO
>> worker thread. The problem was my choice in how to detect the QP
>> error condition as it added a lock in the speed path.
>>
>> In recv_cb_internal() we already have the 'bad_list' as an error
>> detection mechanism (populated by recv_mgr_filter()).
>> Looking further into the recv CQ processing code, using a non-zero
>> size of the bad list is a way to recognize an error condition
>> without a lock.
>>
>> Normal case - bad_list_count == 0, invoke IO worker thread if needed.
>> Error case - bad_list_count > 0, keep processing CQ entries in same
>> thread until CQ empty. Error processing is now much faster without
>> IO thread problems.
>>
>> New patch:
>>
>> --- C:/Documents and Settings/scsmith/Local
>> Settings/Temp/ipoib_port.cpp-revBASE.svn000.tmp.cpp Wed Sep 29
>> 08:49:15 2010 +++ C:/Documents and Settings/scsmith/My
>> Documents/openIB-
>> windows/SVN/gen1/trunk/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>> Wed Sep 29 08:50:00 2010
>> @@ -2026,6 +2026,7 @@
>>         ULONG                           recv_complete_flags = 0;
>>         BOOLEAN                 res;
>>         BOOLEAN                         WorkToDo = FALSE;
>> +       size_t                  bad_list_cnt;
>>
>>         PERF_DECLARE( RecvCompBundle );
>>         PERF_DECLARE( RecvCb );
>> @@ -2097,6 +2098,7 @@
>>         p_port->recv_mgr.depth -= recv_cnt;
>>
>>         /* Return any discarded receives to the pool */
>> +       bad_list_cnt = cl_qlist_count( &bad_list );
>>         cl_perf_start( PutRecvList );
>>         __buf_mgr_put_recv_list( p_port, &bad_list );
>>         cl_perf_stop( &p_port->p_adapter->perf, PutRecvList ); @@
>>                 -2223,7 +2225,7 @@ CL_ASSERT( status == IB_SUCCESS );
>>
>>         } else {
>> -               if (h_cq) {
>> +               if (h_cq && bad_list_cnt == 0) {
>>                         // increment reference to ensure no one
>> release the object while work iteam is queued
>>                         ipoib_port_ref( p_port, ref_recv_cb );
>>                         IoQueueWorkItem( p_port->pPoWorkItem,
>> __iopoib_WorkItem, DelayedWorkQueue, p_port);
>> @@ -2245,9 +2247,13 @@
>>         IN              const   ib_cq_handle_t
>> h_cq,
>>         IN                              void
>> *cq_context )
>>  {
>> -       uint32_t recv_cnt;
>> +       uint32_t        recv_cnt;
>> +       boolean_t       WorkToDo;
>>
>> -       __recv_cb_internal(h_cq, cq_context, &recv_cnt); +       do
>> +       {
>> +               WorkToDo = __recv_cb_internal(h_cq, cq_context,
>> &recv_cnt); +       } while( rc );
>>  }
>>
>>
>> thoughts?
>>
>> stan.
>>
>>>
>>> Thanks
>>> Tzachi
>>>
>>>> -----Original Message-----
>>>> From: ofw-bounces at lists.openfabrics.org [mailto:ofw-
>>>> bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
>>>> Sent: Monday, September 27, 2010 9:25 PM
>>>> To: Alex Naslednikov
>>>> Cc: ofw at lists.openfabrics.org
>>>> Subject: Re: [ofw] [PATCH] checked ipoib_ndis6_cm shutdown race
>>>>
>>>> Alex Naslednikov wrote:
>>>>> Hello Stan,
>>>>> I liked the idea, but I suppose that one should use take here
>>>>> spinlock on p_port->obj to ensure that there's no race right after
>>>>>                  "if". The code should be like:
>>>>> } else {
>>>>>             cl_obj_lock( &p_port->obj );
>>>>>             if ( h_cq && p_port->state == IB_QPS_RTS ) { //
>>>>>                       increment reference to ensure no one release
>>>>>                       the object while work item is queued
>>>>>                       ipoib_port_ref( p_port, ref_recv_cb );
>>>>>                              IoQueueWorkItem( p_port->pPoWorkItem,
>>>>>                   __iopoib_WorkItem, DelayedWorkQueue, p_port);
>>>>>             WorkToDo = FALSE; } else {
>>>>>                 WorkToDo = TRUE;
>>>>>             }
>>>>>                 cl_obj_unlock( &p_port->obj );
>>>>>        }
>>>>
>>>> True, although I hate to see a lock introduced into the speed path.
>>>> Can you think of a way to keep the lock out of the speed path?
>>>> Holding the lock over the call to IoQueueWorkItem() is not
>>>> required; perhaps the following is good enough?
>>>>
>>>>         cl_obj_lock( &p_port->obj );
>>>>         pstate = p_port->state;
>>>>         cl_obj_unlock( &p_port->obj );
>>>>         if ( h_cq && pstate == IB_QPS_RTS ) {
>>>> ...
>>>>
>>>>
>>>>> In addition, I found another potentially unsafe place:
>>>>> Index:
>>>>> B:/users/xalex/MLNX_WinOF-
>>>> 2_1_2/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>>>> ===================================================================
>>>>> ---
>>>>>
>>>>> B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>>>> (revision 6513) +++
>>>>>
>>>> B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>>> (working
>>>>>               copy) @@ -7002,8 +7002,10 @@ ipoib_set_inactive(
>>>>> p_port->p_adapter ); __endpt_mgr_reset_all( p_port ); }
>>>>
>>>> I'm guessing you intended cl_obj_lock() here instead of unlock?
>>>>
>>>>> +             cl_obj_unlock( &p_port->obj
>>>>>               ); ASSERT(p_port->state == IB_QPS_INIT ||
>>>>>               p_port->state == IB_QPS_ERROR); p_port->state =
>>>>> IB_QPS_ERROR; +             cl_obj_unlock( &p_port->obj );
>>>>>               KeSetEvent( &p_port->sa_event, EVENT_INCREMENT,
>>>>> FALSE );       }
>>>>>
>>>>> -----Original Message-----
>>>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>>>> Sent: Wednesday, September 22, 2010 6:22 PM
>>>>> To: Alex Naslednikov
>>>>> Cc: ofw at lists.openfabrics.org; Uri Habusha
>>>>> Subject: RE: [PATCH] checked ipoib_ndis6_cm shutdown race
>>>>>
>>>>> Hello,
>>>>>   Any thoughts on this patch?
>>>>>
>>>>> Smith, Stan wrote:
>>>>>> Hello,
>>>>>>   During system shutdown I have witnessed a checked
>>>>>> ipoib_ndis6_cm IO work thread fail:
>>>>>>
>>>>>> 1) IO work thread is blocked from running due to scheduling
>>>>>> priorities beyond the point in time at which port_destroy() wants
>>>>>> to delete the port object [cl_obj_destroy( &p_port->obj 0)]. The
>>>>>> port object delete fails (ASSERT obj_ref > 0 fires) due to the
>>>>>> outstanding port references incurred by remaining posted recv
>>>>>> buffers. The 1st 128 WorkRequests have been pulled from the CQ by
>>>>>> __recv_cb_internal(), which then posts an IO work request to
>>>>>> process the remaining 384 recv work requests. The IO work request
>>>>>> does not run prior to port_detroy() being called.
>>>>>>
>>>>>> 2) The IO thread attempts to run but blows up (BSOD invalid
>>>>>> memory reference) as port structures required by the IO work
>>>>>> thread have been free()'ed.
>>>>>>
>>>>>> The fix is to recognize the port is not in the IB_QPS_RTS state,
>>>>>> do not schedule an IO work thread request and continue to pull
>>>>>> recv work requests from the CQ until empty.
>>>>>>
>>>>>> Code snippets:
>>>>>>       } else {
>>>>>>               if ( h_cq && p_port->state == IB_QPS_RTS ) {
>>>>>>                       // increment reference to ensure no one
>>>>>>                       release the object while work iteam is
>>>>>>                       queued ipoib_port_ref( p_port, ref_recv_cb
>>>>>>                       ); IoQueueWorkItem( p_port->pPoWorkItem,
>>>>>>                       __iopoib_WorkItem, DelayedWorkQueue,
>>>>>>               p_port); WorkToDo = FALSE;               } else {
>>>>>> WorkToDo = TRUE; } }
>>>>>>
>>>>>> __recv_cb(
>>>>>>       IN              const   ib_cq_handle_t
>>>>>>       h_cq, IN                              void
>>>>>>       *cq_context ) { uint32_t        recv_cnt;
>>>>>>       boolean_t       WorkToDo;
>>>>>>
>>>>>>       do
>>>>>>       {
>>>>>>               WorkToDo = __recv_cb_internal(h_cq, cq_context,
>>>>>> &recv_cnt);       } while( WorkToDo ); }
>>>>>>
>>>>>>
>>>>>> --- A/ulp/ipoib_ndis6_cm/kernel/ipoib_port.cpp        Mon Sep 13
>>>>>> 15:58:08 2010 +++ B/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>>>>> Mon Sep 20 08:47:08 2010 @@ -2222,7 +2222,7 @@
>>>>>>               CL_ASSERT( status == IB_SUCCESS );
>>>>>>
>>>>>>       } else {
>>>>>> -             if (h_cq) {
>>>>>> +             if ( h_cq && p_port->state == IB_QPS_RTS ) {
>>>>>>                       // increment reference to ensure no one
>>>>>>                       release the object while work iteam is
>>>>>>                       queued ipoib_port_ref( p_port, ref_recv_cb
>>>>>> );
>>>>>> IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem,
>>>>>> DelayedWorkQueue, p_port); @@ -2244,9 +2244,13 @@
>>>>>>       IN              const   ib_cq_handle_t
>>>>>>       h_cq, IN                              void
>>>>>> *cq_context )  { -     uint32_t recv_cnt;
>>>>>> +     uint32_t        recv_cnt;
>>>>>> +     boolean_t       WorkToDo;
>>>>>>
>>>>>> -     __recv_cb_internal(h_cq, cq_context, &recv_cnt); +     do +
>>>>>> { +             WorkToDo = __recv_cb_internal(h_cq, cq_context,
>>>>>>  &recv_cnt); +     } while( WorkToDo ); }
>>>>
>>>> _______________________________________________
>>>> ofw mailing list
>>>> ofw at lists.openfabrics.org
>>>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw