[ofw] [PATCH] checked ipoib_ndis6_cm shutdown race

Smith, Stan stan.smith at intel.com
Wed Sep 29 08:58:41 PDT 2010


Tzachi Dar wrote:
> It seems to me that although the problem that started this mail
> thread is real, the solution is not in the direction that the
> original writers of the code had in mind.
>
> The real problem here is that one can call ipoib_port_destroy, which
> calls __port_destroying which calls ipoib_port_down.
>
> On the code of ipoib_port_down we move the qp to error but we don't
> wait for all buffers to return.
>
> On the code of __ipoib_cleanup() we have the following loop that
> waits for the receive/send to end:
>
>       /* Wait for all sends and receives to get flushed. */
>       while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
>               cl_thread_suspend( 0 );
>
> I believe that this loop should be moved to right after the modify_qp
> and then we will know that the port is still alive while we are
> waiting for the buffers to be flushed with error.
>
> This will hopefully solve the problem without adding new locks to the
> code.

Hello,
  I like your thinking w.r.t. keeping locks out of the speed path, although suspending a thread that may be running at dispatch level seems dangerous and could create additional race conditions. If the thread executing the while() loop is not at dispatch level, then this approach looks good.
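
For reference, a rough sketch of the ordering you describe -- the flush-wait loop lifted out of __ipoib_cleanup() and placed in ipoib_port_down() right after the QP is moved to the error state. This is only an illustration built from the snippets quoted above (the modify-to-error step stands in for whatever call ipoib_port_down() already makes), and the IRQL concern about cl_thread_suspend() applies here as well:

        /* sketch only -- inside ipoib_port_down() */

        /* ... existing code that transitions the QP to the error state ... */

        /* Wait here, while the port object is still alive, for the HW to
         * flush all posted sends and receives back with error status.
         * cl_thread_suspend() must not be called at dispatch level.
         */
        while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
                cl_thread_suspend( 0 );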

Stepping back from the original problem of the IO worker thread not running or crashing, the real issue I see is detecting the error case in the CQ callback routine, where "error" means the QP has transitioned to the ERROR state for some reason (system shutdown, device disable/remove, or a real live IB error such as a cable pull).

By design the recv CQ callback handler __recv_cb_internal() processes the first 128 CQ entries (the default CQ size is 512) and then, if more recv CQ entries remain, queues the IO worker thread to finish the CQ processing, which provides processor scheduling fairness under heavy IPoIB traffic.
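
For anyone not reading ipoib_port.cpp alongside this mail, the existing flow is roughly the sketch below. It is assembled from the snippets quoted later in this thread; __poll_recv_batch() is only a placeholder for the real poll/filter/repost code, not a function in the driver:

        /* Sketch of the existing __recv_cb_internal() flow; illustrative only. */
        BOOLEAN         WorkToDo = FALSE;
        BOOLEAN         more_completions;

        /* Placeholder: pull up to 128 recv completions from the CQ, split them
         * into done_list/bad_list, repost the buffers, indicate packets to NDIS,
         * and report whether entries are still waiting in the CQ.
         */
        more_completions = __poll_recv_batch( p_port, h_cq );

        if( more_completions && h_cq )
        {
                /* Hand the remainder to the IO worker thread so one callback
                 * cannot monopolize the processor under heavy IPoIB traffic.
                 */
                ipoib_port_ref( p_port, ref_recv_cb );
                IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem,
                        DelayedWorkQueue, p_port );
        }
        else if( more_completions )
        {
                /* No CQ handle was passed in: the caller keeps polling instead. */
                WorkToDo = TRUE;
        }
        /* else: CQ drained, rearm it and return with WorkToDo = FALSE. */
        /* WorkToDo is returned to __recv_cb(). */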

My point is that in the error case the recv CQ callback handler should continue processing CQ entries until the CQ is empty, and not invoke the IO worker thread.
The problem with my original patch was how it detected the QP error condition, since it added a lock to the speed path.

In __recv_cb_internal() we already have the 'bad_list' as an error detection mechanism (populated by recv_mgr_filter()).
Looking further into the recv CQ processing code, a non-zero bad list count is a way to recognize the error condition without taking a lock.

Normal case: bad_list_cnt == 0, so queue the IO worker thread if needed.
Error case: bad_list_cnt > 0, so keep processing CQ entries in the same thread until the CQ is empty.
Error processing is now much faster, and the IO worker thread problems are avoided entirely.
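
Expressed as code, the decision the patch below makes (same identifiers as the diff; shown here only as a readable summary of the changed branch):

        if( h_cq && bad_list_cnt == 0 )
        {
                /* Normal case: defer the remaining completions to the IO worker. */
                ipoib_port_ref( p_port, ref_recv_cb );
                IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem,
                        DelayedWorkQueue, p_port );
                WorkToDo = FALSE;
        }
        else
        {
                /* Error case (or no CQ handle): keep draining the CQ in this
                 * thread; __recv_cb() loops on the returned WorkToDo until the
                 * CQ is empty.
                 */
                WorkToDo = TRUE;
        }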

New patch:

--- C:/Documents and Settings/scsmith/Local Settings/Temp/ipoib_port.cpp-revBASE.svn000.tmp.cpp Wed Sep 29 08:49:15 2010
+++ C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp       Wed Sep 29 08:50:00 2010
@@ -2026,6 +2026,7 @@
        ULONG                           recv_complete_flags = 0;
        BOOLEAN                 res;
        BOOLEAN                         WorkToDo = FALSE;
+       size_t                  bad_list_cnt;

        PERF_DECLARE( RecvCompBundle );
        PERF_DECLARE( RecvCb );
@@ -2097,6 +2098,7 @@
        p_port->recv_mgr.depth -= recv_cnt;

        /* Return any discarded receives to the pool */
+       bad_list_cnt = cl_qlist_count( &bad_list );
        cl_perf_start( PutRecvList );
        __buf_mgr_put_recv_list( p_port, &bad_list );
        cl_perf_stop( &p_port->p_adapter->perf, PutRecvList );
@@ -2223,7 +2225,7 @@
                CL_ASSERT( status == IB_SUCCESS );

        } else {
-               if (h_cq) {
+               if (h_cq && bad_list_cnt == 0) {
                        // increment reference to ensure no one release the object while work iteam is queued
                        ipoib_port_ref( p_port, ref_recv_cb );
                        IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem, DelayedWorkQueue, p_port);
@@ -2245,9 +2247,13 @@
        IN              const   ib_cq_handle_t                          h_cq,
        IN                              void                                            *cq_context )
 {
-       uint32_t recv_cnt;
+       uint32_t        recv_cnt;
+       boolean_t       WorkToDo;

-       __recv_cb_internal(h_cq, cq_context, &recv_cnt);
+       do
+       {
+               WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
+       } while( WorkToDo );
 }


thoughts?

stan.

>
> Thanks
> Tzachi
>
>> -----Original Message-----
>> From: ofw-bounces at lists.openfabrics.org [mailto:ofw-
>> bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
>> Sent: Monday, September 27, 2010 9:25 PM
>> To: Alex Naslednikov
>> Cc: ofw at lists.openfabrics.org
>> Subject: Re: [ofw] [PATCH] checked ipoib_ndis6_cm shutdown race
>>
>> Alex Naslednikov wrote:
>>> Hello Stan,
>>> I liked the idea, but I suppose that one should take a spinlock here
>>> on p_port->obj to ensure that there's no race right after the "if".
>>> The code should be like:
>>> } else {
>>>         cl_obj_lock( &p_port->obj );
>>>         if ( h_cq && p_port->state == IB_QPS_RTS ) {
>>>                 // increment reference to ensure no one release
>>>                 // the object while work item is queued
>>>                 ipoib_port_ref( p_port, ref_recv_cb );
>>>                 IoQueueWorkItem( p_port->pPoWorkItem,
>>>                         __iopoib_WorkItem, DelayedWorkQueue, p_port);
>>>                 WorkToDo = FALSE;
>>>         } else {
>>>                 WorkToDo = TRUE;
>>>         }
>>>         cl_obj_unlock( &p_port->obj );
>>> }
>>
>> True, although I hate to see a lock introduced into the speed path.
>> Can you think of a way to keep the lock out of the speed path?
>> Holding the lock over the call to IoQueueWorkItem() is not required;
>> perhaps the following is good enough?
>>
>>         cl_obj_lock( &p_port->obj );
>>         pstate = p_port->state;
>>         cl_obj_unlock( &p_port->obj );
>>         if ( h_cq && pstate == IB_QPS_RTS ) {
>> ...
>>
>>
>>> In addition, I found another potentially unsafe place:
>>> Index: B:/users/xalex/MLNX_WinOF-2_1_2/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
>>> ===================================================================
>>> --- B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp (revision 6513)
>>> +++ B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp (working copy)
>>> @@ -7002,8 +7002,10 @@
>>>                 ipoib_set_inactive( p_port->p_adapter );
>>>                 __endpt_mgr_reset_all( p_port );
>>>         }
>>
>> I'm guessing you intended cl_obj_lock() here instead of unlock?
>>
>>> +       cl_obj_unlock( &p_port->obj );
>>>         ASSERT(p_port->state == IB_QPS_INIT || p_port->state == IB_QPS_ERROR);
>>>         p_port->state = IB_QPS_ERROR;
>>> +       cl_obj_unlock( &p_port->obj );
>>>         KeSetEvent( &p_port->sa_event, EVENT_INCREMENT, FALSE );
>>>  }
>>>
>>> -----Original Message-----
>>> From: Smith, Stan [mailto:stan.smith at intel.com]
>>> Sent: Wednesday, September 22, 2010 6:22 PM
>>> To: Alex Naslednikov
>>> Cc: ofw at lists.openfabrics.org; Uri Habusha
>>> Subject: RE: [PATCH] checked ipoib_ndis6_cm shutdown race
>>>
>>> Hello,
>>>   Any thoughts on this patch?
>>>
>>> Smith, Stan wrote:
>>>> Hello,
>>>>   During system shutdown I have witnessed a checked ipoib_ndis6_cm
>>>> IO work thread fail:
>>>>
>>>> 1) IO work thread is blocked from running due to scheduling
>>>> priorities beyond the point in time at which port_destroy() wants
>>>> to delete the port object [cl_obj_destroy( &p_port->obj )]. The
>>>> port object delete fails (ASSERT obj_ref > 0 fires) due to the
>>>> outstanding port references incurred by remaining posted recv
>>>> buffers. The 1st 128 WorkRequests have been pulled from the CQ by
>>>> __recv_cb_internal(), which then posts an IO work request to
>>>> process the remaining 384 recv work requests. The IO work request
>>>> does not run prior to port_destroy() being called.
>>>>
>>>> 2) The IO thread attempts to run but blows up (BSOD invalid memory
>>>> reference) as port structures required by the IO work thread have
>>>> been free()'ed.
>>>>
>>>> The fix is to recognize the port is not in the IB_QPS_RTS state, do
>>>> not schedule an IO work thread request and continue to pull recv
>>>> work requests from the CQ until empty.
>>>>
>>>> Code snippets:
>>>>       } else {
>>>>               if ( h_cq && p_port->state == IB_QPS_RTS ) {
>>>>                       // increment reference to ensure no one
>>>>                       // release the object while work iteam is queued
>>>>                       ipoib_port_ref( p_port, ref_recv_cb );
>>>>                       IoQueueWorkItem( p_port->pPoWorkItem,
>>>>                               __iopoib_WorkItem, DelayedWorkQueue, p_port);
>>>>                       WorkToDo = FALSE;
>>>>               } else {
>>>>                       WorkToDo = TRUE;
>>>>               }
>>>>       }
>>>>
>>>> __recv_cb(
>>>>       IN              const   ib_cq_handle_t          h_cq,
>>>>       IN                      void                    *cq_context )
>>>> {
>>>>       uint32_t        recv_cnt;
>>>>       boolean_t       WorkToDo;
>>>>
>>>>       do
>>>>       {
>>>>               WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
>>>>       } while( WorkToDo );
>>>> }
>>>>
>>>>
>>>> --- A/ulp/ipoib_ndis6_cm/kernel/ipoib_port.cpp        Mon Sep 13 15:58:08 2010
>>>> +++ B/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp        Mon Sep 20 08:47:08 2010
>>>> @@ -2222,7 +2222,7 @@
>>>>               CL_ASSERT( status == IB_SUCCESS );
>>>>
>>>>       } else {
>>>> -             if (h_cq) {
>>>> +             if ( h_cq && p_port->state == IB_QPS_RTS ) {
>>>>                       // increment reference to ensure no one release the object while work iteam is queued
>>>>                       ipoib_port_ref( p_port, ref_recv_cb );
>>>>                       IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem, DelayedWorkQueue, p_port);
>>>> @@ -2244,9 +2244,13 @@
>>>>       IN              const   ib_cq_handle_t          h_cq,
>>>>       IN                      void                    *cq_context )
>>>>  {
>>>> -     uint32_t recv_cnt;
>>>> +     uint32_t        recv_cnt;
>>>> +     boolean_t       WorkToDo;
>>>>
>>>> -     __recv_cb_internal(h_cq, cq_context, &recv_cnt);
>>>> +     do
>>>> +     {
>>>> +             WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
>>>> +     } while( WorkToDo );
>>>>  }
>>
>> _______________________________________________
>> ofw mailing list
>> ofw at lists.openfabrics.org
>> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw



