[ofw] [PATCH] checked ipoib_ndis6_cm shutdown race

Tzachi Dar tzachid at mellanox.co.il
Sun Oct 3 06:25:34 PDT 2010


Hi Stan,

Please note that although the patch that you have created makes the processing of the receive buffers much faster, it does not really fix the race.

The fix that I proposed (running the wait loop right after the modify QP call) should be fine, since modify QP itself should only be called at passive level.

By the way, I wonder if the problem that you see (receive processing taking too long in the QP error case) does not come from a different place altogether. If the QP is moved into the error state, then all buffers will be flushed with error. In the current implementation they will be posted again very soon, but since the QP is in error, they will immediately be flushed again.

In any case, we need to look at the root cause of the race and fix it there. (I believe that my proposed fix should indeed do that.)
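
To make this concrete, here is roughly what I have in mind for the port-down path, sketched from memory. The exact modify-QP call, the qp_mod setup and field names such as p_port->ib_mgr.h_qp may differ a bit from what is actually in ipoib_port.cpp, so treat it as an outline rather than a ready patch:

        /* Sketch only -- exact names are from memory, not from the tree. */
        ib_qp_mod_t             qp_mod;
        ib_api_status_t         status;

        /* Move the QP to the error state so all outstanding WRs flush back. */
        cl_memclr( &qp_mod, sizeof(ib_qp_mod_t) );
        qp_mod.req_state = IB_QPS_ERROR;
        status = ib_modify_qp( p_port->ib_mgr.h_qp, &qp_mod );
        CL_ASSERT( status == IB_SUCCESS );

        /*
         * Wait here, while the port object is still fully alive, for the
         * flushed sends and receives to drain.  This runs at passive level,
         * which is also a precondition of the modify QP call above.
         */
        while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
                cl_thread_suspend( 0 );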

Thanks
Tzachi

> -----Original Message-----
> From: Smith, Stan [mailto:stan.smith at intel.com]
> Sent: Wednesday, September 29, 2010 5:59 PM
> To: Tzachi Dar; Alex Naslednikov
> Cc: ofw at lists.openfabrics.org
> Subject: RE: [PATCH] checked ipoib_ndis6_cm shutdown race
> 
> Tzachi Dar wrote:
> > It seems to me that although the problem that started this mail
> > thread is real, the solution is not in the direction that the
> > original writers of the code had in mind.
> >
> > The real problem here is that one can call ipoib_port_destroy, which
> > calls __port_destroying which calls ipoib_port_down.
> >
> > In the code of ipoib_port_down we move the QP to error, but we don't
> > wait for all buffers to return.
> >
> > In the code of __ipoib_cleanup() we have the following loop that
> > waits for the sends/receives to end:
> >
> >       /* Wait for all sends and receives to get flushed. */
> >       while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
> >               cl_thread_suspend( 0 );
> >
> > I believe that this loop should be moved to right after the modify_qp
> > and then we will know that the port is still alive while we are
> > waiting for the buffers to be flushed with error.
> >
> > This will hopefully solve the problem without adding new locks to the
> > code.
> 
> Hello,
>   I like your thinking w.r.t. no locks in the speed path, although
> suspending a thread that may be running at dispatch level seems
> dangerous and risks creating additional race conditions. If the thread
> executing the while() loop is not at dispatch level, then this patch
> looks good.
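> 
> Something like the following guard (just a sketch; KeGetCurrentIrql()
> is the standard DDK check, and the loop is the one you quoted from
> __ipoib_cleanup()) would at least make the passive-level assumption
> explicit:
> 
>         /* Sketch: the flush-wait loop is legal only at passive level. */
>         CL_ASSERT( KeGetCurrentIrql() == PASSIVE_LEVEL );
> 
>         /* Wait for all sends and receives to get flushed. */
>         while( p_port->send_mgr.depth || p_port->recv_mgr.depth )
>                 cl_thread_suspend( 0 );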
> 
> Stepping back from the original problem of the IO worker thread not
> running or crashing, the real issue I see is detecting the error case
> in the CQ callback routine, where 'error' means the QP has transitioned
> to the ERROR state for some reason (system shutdown, device
> disable/remove, real live IB errors such as an unplugged cable).
> 
> By design the recv CQ callback handler __recv_cb_internal() processes
> the first 128 CQ entries (the default CQ size is 512) and then, if more
> recv CQ entries remain, queues the IO worker thread to finish the CQ
> processing, thus providing processor scheduling fairness under heavy
> IPoIB traffic.
> 
> My point is that in the error case the recv CQ callback handler should
> continue processing CQ entries until the CQ is empty and not invoke the
> IO worker thread.
> The problem with my previous patch was how it detected the QP error
> condition, as it added a lock in the speed path.
> 
> In __recv_cb_internal() we already have the 'bad_list' as an error
> detection mechanism (populated by recv_mgr_filter()).
> Looking further into the recv CQ processing code, a non-zero bad-list
> count is a way to recognize the error condition without a lock.
> 
> Normal case - bad_list_cnt == 0: invoke the IO worker thread if needed.
> Error case - bad_list_cnt > 0: keep processing CQ entries in the same
> thread until the CQ is empty.
> Error processing is now much faster and avoids the IO work item
> problems.
> 
> New patch:
> 
> --- C:/Documents and Settings/scsmith/Local Settings/Temp/ipoib_port.cpp-revBASE.svn000.tmp.cpp  Wed Sep 29 08:49:15 2010
> +++ C:/Documents and Settings/scsmith/My Documents/openIB-windows/SVN/gen1/trunk/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp  Wed Sep 29 08:50:00 2010
> @@ -2026,6 +2026,7 @@
>         ULONG                           recv_complete_flags = 0;
>         BOOLEAN                 res;
>         BOOLEAN                         WorkToDo = FALSE;
> +       size_t                  bad_list_cnt;
> 
>         PERF_DECLARE( RecvCompBundle );
>         PERF_DECLARE( RecvCb );
> @@ -2097,6 +2098,7 @@
>         p_port->recv_mgr.depth -= recv_cnt;
> 
>         /* Return any discarded receives to the pool */
> +       bad_list_cnt = cl_qlist_count( &bad_list );
>         cl_perf_start( PutRecvList );
>         __buf_mgr_put_recv_list( p_port, &bad_list );
>         cl_perf_stop( &p_port->p_adapter->perf, PutRecvList );
> @@ -2223,7 +2225,7 @@
>                 CL_ASSERT( status == IB_SUCCESS );
> 
>         } else {
> -               if (h_cq) {
> +               if (h_cq && bad_list_cnt == 0) {
>                         // increment reference to ensure no one releases the object while work item is queued
>                         ipoib_port_ref( p_port, ref_recv_cb );
>                         IoQueueWorkItem( p_port->pPoWorkItem,
>                                 __iopoib_WorkItem, DelayedWorkQueue, p_port);
> @@ -2245,9 +2247,13 @@
>         IN              const   ib_cq_handle_t                  h_cq,
>         IN                              void                    *cq_context )
>  {
> -       uint32_t recv_cnt;
> +       uint32_t        recv_cnt;
> +       boolean_t       WorkToDo;
> 
> -       __recv_cb_internal(h_cq, cq_context, &recv_cnt);
> +       do
> +       {
> +               WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
> +       } while( WorkToDo );
>  }
> 
> 
> thoughts?
> 
> stan.
> 
> >
> > Thanks
> > Tzachi
> >
> >> -----Original Message-----
> >> From: ofw-bounces at lists.openfabrics.org [mailto:ofw-
> >> bounces at lists.openfabrics.org] On Behalf Of Smith, Stan
> >> Sent: Monday, September 27, 2010 9:25 PM
> >> To: Alex Naslednikov
> >> Cc: ofw at lists.openfabrics.org
> >> Subject: Re: [ofw] [PATCH] checked ipoib_ndis6_cm shutdown race
> >>
> >> Alex Naslednikov wrote:
> >>> Hello Stan,
> >>> I liked the idea, but I suppose that one should take the spinlock
> >>> on p_port->obj here to ensure that there's no race right after the
> >>> "if". The code should be like:
> >>> } else {
> >>>         cl_obj_lock( &p_port->obj );
> >>>         if ( h_cq && p_port->state == IB_QPS_RTS ) {
> >>>                 // increment reference to ensure no one releases
> >>>                 // the object while work item is queued
> >>>                 ipoib_port_ref( p_port, ref_recv_cb );
> >>>                 IoQueueWorkItem( p_port->pPoWorkItem,
> >>>                         __iopoib_WorkItem, DelayedWorkQueue, p_port);
> >>>                 WorkToDo = FALSE;
> >>>         } else {
> >>>                 WorkToDo = TRUE;
> >>>         }
> >>>         cl_obj_unlock( &p_port->obj );
> >>> }
> >>
> >> True, although I hate to see a lock introduced into the speed path.
> >> Can you think of a way to keep the lock out of the speed path?
> >> Holding the lock over the call to IoQueueWorkItem() is not required;
> >> perhaps the following is good enough?
> >>
> >>         cl_obj_lock( &p_port->obj );
> >>         pstate = p_port->state;
> >>         cl_obj_unlock( &p_port->obj );
> >>         if ( h_cq && pstate == IB_QPS_RTS ) {
> >> ...
> >>
> >>
> >>> In addition, I found another potentially unsafe place:
> >>> Index: B:/users/xalex/MLNX_WinOF-2_1_2/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp
> >>> ===================================================================
> >>> --- B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp  (revision 6513)
> >>> +++ B:/users/xalex/MLNX_WinOF/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp  (working copy)
> >>> @@ -7002,8 +7002,10 @@
> >>>               ipoib_set_inactive( p_port->p_adapter );
> >>>               __endpt_mgr_reset_all( p_port );
> >>>       }
> >>
> >> I'm guessing you intended cl_obj_lock() here instead of unlock?
> >>
> >>> +             cl_obj_unlock( &p_port->obj );
> >>>               ASSERT(p_port->state == IB_QPS_INIT || p_port->state == IB_QPS_ERROR);
> >>>               p_port->state = IB_QPS_ERROR;
> >>> +             cl_obj_unlock( &p_port->obj );
> >>>               KeSetEvent( &p_port->sa_event, EVENT_INCREMENT, FALSE );
> >>>       }
> >>>
> >>> -----Original Message-----
> >>> From: Smith, Stan [mailto:stan.smith at intel.com]
> >>> Sent: Wednesday, September 22, 2010 6:22 PM
> >>> To: Alex Naslednikov
> >>> Cc: ofw at lists.openfabrics.org; Uri Habusha
> >>> Subject: RE: [PATCH] checked ipoib_ndis6_cm shutdown race
> >>>
> >>> Hello,
> >>>   Any thoughts on this patch?
> >>>
> >>> Smith, Stan wrote:
> >>>> Hello,
> >>>>   During system shutdown I have witnessed a checked ipoib_ndis6_cm
> >>>> IO work thread fail:
> >>>>
> >>>> 1) IO work thread is blocked from running due to scheduling
> >>>> priorities beyond the point in time at which port_destroy() wants
> >>>> to delete the port object [cl_obj_destroy( &p_port->obj )]. The
> >>>> port object delete fails (the ASSERT obj_ref > 0 fires) due to the
> >>>> outstanding port references incurred by the remaining posted recv
> >>>> buffers. The 1st 128 WorkRequests have been pulled from the CQ by
> >>>> __recv_cb_internal(), which then posts an IO work request to
> >>>> process the remaining 384 recv work requests. The IO work request
> >>>> does not run prior to port_destroy() being called.
> >>>>
> >>>> 2) The IO thread attempts to run but blows up (BSOD invalid memory
> >>>> reference) as port structures required by the IO work thread have
> >>>> been free()'ed.
> >>>>
> >>>> The fix is to recognize that the port is not in the IB_QPS_RTS
> >>>> state, not schedule an IO work thread request, and continue to pull
> >>>> recv work requests from the CQ until it is empty.
> >>>>
> >>>> Code snippets:
> >>>>       } else {
> >>>>               if ( h_cq && p_port->state == IB_QPS_RTS ) {
> >>>>                       // increment reference to ensure no one releases
> >>>>                       // the object while work item is queued
> >>>>                       ipoib_port_ref( p_port, ref_recv_cb );
> >>>>                       IoQueueWorkItem( p_port->pPoWorkItem,
> >>>>                               __iopoib_WorkItem, DelayedWorkQueue, p_port);
> >>>>                       WorkToDo = FALSE;
> >>>>               } else {
> >>>>                       WorkToDo = TRUE;
> >>>>               }
> >>>>       }
> >>>>
> >>>> __recv_cb(
> >>>>       IN              const   ib_cq_handle_t                  h_cq,
> >>>>       IN                              void                    *cq_context )
> >>>> {
> >>>>       uint32_t        recv_cnt;
> >>>>       boolean_t       WorkToDo;
> >>>>
> >>>>       do
> >>>>       {
> >>>>               WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
> >>>>       } while( WorkToDo );
> >>>> }
> >>>>
> >>>>
> >>>> --- A/ulp/ipoib_ndis6_cm/kernel/ipoib_port.cpp        Mon Sep 13 15:58:08 2010
> >>>> +++ B/ulp/ipoib_NDIS6_CM/kernel/ipoib_port.cpp        Mon Sep 20 08:47:08 2010
> >>>> @@ -2222,7 +2222,7 @@
> >>>>               CL_ASSERT( status == IB_SUCCESS );
> >>>>
> >>>>       } else {
> >>>> -             if (h_cq) {
> >>>> +             if ( h_cq && p_port->state == IB_QPS_RTS ) {
> >>>>                       // increment reference to ensure no one releases the object while work item is queued
> >>>>                       ipoib_port_ref( p_port, ref_recv_cb );
> >>>>                       IoQueueWorkItem( p_port->pPoWorkItem, __iopoib_WorkItem, DelayedWorkQueue, p_port);
> >>>> @@ -2244,9 +2244,13 @@
> >>>>       IN              const   ib_cq_handle_t                  h_cq,
> >>>>       IN                              void                    *cq_context )
> >>>>  {
> >>>> -     uint32_t recv_cnt;
> >>>> +     uint32_t        recv_cnt;
> >>>> +     boolean_t       WorkToDo;
> >>>>
> >>>> -     __recv_cb_internal(h_cq, cq_context, &recv_cnt);
> >>>> +     do
> >>>> +     {
> >>>> +             WorkToDo = __recv_cb_internal(h_cq, cq_context, &recv_cnt);
> >>>> +     } while( WorkToDo );
> >>>>  }
> >>
> >> _______________________________________________
> >> ofw mailing list
> >> ofw at lists.openfabrics.org
> >> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw



