[ofw] [PATCH 3/7] ib/cm: export CM only interface

Tue Jan 20 17:04:07 PST 2009

>>> +/*
>>> + * Called after polling a MAD from a CEP to parse the received CM
>>> message
>>> + * into readable event data.
>>> + */
>>> +void
>>> +kal_cep_format_event(
>>> +       IN                              ib_al_handle_t
>>> h_al,
>>> +       IN                              net32_t
>>> cid,
>>> +       IN                              ib_mad_element_t
>>> *p_mad,
>>> +       IN      OUT                     iba_cm_event
>>> *p_event)
>>> +{
>>> +       KLOCK_QUEUE_HANDLE      hdl;
>>> +       kcep_t                          *p_cep;
>>> +
>>> +       KeAcquireInStackQueuedSpinLock( &gp_cep_mgr->lock, &hdl );
>>> +       p_cep = __lookup_cep( h_al, cid );
>>> +       KeReleaseInStackQueuedSpinLock( &hdl );
>>
>> You can't release the lock here.  The CEP could be freed before you
>> reference it below.
>
> I think we're safe here.  This function is only invoked from within a
> callback referencing the cep in question, which is why I also skipped
> checking that the pointer was valid.

So there's a requirement then that the CEP be in the callback then?  It might be nice to store the thread handle of the thread invoking the callback and assert in this function that the CEP is invoking a callback in this same thread.

>>> trunk\docs\dontdiff.txt -I '\$Id:' trunk/core/al/kernel/al_ndi_cm.c
>>> branches\winverbs/core/al/kernel/al_ndi_cm.c ---
>>> trunk/core/al/kernel/al_ndi_cm.c    2008-10-29 11:23:54.390625000 -
>>> 0700 +++ branches\winverbs/core/al/kernel/al_ndi_cm.c
>>> 2009-01-12 01:53:28.062500000 -0800 @@ -1469,24 +1469,20 @@ static
>>> void  __ndi_fill_cm_rep(
>>>         IN              ib_qp_handle_t  const h_qp, IN
>>>         ual_ndi_rep_cm_ioctl_in_t
>>> *p_rep,
>>> -               OUT     ib_cm_rep_t
>>> *p_cm_rep)
>>> +               OUT     iba_cm_rep
>>> *p_cm_rep)
>>>  {
>>>         AL_ENTER( AL_DBG_NDI );
>>>
>>> -       memset( p_cm_rep, 0, sizeof(ib_cm_rep_t) );
>>> +       memset( p_cm_rep, 0, sizeof(iba_cm_rep) );
>>  Why not just use sizeof(*p_cm_rep) to avoid this if things change
>> again in the future?
>
> I don't think this works with the current compiler.

It should, as it's being done in other parts of the code (HCA driver, IBAT library, FMR pool in IBAL, OpenSM)

>>> -       p_cm_rep->p_rep_pdata = p_rep->pdata;
>>> -       p_cm_rep->rep_length = sizeof(p_rep->pdata);
>>> +       p_cm_rep->p_pdata = p_rep->pdata;
>>> +       p_cm_rep->pdata_len = sizeof(p_rep->pdata);
>>>
>>> -       p_cm_rep->qp_type = IB_QPT_RELIABLE_CONN;
>>> -       p_cm_rep->h_qp = h_qp;
>>> +       p_cm_rep->qpn = h_qp->num;
>>>
>>> -       p_cm_rep->access_ctrl = IB_AC_RDMA_READ | IB_AC_RDMA_WRITE |
>>> IB_AC_LOCAL_WRITE;
>>>         p_cm_rep->init_depth = p_rep->init_depth; -
>>>         p_cm_rep->target_ack_delay = 10; p_cm_rep->failover_accepted =
>>>         IB_FAILOVER_ACCEPT_UNSUPPORTED; p_cm_rep->flow_ctrl = TRUE;
>>>          /* HCAs must support end-to-
> end
>>> flow control. */
>>> -       p_cm_rep->rnr_nak_timeout = QP_ATTRIB_RNR_NAK_TIMEOUT;
>>>         p_cm_rep->rnr_retry_cnt = QP_ATTRIB_RNR_RETRY;
>>>
>>>         AL_EXIT( AL_DBG_NDI ); @@ -1499,7 +1495,7 @@ __ndi_send_rep(
>>>         IN              PIRP
>>> p_irp )
>>>  {
>>>         IO_STACK_LOCATION       *p_io_stack;
>>> -       ib_cm_rep_t cm_rep;
>>> +       iba_cm_rep cm_rep;
>>>         ib_qp_mod_t qp_mod;
>>>         ib_api_status_t status;
>>>         ual_ndi_rep_cm_ioctl_in_t *p_rep = @@ -1524,13 +1520,18 @@
>>> __ndi_send_rep(
>>>
>>>         /* Format ib_cm_req_t structure */
>>>         __ndi_fill_cm_rep( h_qp, p_rep, &cm_rep );
>>> +       qp_mod.state.rtr.rnr_nak_timeout = QP_ATTRIB_RNR_NAK_TIMEOUT;
>>>
>>>         ref_al_obj( &h_qp->obj ); /* Take CEP reference. */
>>>
>>>         /* prepare Passive CEP for connection */
>>> -       status = al_cep_pre_rep_ex(
>>> -               qp_get_al( h_qp ), p_rep->cid, __ndi_cm_handler,
>>> -               h_qp, deref_al_obj, &cm_rep,
>>> -               &((al_conn_qp_t*)h_qp)->cid, &qp_mod );
>>> +       status =
>>> +           kal_config_cep(qp_get_al( h_qp ), p_rep->cid,
>>> +           __ndi_cm_handler, h_qp, deref_al_obj);
>>  The al_cep_pre_rep_ex verified that the QP's CID was available,
>> preventing a QP that was in process of being destroyed, or one that was
>> already in a connection, from being used.  Basically the original
>> function atomically associated the QP with the CEP (through the CID
>> member), and the CEP with the QP (through the CEP context).
>
> What exactly is wrong with the changed code?

1. You assign the CID in the QP after the QP might have initiated destruction, leaking the CEP until the app exits.
2. You don't check that the QP isn't already connected.

> The al_cep_pre_rep_ex function didn't have any notion that it was
> dealing with a QP's CID.  It took as input a CID, a pointer to a CID
> that had to be set to INVALID, and on output copied the input CID to
> the location referenced.  The behavior of such an interface doesn't
> make much sense.  The function is trying to do too much.

All the CEP functions that potentially modify the CID stored in the QP take it as an in/out parameter (al_create_cep, al_destroy_cep, al_cep_pre_rep, al_cep_pre_rep_ex).  This was necessary to work around races with setting the CID value that couldn't be done with atomics.

It's fine to change the behavior as long as you don't break the functionality.  You might be able to take the QP object's lock to protect the QP's CID parameter, but then the other functions that take the CID by reference should also change.

>>> +       if ( status == IB_SUCCESS )
>>> +       {
>>> +               status = kal_cep_pre_rep(qp_get_al( h_qp ),
>>> + p_rep->cid,
>>> +
>>> &cm_rep, &qp_mod );
>>> +       }
>>>         if( status != IB_SUCCESS )
>>>         {
>>>                 IoFreeWorkItem(
>>> p_irp->Tail.Overlay.DriverContext[1]
>>> ); @@ -1539,7 +1540,7 @@ __ndi_send_rep(
>>>                 al_destroy_cep( qp_get_al( h_qp ), &p_rep->cid,
> FALSE
>>> );
>>>                 deref_al_obj( &h_qp->obj ); /* Release CEP
> reference.
>>  This dereference isn't always appropriate anymore.  If kal_config_cep
>> succeeded then the reference on the QP will be released when the CEP is
>> destroyed.  If kal_config_cep failed then the CEP isn't bound to the
> QP.
>
> This is confusing.  I think this may fix it:

<snip> I don't think either of those code snippets works.

> But I can't follow the referencing counting enough to know if the
> second option would work.

You must take the reference on the QP for the CID assignment before you assign the CID, so that if the QP is destroyed you never risk the ref count hitting zero when you're still dealing with the object.  If assigning the CID failed, you destroy the CEP and release the reference.  If the assignment succeeded (kal_config_cep succeeds) then you don't release the reference because it will be released when the CEP is destroyed.  So once you've succeeded calling kal_config_cep you don't release the reference, you only destroy the CEP.

>
>>> */
>>>                 AL_PRINT_EXIT( TRACE_LEVEL_ERROR, AL_DBG_ERROR,
>>> -                       ("al_cep_pre_rep_ex returned %s.\n",
>>> ib_get_err_str( status )) );
>>> +                       ("kal_cep_pre_rep returned %s.\n",
>>> ib_get_err_str( status )) );
>>>                 switch (status)
>>>                 {
>>>                         case IB_INVALID_HANDLE:
>>> @@ -1552,6 +1553,10 @@ __ndi_send_rep(
>>>                                 return ib_to_ntstatus( status );
>>>                 }
>>>         }
>>> +       else
>>> +       {
>>> +               ((al_conn_qp_t*)h_qp)->cid = p_rep->cid;
>>
>> You leak CEPs here if the QP gets destroyed by another thread while
>> in this call.  The QP won't be freed due to the reference held for
>> the current IOCTL, but the CID member could get set after the
>> 'destroying_qp' function runs and checks the CID.  This leaks a CEP
>> until the application exits.  I don't know off the top of my head if
>> there are other races introduced by this.  I think the really nasty
>> destruction race is still covered (see revision 1356 if you're
>> curious.)
>>
>> Adding the in/out parameter for the QP's assigned CID back into the
>> call would fix all these issues.
>
> The proper fix for this (short of moving ND above winverbs) is to
> block destruction of the QP while in this call (or any other IOCTL
> call that references the QP).

Blocking destruction can lead to other problems (not to mention imposing restrictions on the caller) unless you allow destruction to fail if the object is in use (but then the recovery logic is heinous too).  The QP won't get destroyed until all IOCTLs have returned, and all references (from async operations for example) have also been released.  However there is code that initiates destruction, and that is what needs serialization (and then only on the members that have potential contention), so that a CEP is destroys and aborts any connection establishment that may be in progress in order to give a timely destruction of the QP.

In WinVerbs, if you destroy an endpoint, does it disconnect if it's connected, or do you not track the CEP to QP relationship?

> The in/out parameter to copy the CID is
> kludgy and pushes a solution to a personal reference count problem
> down to the lowest level CM code.

It was moved down because the CM already provided the proper level of synchronization, which was known not to cause deadlocks.  Using the QP object's lock may or may not work.  We could add a CID lock to the QP and then know that there won't be deadlocking.

Note that the CID assignment issue was not an NetworkDirect issue - it affected SRP, see the mail thread titled "OFED 1.3/WinOF 1.1/Win2k3R2X64 BSOD" for details (6/30/2008 or so).

> Looking at just the __ndi_send_rep call, it takes as input the QP
> handle, meaning a reference to the handle must exist.

Yes, that reference is taken when the handle is validated in ndi_rep_cm.

> The function then acquires 2 more references on that handle.

It takes a reference for the work item, which will get queued and processed asynchronously, after the IOCTL handler has returned.  Not sure how you'd handle this, but taking the reference explicitly makes it much easier to see where/why references are taken.  Handing off references taken in one place for one purpose to another place for another purpose makes reference counting very hard to debug.  In this case, if you have an work item allocated, you take a reference on the object that the work item will reference.  Yes, the code should probably have delayed taking the reference until the work item was queued rather than when it was allocated.

Then there's a reference for the CEP, since destroying the CEP will call deref_al_obj.  I don't see what the problem is here.  The IOCTL unwinds and returns to the caller without blocking for the RTU, but isn't completed until the RTU is received.

> But wait!  That's not
> all!  If you order now, ndi_rep_cm, which calls __ndi_send_rep, takes
> another reference.

Yes, ndi_rep_cm is the IOCTL handler.  It takes a reference on the QP during handle validation.  This is a good thing, preventing the object from being freed while in use (prevents crashing the system).

> But we're not done yet!  __ndi_rep_cm, which calls
> ndi_rep_cm, also takes a reference on the QP.

The ndi_rep_cm function queues the IRP in the QP's IRP queue.  You don't want the QP to be freed until the IRP is completed.  Note that the IRP could get removed from the IRP queue through cancelation, but the cancelation routine accesses the QP to properly clean things up. So the reference is needed here again, unless you want the cancel routine to potentially access a freed QP.

>  That's right, for the
> one low cost of only 1 IOCTL, we'll give you up to 4 references on the
> same QP!

Ooooo, those 4 atomic increments are *so* expensive.

> I'm not surprised at all that 'nasty destruction' races exist in the
> code.

The races existed before NetworkDirect support was added.

> Between the arbitrary reference counting and cleanup/destroy
> callback destruction model, I doubt that anyone can figure out what the
> code is doing. sigh...

Just because you don't understand it doesn't make it arbitrary. ;)  Trying to minimize the references would yield much harder to understand code.

> Back to the actual code... the main difference between my change and
> the existing code is that the cid was copied while holding the CEP
> manager's lock in the old code.  However, I don't see how that
> matters, since the QP CID is referenced during destruction without
> that lock being held anyway.  (I don't see where the QP CID is
> obviously referenced under any sort of lock, but I only did a quick
> scan of the code.)  I'm not sure any new races are added.

See the list of 4 functions that only modify the CID if it has particular states.  You can create a lock to protect the CID member of the QP if you'd like to move it out of the CEP.  You can't however just move it out without providing the same checks/synchronization.

> A more substantial fix is probably needed when getting/setting the
> QP's CID value.  One that includes proper synchronization and can
> verify that the QP's CID is valid based on some state.

That's what the code does today.  The CID has one initial value that indicates that it is invalid, but available: AL_INVALID_CID.  Then there's another special value that indicates that the CID is not valid, and cannot be used: AL_RESERVED_CID.  Then there are valid CID values that map to CEPs.  The locking is provided by the CEP manager.

You could break the state out as a different member in the QP, along with a lock, and accomplish the same thing.  You'd have to use a spinlock though as the CEP functions can all be called at dispatch level.  If you want to have blocking destruction, you can signal an event from the CEP's destroy callback.