[ofw] [PATCH 3/7] ib/cm: export CM only interface

Tue Jan 20 11:59:23 PST 2009

>> +static NTSTATUS
>> +cm_send_req(iba_cm_id *p_id, iba_cm_req *p_req)
>> +{
>> +       ib_qp_mod_t             attr;
>> +       ib_api_status_t ib_status;
>> +
>> +       attr.state.rtr.rnr_nak_timeout = 0;
>
>Why set rnr_nak_timeout?

The new CM interface does not expose this value, since it's not carried in any
CM message.  However, the existing CM code expects that the value will be
provided, so I didn't want to try to remove it completely.

>> +       ib_status = kal_cep_pre_req(gh_al, p_id->cid, p_req, &attr);
>
>So the attr parameter is marked as in/out, but it only seems to be used as an
>input parameter, and then only the rnr_nak_timeout is used - why not just pass
>in the rnr_nak_timeout directly rather than via the structure?

The attributes are output from the underlying CEP code and are used by other
callers.

I'll look at adding the rnr_nak_timeout as a separate parameter.  It shouldn't
be too hard. 

>> +       if (ib_status != IB_SUCCESS) {
>> +               return ib_to_ntstatus(ib_status);
>> +       }
>> +
>> +       ib_status = al_cep_send_req(gh_al, p_id->cid);
>> +       return ib_to_ntstatus(ib_status);
>> +}
>
><snip>
>
>> +static NTSTATUS
>> +cm_send_mra(iba_cm_id *p_id, uint8_t service_timeout,
>> +                       void *p_pdata, uint8_t pdata_len)
>> +{
>> +       ib_cm_mra_t             mra;
>> +       ib_api_status_t ib_status;
>> +
>> +       mra.svc_timeout = service_timeout;
>> +       mra.p_mra_pdata = p_pdata;
>> +       mra.mra_length = pdata_len;
>> +
>> +       ib_status = al_cep_mra(gh_al, p_id->cid, &mra);
>
>Why not change the al_cep_mra function to take these parameters explicitly
>rather than through the ib_cm_mra_t structure?

I wasn't wanting to completely rework all areas of the existing CM code.  This
can all be changed later, since the calls are only used within the existing
driver.  I changed the req/rep calls in order to use the caller's PSN values,
rather than the CMs.

>> +static NTSTATUS
>> +cm_send_lap(iba_cm_id *p_id, iba_cm_lap *p_lap)
>> +{
>> +       ib_cm_lap_t             lap;
>> +       ib_api_status_t ib_status;
>> +
>> +       RtlZeroMemory(&lap, sizeof lap);
>> +       lap.p_lap_pdata = p_lap->p_pdata;
>> +       lap.lap_length = p_lap->pdata_len;
>> +       lap.remote_resp_timeout = p_lap->remote_resp_timeout;
>> +       lap.p_alt_path = p_lap->p_alt_path;
>> +
>> +       ib_status = al_cep_lap(gh_al, p_id->cid, &lap);
>
>Same here, just change al_cep_lap to take the params explicitly.  We expect the
>common case going forward to be the RDMA CM, so we should optimize for that.

I'd prefer to wait on these optimizations and submit them separately.  I'll add
notes to change these, but I'm not sure there's anything that will test the MRA,
LAP, APR calls.  Also, note that the al_cep_* calls are shared between userspace
and the kernel.

>> +       return ib_to_ntstatus(ib_status);
>> +}
>> +
>> +static NTSTATUS
>> +cm_send_apr(iba_cm_id *p_id, iba_cm_apr *p_apr)
>> +{
>> +       ib_cm_apr_t             apr;
>> +       ib_qp_mod_t             attr;
>> +       ib_api_status_t ib_status;
>> +
>> +       RtlZeroMemory(&apr, sizeof apr);
>> +       apr.p_apr_pdata = p_apr->p_pdata;
>> +       apr.apr_length = p_apr->pdata_len;
>> +       apr.apr_status = p_apr->status;
>> +       apr.info_length = p_apr->info_length;
>> +       apr.p_info = p_apr->p_info;
>> +
>> +       ib_status = al_cep_pre_apr(gh_al, p_id->cid, &apr, &attr);
>
>Make attr an optional parameter.
>
>I would change al_cep_pre_apr to take the iba_cm_apr structure rather than the
>ib_cm_apr_t structure.
>
><snip>
>
>> diff -up -r -X trunk\docs\dontdiff.txt -I '\$Id:'
>> trunk/core/al/kernel/al_cm_cep.c
>> branches\winverbs/core/al/kernel/al_cm_cep.c
>> --- trunk/core/al/kernel/al_cm_cep.c    2008-07-16 08:53:56.469125000 -
>> 0700
>> +++ branches\winverbs/core/al/kernel/al_cm_cep.c        2009-01-13
>> 23:13:27.250000000 -0800
>> @@ -3650,6 +3652,60 @@ __bind_cep(
>>         cl_spinlock_release( &h_al->obj.lock );
>>  }
>>
>> +ib_api_status_t
>
>Can you make this return NTSTATUS?  The more of these calls you can switch to
>NTSTATUS the better IMO.  Eventually we should get rid of the ib_status_t
>altogether (already done for the cl_status_t)

Hmm... I'm not sure that we want the CEP code to have functions that return
different types of status values.  I kept it NTSTATUS at the upper level.  I
would have a separate patch rework the status values at the CEP level.

Basically, there's a lot of cleanup that could be done at the CEP level, and I
would rather make those changes using stepwise refinement.  This includes
separating the code into userspace only or kernel only code.

>> +kal_alloc_cep(
>> +       IN                              ib_al_handle_t
>> h_al,
>> +       IN      OUT                     net32_t* const
>> p_cid )
>> +{
>> +       kcep_t                          *p_cep;
>> +       KLOCK_QUEUE_HANDLE      hdl;
>> +       ib_api_status_t         status = IB_SUCCESS;
>
>You don't need status as a variable at all.

ok

>> +
>> +       KeAcquireInStackQueuedSpinLock( &gp_cep_mgr->lock, &hdl );
>> +       p_cep = __create_cep();
>> +       KeReleaseInStackQueuedSpinLock( &hdl );
>> +
>> +       if( p_cep )
>> +       {
>> +               __bind_cep(p_cep, h_al, NULL, NULL);
>> +               *p_cid = p_cep->cid;
>> +       }
>> +       else
>> +       {
>> +               status = IB_INSUFFICIENT_MEMORY;
>
>Just return in the error case (return early).  Allows outdenting the body of
>the if above.
>
>> +       }
>> +
>> +       return status;
>> +}
>> +
>> +ib_api_status_t
>> +kal_config_cep(
>> +       IN                              ib_al_handle_t
>> h_al,
>> +       IN                              net32_t
>> cid,
>> +       IN                              al_pfn_cep_cb_t
>> pfn_cb,
>> +       IN                              void*
>> context,
>> +       IN                              ib_pfn_destroy_cb_t
>> pfn_destroy_cb )
>> +{
>> +       kcep_t                          *p_cep;
>> +       KLOCK_QUEUE_HANDLE      hdl;
>> +       ib_api_status_t         status = IB_SUCCESS;
>> +
>> +       KeAcquireInStackQueuedSpinLock( &gp_cep_mgr->lock, &hdl );
>> +       p_cep = __lookup_cep( h_al, cid );
>> +       if (!p_cep )
>> +       {
>> +               status = IB_INVALID_PARAMETER;
>
>Most of the other calls return INVALID_HANDLE here.

I'll fix this.

>> +               goto out;
>> +       }
>> +
>> +       p_cep->pfn_cb = pfn_cb;
>> +       p_cep->context = context;
>> +       p_cep->pfn_destroy_cb = pfn_destroy_cb;
>> +
>> +out:
>> +       KeReleaseInStackQueuedSpinLock( &hdl );
>> +       return status;
>> +}
>>
>>  static inline void
>>  __unbind_cep(
>
><snip>
>
>> +/*
>> + * Called after polling a MAD from a CEP to parse the received CM
>> message
>> + * into readable event data.
>> + */
>> +void
>> +kal_cep_format_event(
>> +       IN                              ib_al_handle_t
>> h_al,
>> +       IN                              net32_t
>> cid,
>> +       IN                              ib_mad_element_t
>> *p_mad,
>> +       IN      OUT                     iba_cm_event
>> *p_event)
>> +{
>> +       KLOCK_QUEUE_HANDLE      hdl;
>> +       kcep_t                          *p_cep;
>> +
>> +       KeAcquireInStackQueuedSpinLock( &gp_cep_mgr->lock, &hdl );
>> +       p_cep = __lookup_cep( h_al, cid );
>> +       KeReleaseInStackQueuedSpinLock( &hdl );
>
>You can't release the lock here.  The CEP could be freed before you reference
>it below.

I think we're safe here.  This function is only invoked from within a callback
referencing the cep in question, which is why I also skipped checking that the
pointer was valid.

>> +
>> +       switch (p_mad->p_mad_buf->attr_id) {
>> +       case CM_REQ_ATTR_ID:
>> +               if (p_mad->status == IB_SUCCESS) {
>> +                       p_event->type = iba_cm_req_received;
>> +                       __format_event_req(p_cep, (mad_cm_req_t*)
>> p_mad->p_mad_buf, &p_event->data.req);
>> +               } else {
>> +                       p_event->type = iba_cm_req_error;
>> +               }
>> +               break;
>> +       case CM_REP_ATTR_ID:
>> +               if (p_mad->status == IB_SUCCESS) {
>> +                       p_event->type = iba_cm_rep_received;
>> +                       __format_event_rep((mad_cm_rep_t*) p_mad-
>> >p_mad_buf, &p_event->data.rep);
>> +               } else {
>> +                       p_event->type = iba_cm_rep_error;
>> +               }
>> +               break;
>> +       case CM_RTU_ATTR_ID:
>> +               p_event->type = iba_cm_rtu_received;
>> +               __format_event_pdata(p_cep, &p_event->data.rtu);
>> +               break;
>> +       case CM_DREQ_ATTR_ID:
>> +               if (p_mad->status == IB_SUCCESS) {
>> +                       p_event->type = iba_cm_dreq_received;
>> +                       __format_event_pdata(p_cep, &p_event-
>> >data.dreq);
>> +               } else {
>> +                       p_event->type = iba_cm_dreq_error;
>> +               }
>> +               break;
>> +       case CM_DREP_ATTR_ID:
>> +               p_event->type = iba_cm_drep_received;
>> +               __format_event_pdata(p_cep, &p_event->data.drep);
>> +               break;
>> +       case CM_REJ_ATTR_ID:
>> +               p_event->type = iba_cm_rej_received;
>> +               __format_event_rej((mad_cm_rej_t*) p_mad->p_mad_buf,
>> &p_event->data.rej);
>> +               break;
>> +       case CM_MRA_ATTR_ID:
>> +               p_event->type = iba_cm_mra_received;
>> +               __format_event_mra((mad_cm_mra_t*) p_mad->p_mad_buf,
>> &p_event->data.mra);
>> +               break;
>> +       case CM_LAP_ATTR_ID:
>> +               if (p_mad->status == IB_SUCCESS) {
>> +                       p_event->type = iba_cm_lap_received;
>> +                       // TODO: format lap event
>> +               } else {
>> +                       p_event->type = iba_cm_lap_error;
>> +               }
>> +               break;
>> +       case CM_APR_ATTR_ID:
>> +               p_event->type = iba_cm_apr_received;;
>> +               // TODO: format apr event
>> +               break;
>> +       case CM_SIDR_REQ_ATTR_ID:
>> +               if (p_mad->status == IB_SUCCESS) {
>> +                       p_event->type = iba_cm_sidr_req_received;
>> +                       // TODO: format sidr req event
>> +               } else {
>> +                       p_event->type = iba_cm_sidr_req_error;
>> +               }
>> +               break;
>> +       case CM_SIDR_REP_ATTR_ID:
>> +               p_event->type = iba_cm_sidr_rep_received;
>> +               // TODO: format sidr rep event
>> +               break;
>> +       default:
>> +               CL_ASSERT(0);
>> +       }
>> +}
>> diff -up -r -X trunk\docs\dontdiff.txt -I '\$Id:'
>> trunk/core/al/kernel/al_ndi_cm.c
>> branches\winverbs/core/al/kernel/al_ndi_cm.c
>> --- trunk/core/al/kernel/al_ndi_cm.c    2008-10-29 11:23:54.390625000 -
>> 0700
>> +++ branches\winverbs/core/al/kernel/al_ndi_cm.c        2009-01-12
>> 01:53:28.062500000 -0800
>> @@ -1469,24 +1469,20 @@ static void
>>  __ndi_fill_cm_rep(
>>         IN              ib_qp_handle_t  const
>> h_qp,
>>         IN              ual_ndi_rep_cm_ioctl_in_t
>> *p_rep,
>> -               OUT     ib_cm_rep_t
>> *p_cm_rep)
>> +               OUT     iba_cm_rep
>> *p_cm_rep)
>>  {
>>         AL_ENTER( AL_DBG_NDI );
>>
>> -       memset( p_cm_rep, 0, sizeof(ib_cm_rep_t) );
>> +       memset( p_cm_rep, 0, sizeof(iba_cm_rep) );
>
>Why not just use sizeof(*p_cm_rep) to avoid this if things change again in the
>future?

I don't think this works with the current compiler.

>> -       p_cm_rep->p_rep_pdata = p_rep->pdata;
>> -       p_cm_rep->rep_length = sizeof(p_rep->pdata);
>> +       p_cm_rep->p_pdata = p_rep->pdata;
>> +       p_cm_rep->pdata_len = sizeof(p_rep->pdata);
>>
>> -       p_cm_rep->qp_type = IB_QPT_RELIABLE_CONN;
>> -       p_cm_rep->h_qp = h_qp;
>> +       p_cm_rep->qpn = h_qp->num;
>>
>> -       p_cm_rep->access_ctrl = IB_AC_RDMA_READ | IB_AC_RDMA_WRITE |
>> IB_AC_LOCAL_WRITE;
>>         p_cm_rep->init_depth = p_rep->init_depth;
>> -       p_cm_rep->target_ack_delay = 10;
>>         p_cm_rep->failover_accepted = IB_FAILOVER_ACCEPT_UNSUPPORTED;
>>         p_cm_rep->flow_ctrl = TRUE;     /* HCAs must support end-to-end
>> flow control. */
>> -       p_cm_rep->rnr_nak_timeout = QP_ATTRIB_RNR_NAK_TIMEOUT;
>>         p_cm_rep->rnr_retry_cnt = QP_ATTRIB_RNR_RETRY;
>>
>>         AL_EXIT( AL_DBG_NDI );
>> @@ -1499,7 +1495,7 @@ __ndi_send_rep(
>>         IN              PIRP
>> p_irp )
>>  {
>>         IO_STACK_LOCATION       *p_io_stack;
>> -       ib_cm_rep_t cm_rep;
>> +       iba_cm_rep cm_rep;
>>         ib_qp_mod_t qp_mod;
>>         ib_api_status_t status;
>>         ual_ndi_rep_cm_ioctl_in_t *p_rep =
>> @@ -1524,13 +1520,18 @@ __ndi_send_rep(
>>
>>         /* Format ib_cm_req_t structure */
>>         __ndi_fill_cm_rep( h_qp, p_rep, &cm_rep );
>> +       qp_mod.state.rtr.rnr_nak_timeout = QP_ATTRIB_RNR_NAK_TIMEOUT;
>>
>>         ref_al_obj( &h_qp->obj ); /* Take CEP reference. */
>>
>>         /* prepare Passive CEP for connection */
>> -       status = al_cep_pre_rep_ex(
>> -               qp_get_al( h_qp ), p_rep->cid, __ndi_cm_handler, h_qp,
>> deref_al_obj,
>> -               &cm_rep, &((al_conn_qp_t*)h_qp)->cid, &qp_mod );
>> +       status = kal_config_cep(qp_get_al( h_qp ), p_rep->cid,
>> __ndi_cm_handler,
>> +                                                       h_qp,
>> deref_al_obj);
>
>The al_cep_pre_rep_ex verified that the QP's CID was available, preventing a QP
>that was in process of being destroyed, or one that was already in a
>connection, from being used.  Basically the original function atomically
>associated the QP with the CEP (through the CID member), and the CEP with the
>QP (through the CEP context).

What exactly is wrong with the changed code?

The al_cep_pre_rep_ex function didn't have any notion that it was dealing with a
QP's CID.  It took as input a CID, a pointer to a CID that had to be set to
INVALID, and on output copied the input CID to the location referenced.  The
behavior of such an interface doesn't make much sense.  The function is trying
to do too much. 

>
>> +       if ( status == IB_SUCCESS )
>> +       {
>> +               status = kal_cep_pre_rep(qp_get_al( h_qp ), p_rep->cid,
>> +
>> &cm_rep, &qp_mod );
>> +       }
>>         if( status != IB_SUCCESS )
>>         {
>>                 IoFreeWorkItem( p_irp->Tail.Overlay.DriverContext[1] );
>> @@ -1539,7 +1540,7 @@ __ndi_send_rep(
>>                 al_destroy_cep( qp_get_al( h_qp ), &p_rep->cid, FALSE
>> );
>>                 deref_al_obj( &h_qp->obj ); /* Release CEP reference.
>
>This dereference isn't always appropriate anymore.  If kal_config_cep succeeded
>then the reference on the QP will be released when the CEP is destroyed.  If
>kal_config_cep failed then the CEP isn't bound to the QP.

This is confusing.  I think this may fix it:

	ref_al_obj( &h_qp->obj );
	status = kal_config_cep(qp_get_al( h_qp ), p_rep->cid, __ndi_cm_handler,
							h_qp, deref_al_obj);
	if ( status == IB_SUCCESS )
	{
		status = kal_cep_pre_rep(qp_get_al( h_qp ), p_rep->cid,
								 &cm_rep,
&qp_mod );
	}
	else
	{
		al_destroy_cep( qp_get_al( h_qp ), &p_rep->cid, FALSE );
	}
	if( status != IB_SUCCESS )
	{
		IoFreeWorkItem( p_irp->Tail.Overlay.DriverContext[1] );
		p_irp->Tail.Overlay.DriverContext[1] = NULL;
		deref_al_obj( &h_qp->obj ); /* Release work item reference. */
		al_destroy_cep( qp_get_al( h_qp ), &p_rep->cid, FALSE );
		AL_PRINT_EXIT( TRACE_LEVEL_ERROR, AL_DBG_ERROR,
			("kal_cep_pre_rep returned %s.\n", ib_get_err_str(
status )));
		...

** or **

	status = kal_config_cep(qp_get_al( h_qp ), p_rep->cid, __ndi_cm_handler,
							h_qp, deref_al_obj);
	if ( status == IB_SUCCESS )
	{
		ref_al_obj( &h_qp->obj );
		status = kal_cep_pre_rep(qp_get_al( h_qp ), p_rep->cid,
								 &cm_rep,
&qp_mod );
	}

But I can't follow the referencing counting enough to know if the second option
would work.

>> */
>>                 AL_PRINT_EXIT( TRACE_LEVEL_ERROR, AL_DBG_ERROR,
>> -                       ("al_cep_pre_rep_ex returned %s.\n",
>> ib_get_err_str( status )) );
>> +                       ("kal_cep_pre_rep returned %s.\n",
>> ib_get_err_str( status )) );
>>                 switch (status)
>>                 {
>>                         case IB_INVALID_HANDLE:
>> @@ -1552,6 +1553,10 @@ __ndi_send_rep(
>>                                 return ib_to_ntstatus( status );
>>                 }
>>         }
>> +       else
>> +       {
>> +               ((al_conn_qp_t*)h_qp)->cid = p_rep->cid;
>
>You leak CEPs here if the QP gets destroyed by another thread while in this
>call.  The QP won't be freed due to the reference held for the current IOCTL,
>but the CID member could get set after the 'destroying_qp' function runs and
>checks the CID.  This leaks a CEP until the application exits.  I don't know
>off the top of my head if there are other races introduced by this.  I think
>the really nasty destruction race is still covered (see revision 1356 if you're
>curious.)
>
>Adding the in/out parameter for the QP's assigned CID back into the call would
>fix all these issues.

The proper fix for this (short of moving ND above winverbs) is to block
destruction of the QP while in this call (or any other IOCTL call that
references the QP).  The in/out parameter to copy the CID is kludgy and pushes a
solution to a personal reference count problem down to the lowest level CM code.

Looking at just the __ndi_send_rep call, it takes as input the QP handle,
meaning a reference to the handle must exist.  The function then acquires 2 more
references on that handle.  But wait!  That's not all!  If you order now,
ndi_rep_cm, which calls __ndi_send_rep, takes another reference.  But we're not
done yet!  __ndi_rep_cm, which calls ndi_rep_cm, also takes a reference on the
QP.  That's right, for the one low cost of only 1 IOCTL, we'll give you up to 4
references on the same QP!

I'm not surprised at all that 'nasty destruction' races exist in the code.
Between the arbitrary reference counting and cleanup/destroy callback
destruction model, I doubt that anyone can figure out what the code is doing.
sigh...

Back to the actual code... the main difference between my change and the
existing code is that the cid was copied while holding the CEP manager's lock in
the old code.  However, I don't see how that matters, since the QP CID is
referenced during destruction without that lock being held anyway.  (I don't see
where the QP CID is obviously referenced under any sort of lock, but I only did
a quick scan of the code.)  I'm not sure any new races are added.

A more substantial fix is probably needed when getting/setting the QP's CID
value.  One that includes proper synchronization and can verify that the QP's
CID is valid based on some state.

- Sean