[ofw] [RFC] ib cm: export CM only interface

Tue Nov 18 13:00:32 PST 2008

>There's already a function that converts ib_status values to NTSTATUS ones,
>that handles all ib_status values.  I don't have the code handy, but I believe
>it is in al_dev.c, and I don't remember the name offhand, but it had the word
>'ntstatus' in it so a search should be fruitful.

I will look for it.
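
Until I find it, the convert_ib_status() call used in cm_get_qp_attr below
amounts to a switch along the following lines.  This is only a sketch: the
sk_ types and values are hypothetical stand-ins for ib_api_status_t and
NTSTATUS, and the real helper in al_dev.c presumably covers far more cases.

```c
#include <assert.h>

/* Hypothetical stand-ins for ib_api_status_t and NTSTATUS.  The real
 * definitions come from the IBAL and WDK headers and have many more values. */
typedef enum {
    SK_IB_SUCCESS,
    SK_IB_INSUFFICIENT_MEMORY,
    SK_IB_INVALID_PARAMETER,
    SK_IB_ERROR
} sk_ib_status_t;

typedef enum {
    SK_STATUS_SUCCESS,
    SK_STATUS_NO_MEMORY,
    SK_STATUS_INVALID_PARAMETER,
    SK_STATUS_UNSUCCESSFUL
} sk_ntstatus_t;

/* Map each known IB status to its closest NT status.  Anything unrecognized
 * collapses to a generic failure, so no IB error is reported as success. */
static sk_ntstatus_t sk_convert_ib_status(sk_ib_status_t ib_status)
{
    switch (ib_status) {
    case SK_IB_SUCCESS:             return SK_STATUS_SUCCESS;
    case SK_IB_INSUFFICIENT_MEMORY: return SK_STATUS_NO_MEMORY;
    case SK_IB_INVALID_PARAMETER:   return SK_STATUS_INVALID_PARAMETER;
    default:                        return SK_STATUS_UNSUCCESSFUL;
    }
}

/* Small self-check showing the intended behavior of the mapping. */
static void sk_convert_self_check(void)
{
    assert(sk_convert_ib_status(SK_IB_SUCCESS) == SK_STATUS_SUCCESS);
    assert(sk_convert_ib_status(SK_IB_ERROR) == SK_STATUS_UNSUCCESSFUL);
}
```

The default case is the part that matters: a status the switch does not know
about should never fall through as success.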

>> static void
>> cm_cep_handler(const ib_al_handle_t h_al, const net32_t cid)
>> {
>>         void                            *context;
>>         net32_t                         new_cid;
>>         ib_mad_element_t        *mad;
>>         iba_cm_id                       *id, *listen_id;
>>
>>         while (al_cep_poll(h_al, cid, &context, &new_cid, &mad) == IB_SUCCESS) {
>>
>>                 if (new_cid == AL_INVALID_CID) {
>>                         id = (iba_cm_id *) context;
>>                 } else {
>>                         listen_id = (iba_cm_id *) context;
>>
>>                         id = ExAllocatePoolWithTag(NonPagedPool, sizeof(iba_cm_id), 'mcbi');
>>                         if (id == NULL) {
>>                                 al_destroy_cep(gh_al, &new_cid, FALSE);
>
>Note that all new CEPs should probably be created in the CEP manager with a
>NULL callback.  Since the CEPs inherit the listen CEPs callback, I think it's
>possible a callback for a new CEP would be invoked (say a REJ due to timeout)
>before the new CEP was retrieved.  If the callback pointer was NULL until the
>REP call, you would be safe.

The callbacks for a CEP should be serialized, or it's extremely difficult to
recover from an error.  Trying to use the callback pointer won't work.  If the
user sets the callback pointer from within a callback, then they will still get
a second callback on the same CEP.  The easiest solution for the REJ case is to
just drop the MAD.  If the user tries to send a REP, it will just be rejected at
that point.  This situation should not be common in practice anyway.

>>                                 ib_put_mad(mad);
>>                                 continue;
>>                         }
>>
>>                         id->context = listen_id;
>>                         id->callback = listen_id->callback;
>>                         id->cid = new_cid;
>>                 }
>>
>>                 id->callback(id, mad->p_mad_buf);
>
>How does someone get the new CM ID before calling iba_cm_rep?

I'm not following you.  'id' here is the new cm_id.

>
>What happens if they need the MAD contents in a different thread context - do
>they have to allocate/copy?  Why not just hand them the MAD and have them be
>responsible for freeing it - this lets them store it if they need it while
>changing to a passive level thread context (if they need it).  I think this
>would be better, even if you end up with a wrapper for ib_put_mad.

Unless a user needs to store the MAD, I plan on freeing it after the callback.
This may change once more of the kernel code is written and I know whether the
MAD needs to be kept by the user.  What I don't want is for users to have to queue
MADs.
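
With that model, a user who does need the MAD contents in another thread
context copies them out during the callback, roughly as sketched below.  All
sk_ names are hypothetical stand-ins (the real types are iba_cm_id and
ib_mad_t), and the memset in sk_dispatch only marks the spot where the CM
would call ib_put_mad.

```c
#include <assert.h>
#include <string.h>

#define SK_MAD_SIZE 256                 /* IB MADs are 256 bytes */

typedef struct { unsigned char data[SK_MAD_SIZE]; } sk_mad_t;
typedef struct sk_cm_id { void *context; } sk_cm_id;

/* Per-connection user state with room for one saved MAD. */
typedef struct {
    sk_mad_t saved;
    int      has_mad;
} sk_user_ctx;

/* User callback: the MAD is only valid for the duration of the call, so
 * anything needed later is copied into user-owned storage. */
static void sk_user_callback(sk_cm_id *id, const sk_mad_t *mad)
{
    sk_user_ctx *ctx = id->context;

    memcpy(&ctx->saved, mad, sizeof(*mad));
    ctx->has_mad = 1;
}

/* CM side: invoke the callback, then release the MAD.  Zeroing the buffer
 * stands in for returning it to the MAD pool. */
static void sk_dispatch(sk_cm_id *id, sk_mad_t *mad)
{
    assert(id != NULL && mad != NULL);
    sk_user_callback(id, mad);
    memset(mad, 0, sizeof(*mad));
}
```

The cost is one 256-byte copy for users who defer work; users who handle the
event in the callback pay nothing and never have to free a MAD.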

>How is your IOCTL interface going to work?  Will it have an event that will
>give it the MAD too?  Will the user-mode library be callback driven, or event
>driven?

The user-mode library does not have threads.  There have been some changes made
to the CM portion of the WinVerbs API, but those have dealt with exchanging
address information.

>The al_cep_get_pdata function was added so that the private data could be
>retrieved after a REQ received, but in an entirely different call context.  In
>the ND case (and I think what you defined for WinVerbs), the client gets an
>event on their listen object that completes directly to the user (Win32
>overlapped operation).  The client then needs to retrieve the information from
>the received MAD (private data, responder resources, initiator depth) and this
>was done via al_cep_get_pdata.

WinVerbs defines a Query() routine to get the current endpoint attributes.  Only
whatever private data was last received is maintained.

>> static NTSTATUS
>> cm_create_id(void (*callback)(iba_cm_id *p_id, ib_mad_t *p_mad),
>>                          void *context, iba_cm_id **pp_id)
>> {
>>         iba_cm_id               *id;
>>         ib_api_status_t ib_status;
>>
>>         id = ExAllocatePoolWithTag(NonPagedPool, sizeof(iba_cm_id), 'mcbi');
>>         if (id == NULL) {
>>                 return STATUS_NO_MEMORY;
>>         }
>>
>>         id->callback = callback;
>>         id->context = context;
>>
>>         ib_status = al_create_cep(gh_al, cm_cep_handler, id, NULL, &id->cid);
>
>You'll probably want a destroy callback here, so that you can either block or
>release a reference on your ID structure when you destroy its underlying CEP.

I want 'no callback' to indicate that the destruction should be synchronous.  (I
thought the al_obj stuff did this.)  When cm_destroy_id returns, no callbacks
should be received by the user.  Handling device removal is difficult without
this.  (Heck, it's difficult with it.)  There is some synchronization between
the callback threads and destruction already; I'm just not sure it's sufficient.

>> static void
>> cm_destroy_id(iba_cm_id *p_id)
>> {
>>         al_destroy_cep(gh_al, &p_id->cid, FALSE);
>
>The al_destroy_cep function does not block, so you could receive a callback
>after you free the ID.  You need a way to mark the ID freed so that the handler
>doesn't invoke the callback.  You then need to do reference counting on your ID
>structures so that they can be freed after the CEP manager is done with them.
>Alternatively, you can allocate an event/block until the CEP is freed (your
>destroy callback is invoked).

What is the threading at the MAD level calling back to the CM?  Is there a
single dispatch thread?  Several?
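
For reference, the mark-and-refcount approach suggested above could look
something like the sketch below.  Everything here is hypothetical (sk_ names,
the two-reference scheme, the freed test hook), and sk_cep_released stands in
for a destroy callback from the CEP manager.  Note that it only prevents new
callbacks and use-after-free; a callback already running when sk_destroy is
called can still be in flight when it returns, which is why the threading
question matters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct sk_cm_id {
    atomic_int  refcnt;                 /* one ref for the user, one for the CEP */
    atomic_bool destroyed;              /* set once the user has destroyed the id */
    void      (*callback)(struct sk_cm_id *id);
    int        *freed;                  /* hook for the example: set to 1 on free */
} sk_cm_id;

static int sk_calls;                    /* number of user callbacks delivered */

static void sk_count_cb(sk_cm_id *id) { (void) id; sk_calls++; }

static sk_cm_id *sk_create(void (*cb)(sk_cm_id *), int *freed)
{
    sk_cm_id *id = malloc(sizeof(*id));

    if (id == NULL)
        return NULL;
    atomic_init(&id->refcnt, 2);
    atomic_init(&id->destroyed, false);
    id->callback = cb;
    id->freed = freed;
    return id;
}

/* Drop one reference; whoever drops the last one frees the structure. */
static void sk_put(sk_cm_id *id)
{
    assert(id != NULL);
    if (atomic_fetch_sub(&id->refcnt, 1) == 1) {
        *id->freed = 1;
        free(id);
    }
}

/* CEP handler: skip the user callback once the id is marked destroyed. */
static void sk_handler(sk_cm_id *id)
{
    if (!atomic_load(&id->destroyed))
        id->callback(id);
}

/* User-facing destroy: mark the id, then drop the user's reference. */
static void sk_destroy(sk_cm_id *id)
{
    atomic_store(&id->destroyed, true);
    sk_put(id);
}

/* Stand-in for the CEP manager's destroy callback: drop the CEP's reference. */
static void sk_cep_released(sk_cm_id *id)
{
    sk_put(id);
}
```

The alternative of blocking in destroy until the CEP is released replaces the
second reference with an event wait, at the price of not being callable from
within a callback.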

>>         ExFreePool(p_id);
>> }
>
><snip...>
>
>> static NTSTATUS
>> cm_get_qp_attr(iba_cm_id *p_id, ib_qp_state_t state, ib_qp_mod_t *p_attr)
>> {
>>         ib_api_status_t ib_status;
>>
>>         switch (state) {
>>         case IB_QPS_INIT:
>>                 ib_status = al_cep_get_init_attr(gh_al, p_id->cid, p_attr);
>>                 break;
>>         case IB_QPS_RTR:
>>                 ib_status = al_cep_get_rtr_attr(gh_al, p_id->cid, p_attr);
>
>How will you handle the passive side accepting?  The QP attributes can be
>changed by sending the REP - the CEP manager splits the REP into a pre- and
>send- calls so that the updates to the CEP's QP attributes can happen in the
>pre- call (which returns the attributes), so the client can do the RTR
>transition, and then the REP can be sent by the send- call.  Is the expectation
>here that the client will call get_rtr_attr, make the changes they intend to
>make with the REP manually in the returned QP attribute structure, then call
>iba_cm_rep?

The passive side needs to adjust any QP attributes before calling modify.  If I
recall correctly, the only missing data is the responder resources (for RTR) and
initiator depth (for RTS).  From a user's perspective, the pre calls are
replaced with just setting these two fields.
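
As a rough sketch of what that looks like for the passive side: fetch the
attributes for the target state, overwrite the one field the CM cannot know
yet, then modify the QP.  The sk_ types below are hypothetical and only mimic
the shape I have in mind for ib_qp_mod_t (a requested state plus a per-state
union); the field names resp_res and init_depth are assumptions, and
sk_get_qp_attr stands in for cm_get_qp_attr.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { SK_QPS_RTR, SK_QPS_RTS } sk_qp_state;

/* Hypothetical, cut-down stand-in for ib_qp_mod_t. */
typedef struct {
    sk_qp_state req_state;
    union {
        struct { uint8_t resp_res; }   rtr;     /* responder resources */
        struct { uint8_t init_depth; } rts;     /* initiator depth */
    } state;
} sk_qp_mod;

/* Stand-in for cm_get_qp_attr: fills in what the CM knows and leaves the
 * negotiable field at zero for the caller to set. */
static void sk_get_qp_attr(sk_qp_state state, sk_qp_mod *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->req_state = state;
}

/* Passive side: set the value it intends to advertise in the REP before
 * calling modify on the QP. */
static void sk_passive_adjust(sk_qp_mod *attr, uint8_t resp_res,
                              uint8_t init_depth)
{
    switch (attr->req_state) {
    case SK_QPS_RTR:
        attr->state.rtr.resp_res = resp_res;
        break;
    case SK_QPS_RTS:
        attr->state.rts.init_depth = init_depth;
        break;
    }
}
```

The same two values then go into the REP, so the QP and the reply stay
consistent without a separate pre- call.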

>
>>                 break;
>>         case IB_QPS_RTS:
>>                 ib_status = al_cep_get_rts_attr(gh_al, p_id->cid, p_attr);
>>                 break;
>>         default:
>>                 return STATUS_INVALID_PARAMETER;
>>         }
>>
>>         return convert_ib_status(ib_status);
>> }