RE: [ofw] [RFC] ib cm: export CM only interface

Fab Tillier ftillier at windows.microsoft.com
Tue Nov 18 13:35:26 PST 2008


>>> static void
>>> cm_cep_handler(const ib_al_handle_t h_al, const net32_t cid)
>>> {
>>>         void                            *context;
>>>         net32_t                         new_cid;
>>>         ib_mad_element_t        *mad;
>>>         iba_cm_id                       *id, *listen_id;
>>>
>>>         while (al_cep_poll(h_al, cid, &context, &new_cid, &mad) ==
>>>                IB_SUCCESS) {
>>>
>>>                 if (new_cid == AL_INVALID_CID) {
>>>                         id = (iba_cm_id *) context;
>>>                 } else {
>>>                         listen_id = (iba_cm_id *) context;
>>>
>>>                         id = ExAllocatePoolWithTag(NonPagedPool,
>>>                                 sizeof(iba_cm_id), 'mcbi');
>>>                         if (id == NULL) {
>>>                                 al_destroy_cep(gh_al, &new_cid, FALSE);
>>  Note that all new CEPs should probably be created in the CEP manager
>> with a NULL callback.  Since the CEPs inherit the listen CEPs callback,
>> I think it's possible a callback for a new CEP would be invoked (say a
>> REJ due to timeout) before the new CEP was retrieved.  If the callback
>> pointer was NULL until the REP call, you would be safe.
>
> The callbacks for a CEP should be serialized, or it's extremely
> difficult to recover from an error.

It should be per CEP - the MAD callback is done in the context of the QP1 receive CQ callback at DISPATCH_LEVEL.  You could have multiple callbacks for different CEPs if you have multiple ports active.  I don't remember off the top of my head if the QP1 manager has a CQ per direction, or a single CQ for sends and receives.  If they're separate then you could have multiple callbacks (for different CEPs) simultaneously for a single port.

> Trying to use the callback
> pointer won't work.  If the user sets the callback pointer from within
> a callback, then they will still get a second callback on the same CEP.
> The easiest solution for the REJ case is to just drop the MAD.  If the
> user tries to send a REP, it will just be rejected at that point.
> This situation should not be common in practice anyway.

As long as you're in the callback you won't get another callback for that CEP.  If the callback returns then all bets are off.  That's why the CEP manager was extended to allow IRP driven notifications (no callbacks) so that these kinds of races are eliminated - notifications only happen when the client requests them.

>>>                                 ib_put_mad(mad);
>>>                                 continue;
>>>                         }
>>>
>>>                         id->context = listen_id;
>>>                         id->callback = listen_id->callback;
>>>                         id->cid = new_cid;
>>>                 }
>>>
>>>                 id->callback(id, mad->p_mad_buf);
>>
>> How does someone get the new CM ID before calling iba_cm_rep?
>
> I'm not following you.  'id' here is the new cm_id.

Duh, I'd missed that...

>>  What happens if they need the MAD contents in a different thread
>> context - do they have to allocate/copy?  Why not just hand them the
>> MAD and have them be responsible for freeing it - this lets them store
>> it if they need it while changing to a passive level thread context (if
>> they need it).  I think this would be better, even if you end up with
>> a wrapper for ib_put_mad.
>
> Unless a user needs to store the MAD, I plan on freeing it after the
> callback. This may change once more of the kernel code is written, and I
> know if the MAD needs to be kept by the user.  What I don't want is for
> users to have to queue MADs.

That's a problem with your callback model.  If you let the client call down when ready to process the next MAD you'd be fine.  The CEP manager has to queue MADs already, so this wouldn't require much of a change.

That's precisely the reason the notify/poll model exists in the CEP manager - the notification just tells the client that it should poll.  When and how much to poll is left to the client (which is probably a better model if you want to support user-mode), and the client won't get another callback until the CEP is drained of all pending MADs.  How will you avoid having to queue MADs in your IOCTL handler?  Imagine the case of several connection requests coming in to a single listen.
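
As a rough illustration of the notify/poll pattern described above, here is a user-mode sketch.  It is not the CEP manager's code: the cep_sketch names, the fixed queue depth, and the use of plain ints in place of MADs are all assumptions made for the example, and the locking that a real implementation needs is left out.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 8   /* arbitrary depth chosen for the sketch */

/* Hypothetical stand-in for a CEP that queues received MADs. */
typedef struct cep_sketch {
    int    mads[QUEUE_DEPTH]; /* pending "MADs" (plain ints here) */
    size_t head;              /* index of the oldest pending entry */
    size_t count;             /* number of pending entries */
    int    signaled;          /* client notified and has not drained yet */
    int    notify_count;      /* how many notifications were delivered */
} cep_sketch_t;

/* Receive path: queue the MAD and notify only if the client is not
 * already signaled.  Returns 0 on success, -1 if the queue is full. */
static int cep_sketch_recv(cep_sketch_t *cep, int mad)
{
    assert(cep != NULL);
    if (cep->count == QUEUE_DEPTH)
        return -1;
    cep->mads[(cep->head + cep->count) % QUEUE_DEPTH] = mad;
    cep->count++;
    if (!cep->signaled) {
        cep->signaled = 1;
        cep->notify_count++;  /* stands in for the callback or IRP completion */
    }
    return 0;
}

/* Client path: fetch one pending MAD.  Returns 0 and fills *mad, or -1
 * once the queue is drained, which re-arms the notification. */
static int cep_sketch_poll(cep_sketch_t *cep, int *mad)
{
    assert(cep != NULL && mad != NULL);
    if (cep->count == 0) {
        cep->signaled = 0;
        return -1;
    }
    *mad = cep->mads[cep->head];
    cep->head = (cep->head + 1) % QUEUE_DEPTH;
    cep->count--;
    return 0;
}
```

With this shape, several requests arriving on one listen produce a single notification, and the client decides when to poll and how many entries to pull.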

>> How is your IOCTL interface going to work?  Will it have an event
>> that will give it the MAD too?  Will the user-mode library be
>> callback driven, or event driven?
>
> The user-mode library does not have threads.  There have been some
> changes made to the CM portion of the WinVerbs API, but those have
> dealt with exchanging address information.
>
>> The al_cep_get_pdata function was added so that the private data
>> could be retrieved after a REQ received, but in an entirely different
>> call context.  In the ND case (and I think what you defined for
>> WinVerbs), the client gets an event on their listen object that
>> completes directly to the user (Win32 overlapped operation).  The
>> client then needs to retrieve the information from the received MAD
>> (private data, responder resources, initiator depth) and this was
>> done via al_cep_get_pdata.
>
> WinVerbs defines a Query() routine to get the current endpoint
> attributes.  Only whatever private data was last received is maintained.

The last received private data is already kept in the CEP manager.  Will you be reusing that capability?

>>> static NTSTATUS
>>> cm_create_id(void (*callback)(iba_cm_id *p_id, ib_mad_t *p_mad),
>>>                          void *context, iba_cm_id **pp_id)
>>> {
>>>         iba_cm_id               *id;
>>>         ib_api_status_t ib_status;
>>>
>>>         id = ExAllocatePoolWithTag(NonPagedPool, sizeof(iba_cm_id),
>>>                 'mcbi');
>>>         if (id == NULL) {
>>>                 return STATUS_NO_MEMORY;
>>>         }
>>>
>>>         id->callback = callback;
>>>         id->context = context;
>>>
>>>         ib_status = al_create_cep(gh_al, cm_cep_handler, id, NULL,
>>>                 &id->cid);
>>
>> You'll probably want a destroy callback here, so that you can either
>> block or release a reference on your ID structure when you destroy
>> its underlying CEP.
>
> I want 'no callback' to indicate that the destruction should be
> synchronous.  (I thought the al_obj stuff did this.)  When
> cm_destroy_id returns, no callbacks should be received by the user.
> Handling device removal is difficult without this.  (Heck, it's
> difficult with it.)  There is some synchronization between the
> callback threads and destruction already, just not sure if it's sufficient.

The CEP manager doesn't use the AL object stuff because it doesn't need it: the sync destroy wasn't needed since the QP/Listen already implemented it.  The QPs/Listens end up hanging out until the CEP is destroyed and invokes the destroy callback (which is just deref_al_obj).

>>> static void
>>> cm_destroy_id(iba_cm_id *p_id)
>>> {
>>>         al_destroy_cep(gh_al, &p_id->cid, FALSE);
>>  The al_destroy_cep function does not block, so you could receive a
>> callback after you free the ID.  You need a way to mark the ID freed so
>> that the handler doesn't invoke the callback.  You then need to do
>> reference counting on your ID structures so that they can be freed
>> after the CEP manager is done with them. Alternatively, you can
>> allocate an event/block until the CEP is freed (your destroy callback
>> is invoked).
>
> What is the threading at the MAD level calling back to the CM?  Is
> there a single dispatch thread?  Several?

The callback is invoked in the context of the MAD completion handler.  However it's protected by the CEP's lock (at least the information about whether the CEP has been signaled or not) so you won't ever have two simultaneous callbacks for the same CEP.  You might have simultaneous callbacks for different CEPs.

If you destroy the CEP from the callback you will be safe.  If you destroy it from a different thread context you will need to handle a callback being delivered.
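
To make the reference-counting option concrete, here is a minimal user-mode sketch, assuming one reference held by the user and one held on behalf of the CEP until its destroy callback runs.  The id_sketch names are hypothetical, C11 atomics stand in for the kernel's interlocked operations, and the flag only suppresses callbacks that start after destroy; a callback already in progress on another thread still has to be handled separately.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the ID structure wrapped around a CEP. */
typedef struct id_sketch {
    atomic_int  ref_cnt;    /* user reference + CEP reference */
    atomic_bool destroyed;  /* set once the user has destroyed the ID */
    int         delivered;  /* callbacks actually passed on to the user */
} id_sketch_t;

static id_sketch_t *id_sketch_create(void)
{
    id_sketch_t *id = malloc(sizeof(*id));
    if (id == NULL)
        return NULL;
    atomic_init(&id->ref_cnt, 2);
    atomic_init(&id->destroyed, false);
    id->delivered = 0;
    return id;
}

/* Drop one reference.  Returns true if this was the last reference and
 * the structure has been freed. */
static bool id_sketch_deref(id_sketch_t *id)
{
    assert(id != NULL);
    if (atomic_fetch_sub(&id->ref_cnt, 1) == 1) {
        free(id);
        return true;
    }
    return false;
}

/* User-side destroy: mark the ID dead, then drop the user's reference.
 * The memory stays valid until the CEP side is done with it as well. */
static bool id_sketch_destroy(id_sketch_t *id)
{
    atomic_store(&id->destroyed, true);
    return id_sketch_deref(id);
}

/* Handler the CEP side would invoke: drop events for destroyed IDs. */
static void id_sketch_handler(id_sketch_t *id)
{
    if (atomic_load(&id->destroyed))
        return;
    id->delivered++;        /* stands in for id->callback(id, mad) */
}

/* Destroy callback from the CEP side: release the CEP's reference. */
static bool id_sketch_cep_destroyed(id_sketch_t *id)
{
    return id_sketch_deref(id);
}
```

The alternative mentioned above, blocking on an event until the destroy callback fires, avoids the reference count but means the destroy call has to be able to wait.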

>>>         ExFreePool(p_id);
>>> }
>>
>> <snip...>
>>
>>> static NTSTATUS
>>> cm_get_qp_attr(iba_cm_id *p_id, ib_qp_state_t state, ib_qp_mod_t
>>> *p_attr)
>>> {
>>>         ib_api_status_t ib_status;
>>>
>>>         switch (state) {
>>>         case IB_QPS_INIT:
>>>                 ib_status = al_cep_get_init_attr(gh_al, p_id->cid,
>>>                         p_attr);
>>>                 break;
>>>         case IB_QPS_RTR:
>>>                 ib_status = al_cep_get_rtr_attr(gh_al, p_id->cid,
>>>                         p_attr);
>>
>> How will you handle the passive side accepting?  The QP attributes
>> can be changed by sending the REP - the CEP manager splits the REP
>> into pre- and send- calls so that the updates to the CEP's QP
>> attributes can happen in the pre- call (which returns the
>> attributes), so the client can do the RTR transition, and then the
>> REP can be sent by the send- call.  Is the expectation here that the
>> client will call get_rtr_attr, make the changes they intend to make
>> with the REP manually in the returned QP attribute structure, then
>> call iba_cm_rep?
>
>  The passive side needs to adjust any QP attributes before calling
> modify.  If I recall correctly, the only missing data is the responder
> resources (for RTR) and initiator depth (for RTS).  From a user's
> perspective, the pre calls are replaced with just setting these two
> fields.

So you lose the automatic capping of the initiator depth to the CA's capabilities.  This means that the user will need to query the CA and adjust.  Or were you planning on capping still and just modifying the received REQ MAD?
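
For reference, the capping in question amounts to something like the sketch below.  The structure and field names are made up for the example (they are not the AL or CEP manager definitions), and treating responder resources the same way as initiator depth is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CA limits; real values would come from a CA query. */
typedef struct ca_limits_sketch {
    uint8_t max_resp_res;    /* assumed limit on responder resources */
    uint8_t max_init_depth;  /* assumed limit on initiator depth */
} ca_limits_sketch_t;

/* Hypothetical subset of the values the passive side sets before
 * modifying the QP and sending the REP. */
typedef struct conn_params_sketch {
    uint8_t resp_res;
    uint8_t init_depth;
} conn_params_sketch_t;

static uint8_t min_u8(uint8_t a, uint8_t b)
{
    return a < b ? a : b;
}

/* Clamp the requested values to what the CA can support. */
static void cap_conn_params(conn_params_sketch_t *p,
                            const ca_limits_sketch_t *ca)
{
    assert(p != NULL && ca != NULL);
    p->resp_res = min_u8(p->resp_res, ca->max_resp_res);
    p->init_depth = min_u8(p->init_depth, ca->max_init_depth);
}
```

Without the pre- call, a step like this would fall to the user after querying the CA.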

>>>                 break;
>>>         case IB_QPS_RTS:
>>>                 ib_status = al_cep_get_rts_attr(gh_al, p_id->cid,
>>>                         p_attr);
>>>                 break;
>>>         default:
>>>                 return STATUS_INVALID_PARAMETER;
>>>         }
>>>
>>>         return convert_ib_status(ib_status);
>>> }


