[ofa-general] RE: [RFC v2 PATCH 3/5] rdma/cma: add high availability mode attribute to IDs

Mon May 19 14:24:49 PDT 2008

>Sean, please let me know your preference (as it was somehow unclear from
>the thread) if you want the delivery of this event to be dependent on
>the ulp asking for it or no.

I spent most of the morning looking at this, and until I know what the
trade-offs really are in the implementation, I can't say that I have a strong
preference for how to deal with any of this.  My main concerns are:

* All callbacks from the rdma_cm are serialized
* We minimize the overhead of reporting events
* We don't lose events
* If the user returns a non-zero value from a callback, the rdma_cm_id is
  destroyed, and no further callbacks are invoked.

and in concept I prefer to:

* Always report the event and let ULPs ignore it
* Let someone come up with a fantastically simple way of reporting new events

The existing rdma_cm callbacks are naturally serialized with each other.
(Callback for connect after resolve route after resolve address...)  This allows
using the stack for event structures, but the cost is complex synchronization
with device removal.  Supporting additional events while meeting the concerns
listed above will be equally challenging.  So if we can simplify device removal
handling, then supporting similar types of events should be easier as well.

If we can guarantee that this works, one option is to acquire a mutex before
invoking a callback on an rdma_cm_id.  I hesitate to hold any locks while in a
callback, since it restricts what the user can do, but if the mutex is only used
to synchronize calling the user back, it may work, since the rdma_cm never
invokes a callback from a downcall.  This should simplify the device removal
handling, eliminating wait_remove and dev_remove from the rdma_cm_id.
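
A minimal userspace sketch of the mutex idea, for illustration only.  It uses
pthreads in place of the kernel mutex, and every name in it (demo_id,
demo_report_event, handler_mutex, ...) is hypothetical and not part of the
rdma_cm.  The mutex is held only while calling the user back, and a non-zero
return marks the id destroyed so that no further callbacks are made.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's rdma_cm structures. */
struct demo_event {
    int type;
};

struct demo_id;
typedef int (*demo_handler)(struct demo_id *id, const struct demo_event *ev);

struct demo_id {
    pthread_mutex_t handler_mutex; /* held only while calling the user back */
    demo_handler handler;
    int destroyed;                 /* set once a handler returns non-zero */
    int delivered;                 /* number of callbacks invoked so far */
};

static void demo_id_init(struct demo_id *id, demo_handler handler)
{
    assert(handler != NULL);
    pthread_mutex_init(&id->handler_mutex, NULL);
    id->handler = handler;
    id->destroyed = 0;
    id->delivered = 0;
}

/*
 * Report one event.  Returns 1 if the callback ran, 0 if the id had
 * already been destroyed by an earlier non-zero return.
 */
static int demo_report_event(struct demo_id *id, const struct demo_event *ev)
{
    int invoked = 0;

    pthread_mutex_lock(&id->handler_mutex);
    if (!id->destroyed) {
        invoked = 1;
        id->delivered++;
        if (id->handler(id, ev) != 0)
            id->destroyed = 1;     /* no further callbacks after this */
    }
    pthread_mutex_unlock(&id->handler_mutex);
    return invoked;
}

/* Example handler: asks for destruction when it sees event type 2. */
static int demo_example_handler(struct demo_id *id, const struct demo_event *ev)
{
    (void)id;
    return ev->type == 2 ? -1 : 0;
}
```

Because the mutex here is not recursive, this sketch only works if reporting
an event is never triggered from inside a callback, which matches the
assumption above that callbacks are never invoked from a downcall.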

Alternatively, the ib_cm serializes callbacks using different logic (see
cm_process_work() and use of work_count/work_list).  I've been looking at what
it would take to use the ib_cm event logic in the rdma_cm.  The trick is to
minimize the event reporting overhead without losing any events (and minimizing
the overhead may require registering for events...).
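
To make the queue-based alternative concrete, here is a simplified userspace
sketch of serializing callbacks with a per-id work list and no lock held
during the callback.  It is not the ib_cm code: the "busy" flag, the wl_*
names, and the pthread mutex are all assumptions made for illustration.  An
event that arrives while a callback is running is queued and delivered by the
thread that is already delivering.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct wl_id;

struct wl_work {
    struct wl_work *next;
    int type;
};

typedef int (*wl_handler)(struct wl_id *id, struct wl_work *work);

struct wl_id {
    pthread_mutex_t lock;        /* protects the list and flags, never held in a callback */
    struct wl_work *head, *tail; /* pending events, oldest first */
    int busy;                    /* non-zero while some thread is delivering */
    int destroyed;               /* set once a handler returns non-zero */
    wl_handler handler;
    int depth, max_depth, delivered; /* instrumentation for the example only */
};

static void wl_id_init(struct wl_id *id, wl_handler handler)
{
    assert(handler != NULL);
    pthread_mutex_init(&id->lock, NULL);
    id->head = id->tail = NULL;
    id->busy = id->destroyed = 0;
    id->handler = handler;
    id->depth = id->max_depth = id->delivered = 0;
}

/* Queue an event; deliver it here unless another delivery is in progress. */
static void wl_queue_event(struct wl_id *id, struct wl_work *work)
{
    pthread_mutex_lock(&id->lock);
    if (id->destroyed) {
        pthread_mutex_unlock(&id->lock);
        return;
    }
    work->next = NULL;
    if (id->tail)
        id->tail->next = work;
    else
        id->head = work;
    id->tail = work;
    if (id->busy) {              /* the current deliverer will pick it up */
        pthread_mutex_unlock(&id->lock);
        return;
    }
    id->busy = 1;
    while (id->head && !id->destroyed) {
        struct wl_work *w = id->head;
        int ret;

        id->head = w->next;
        if (!id->head)
            id->tail = NULL;
        pthread_mutex_unlock(&id->lock);
        ret = id->handler(id, w); /* callback runs with no lock held */
        pthread_mutex_lock(&id->lock);
        if (ret)
            id->destroyed = 1;
    }
    if (id->destroyed)
        id->head = id->tail = NULL; /* drop anything still pending */
    id->busy = 0;
    pthread_mutex_unlock(&id->lock);
}

static struct wl_work wl_followup = { NULL, 2 };

/* Example handler: event type 1 generates a second event from the callback. */
static int wl_example_handler(struct wl_id *id, struct wl_work *work)
{
    id->depth++;
    if (id->depth > id->max_depth)
        id->max_depth = id->depth;
    id->delivered++;
    if (work->type == 1)
        wl_queue_event(id, &wl_followup); /* queued, not delivered nested */
    id->depth--;
    return 0;
}
```

The example handler queues a second event from inside the first callback to
show that delivery is deferred until the first callback returns, so handlers
never nest.  The cost compared with a plain mutex is the extra list and flag
bookkeeping on every event.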

What I've been exploring is adding an event_list to the rdma_cm_id.  Whenever
the user performs an asynchronous operation, one or more event structures are
allocated and placed on the event_list.  When an asynchronous operation
completes, the event
structure is removed from this list, placed on a work_list, and a call like
cma_process_work() is invoked.  Note that some operations (e.g. connect) result
in multiple callbacks to the rdma_cm (connect and disconnect).  And the more I
consider this option, the more appealing just holding a mutex around the
callbacks becomes.

- Sean