[openib-general] Async event handlers: per consumer or per QP/CQ?

Thu Sep 2 19:28:48 PDT 2004

I'm starting to work on implementing a new async event handling scheme
to match our new API, and I'm starting to think that our initial
design is somewhat suboptimal.

To recap, the plan was to put the async handler in struct ib_device
and give each consumer a different copy of each device's struct
ib_device.  However, as I think about how to actually implement this,
it starts to look like not such a good idea.  If we want to give each
consumer a copy of struct ib_device and still minimize the amount of
pointer chasing that fast path functions do, it seems like this copy
has to have all of the low-level driver's private state copied as well.

If we do this, we now have the problem of coherency between different
copies, etc.  I'm sure these problems could be solved, but it seems
like it will make things much more complicated than they need to be.

I would propose putting a list of async handlers in struct ib_device
for unaffiliated async events and putting an async handler function
pointer in each QP/CQ struct.  The argument against this was that it
adds overhead to have all these duplicated function pointers.  Right
now (if I remove the refcnt and wait members, based on my plan to
implement a better locking scheme in mthca), the sizes are:

			32-bit		64-bit
struct mthca_cq		76 bytes	104 bytes
struct mthca_qp		156 bytes	224 bytes

Add in the fact that every CQ and QP will have at least a page of
memory dedicated to the actual queues, and the overhead of one more
pointer (4 or 8 bytes depending on architecture) seems like it's lost
in the noise.
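
To make the proposal concrete, here is a rough user-space sketch of
the layout I have in mind.  Every name in it (sketch_device,
sketch_qp, sketch_event and so on) is made up for illustration and is
not the real API, and locking is left out entirely.  The point is
only that there is one shared device struct holding a list of
handlers for unaffiliated events, while each QP (a CQ would look the
same) carries its own handler pointer for affiliated events:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event record; the real event type is not specified here. */
struct sketch_event {
	int type;
};

typedef void (*sketch_event_fn)(const struct sketch_event *event,
				void *context);

/* One entry in the device's list of unaffiliated-event handlers. */
struct sketch_handler {
	sketch_event_fn fn;
	void *context;
	struct sketch_handler *next;
};

/* Stand-in for struct ib_device: one shared copy, no per-consumer clone. */
struct sketch_device {
	struct sketch_handler *handlers;
};

/* Stand-in for a QP or CQ: the only added cost is one function pointer. */
struct sketch_qp {
	struct sketch_device *device;
	sketch_event_fn event_handler;
	void *context;
};

/* Example handler: counts events through an int passed as context. */
void sketch_count_event(const struct sketch_event *event, void *context)
{
	(void) event;
	++*(int *) context;
}

/* Push a consumer's handler onto the front of the device's list. */
void sketch_register_handler(struct sketch_device *dev,
			     struct sketch_handler *h)
{
	assert(dev != NULL && h != NULL && h->fn != NULL);
	h->next = dev->handlers;
	dev->handlers = h;
}

/* Unaffiliated event: call every registered handler, return the count. */
int sketch_dispatch_unaffiliated(struct sketch_device *dev,
				 const struct sketch_event *event)
{
	int delivered = 0;

	for (struct sketch_handler *h = dev->handlers; h != NULL; h = h->next) {
		h->fn(event, h->context);
		++delivered;
	}
	return delivered;
}

/* Affiliated event: only the owning QP's handler runs (1 if called). */
int sketch_dispatch_qp(struct sketch_qp *qp, const struct sketch_event *event)
{
	if (qp->event_handler == NULL)
		return 0;
	qp->event_handler(event, qp->context);
	return 1;
}
```

With this layout an affiliated event is a single indirect call
through the QP or CQ, and nothing in the low-level driver's private
state ever needs to be duplicated per consumer.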

In fact, since right now mthca is just using kmalloc() to allocate
these structures, the sizes are getting rounded up to a power of 2
anyway, so adding another pointer member really will have zero impact
on our memory usage.  If/when we switch to having separate slab
caches, the worst effect would be dropping from 39 mthca_cqs per 4K
page down to 36 mthca_cqs on a 64-bit arch.
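
For anyone who wants to check the arithmetic, a small sketch follows.
It assumes pure power-of-two size classes for kmalloc() as described
above, a 4096-byte page, and a slab cache that packs objects back to
back with no per-object or per-slab overhead (a real slab cache will
differ somewhat).  The helper names are made up for this example.

```c
#include <stddef.h>

/* Round a size up to the next power of two (assumed kmalloc behavior). */
size_t round_up_pow2(size_t size)
{
	size_t rounded = 1;

	while (rounded < size)
		rounded <<= 1;
	return rounded;
}

/* Whole objects that fit in one page when packed with no overhead. */
size_t objects_per_page(size_t page_size, size_t object_size)
{
	return page_size / object_size;
}
```

On 64-bit, round_up_pow2() gives 128 for both 104 and 112 bytes, so
the kmalloc() footprint is unchanged, while objects_per_page(4096, 104)
is 39 and objects_per_page(4096, 112) is 36.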

In my mind the big simplification of the code far outweighs the slight
additional memory usage.  What are other people's thoughts?

Thanks,
  Roland