[Openib-windows] [RFC v2] Kernel Async Verbs
Fab Tillier
ftillier at silverstorm.com
Mon Nov 21 10:11:34 PST 2005
Hi Tzachi,
I'll code up a test of IRP processing time for kernel clients after the address
translation work is done. What I'll do is add code in the proxy to generate a
given number of IRPs for a no-op request and see how long it takes. I expect
the kernel-to-kernel calls to be shorter than user-mode-to-kernel calls, since
the I/O manager copies the input and output user buffers for user-mode calls
but doesn't do any such mapping for kernel-to-kernel calls. If the buffers in
the fast I/O dispatch calls are indeed raw buffers, as you suspect, that is
likely where a good portion of the time is saved. In any case, comparing
against a code path that neither maps buffers nor supports asynchronous use
isn't really useful.
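For what it's worth, the amount of buffering the I/O manager does for a
user-mode request is set by the IOCTL's transfer type. Something like the
following is what I'd use for the no-op request (IOCTL_NOOP is just a
placeholder, not a code in the actual proxy):

#include <devioctl.h>

/* Hypothetical no-op control code.  With METHOD_BUFFERED the I/O manager
 * allocates a system buffer and copies the caller's input/output buffers
 * for every user-mode DeviceIoControl; a kernel client calling through a
 * direct interface hands its buffers over as-is and skips that work. */
#define IOCTL_NOOP \
    CTL_CODE( FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS )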
The benchmark code I plan will make only a single IoCallDriver call and a
single IoCompleteRequest call per request. It will use an I/O completion
callback and allocate a fresh IRP for each iteration. The IRP will be queued in
a list, and a DPC will be queued to complete it. I'll then compare that with a
direct-call interface where the request is queued and again completed via a
DPC. This is the model closest to how the HCA driver will behave, and it should
give us the best comparison. Let me know if you think there is a flaw in my
plan.
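Roughly, the IRP leg of the test would look like the sketch below. Everything
here is placeholder code (NoopDispatch, NoopDpc, RunIrpBenchmark, the globals),
not the actual proxy or HCA driver, and error handling is stripped for brevity:

#include <ntddk.h>

/* Placeholder queue, lock, and DPC standing in for the target device. */
static LIST_ENTRY g_PendingIrps;
static KSPIN_LOCK g_IrpLock;
static KDPC       g_CompletionDpc;

/* DPC on the target side: complete every queued IRP, one
 * IoCompleteRequest per request. */
static VOID NoopDpc( PKDPC pDpc, PVOID pCtx, PVOID pArg1, PVOID pArg2 )
{
    PLIST_ENTRY pEntry;

    while( (pEntry = ExInterlockedRemoveHeadList(
        &g_PendingIrps, &g_IrpLock )) != NULL )
    {
        PIRP pIrp = CONTAINING_RECORD( pEntry, IRP, Tail.Overlay.ListEntry );
        pIrp->IoStatus.Status = STATUS_SUCCESS;
        pIrp->IoStatus.Information = 0;
        IoCompleteRequest( pIrp, IO_NO_INCREMENT );
    }
}

/* Target driver's IRP_MJ_INTERNAL_DEVICE_CONTROL dispatch: queue the IRP
 * and defer completion to the DPC, mimicking how the HCA driver behaves. */
static NTSTATUS NoopDispatch( PDEVICE_OBJECT pDevObj, PIRP pIrp )
{
    IoMarkIrpPending( pIrp );
    ExInterlockedInsertTailList(
        &g_PendingIrps, &pIrp->Tail.Overlay.ListEntry, &g_IrpLock );
    KeInsertQueueDpc( &g_CompletionDpc, NULL, NULL );
    return STATUS_PENDING;
}

/* One-time setup of the placeholder queue, lock, and DPC. */
static VOID NoopInit( VOID )
{
    InitializeListHead( &g_PendingIrps );
    KeInitializeSpinLock( &g_IrpLock );
    KeInitializeDpc( &g_CompletionDpc, NoopDpc, NULL );
}

/* Caller-side I/O completion callback: free the IRP we allocated and
 * wake the benchmark loop. */
static NTSTATUS NoopComplete( PDEVICE_OBJECT pDevObj, PIRP pIrp, PVOID pCtx )
{
    IoFreeIrp( pIrp );
    KeSetEvent( (PKEVENT)pCtx, IO_NO_INCREMENT, FALSE );
    return STATUS_MORE_PROCESSING_REQUIRED;
}

/* Benchmark loop: one IoCallDriver and one IoCompleteRequest per pass. */
static VOID RunIrpBenchmark( PDEVICE_OBJECT pTargetDev, ULONG nIter )
{
    KEVENT        doneEvent;
    LARGE_INTEGER start, end, freq;
    ULONG         i;

    KeInitializeEvent( &doneEvent, NotificationEvent, FALSE );
    start = KeQueryPerformanceCounter( &freq );

    for( i = 0; i < nIter; i++ )
    {
        PIRP pIrp = IoAllocateIrp( pTargetDev->StackSize, FALSE );
        IoGetNextIrpStackLocation( pIrp )->MajorFunction =
            IRP_MJ_INTERNAL_DEVICE_CONTROL;
        IoSetCompletionRoutine( pIrp, NoopComplete, &doneEvent,
            TRUE, TRUE, TRUE );
        IoCallDriver( pTargetDev, pIrp );
        KeWaitForSingleObject( &doneEvent, Executive, KernelMode,
            FALSE, NULL );
        KeClearEvent( &doneEvent );
    }

    end = KeQueryPerformanceCounter( NULL );
    DbgPrint( "IRP no-op: %I64d ticks for %u iterations (%I64d ticks/sec)\n",
        end.QuadPart - start.QuadPart, nIter, freq.QuadPart );
}

The direct-call leg would keep the same queue-and-DPC completion path but
replace the IoAllocateIrp/IoCallDriver/IoCompleteRequest trio with a plain
function call and callback, so the only difference measured is the IRP
machinery itself.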
There are some calls in the interface I defined that need to keep the IRP as
an input parameter - all the event notification requests. These were put in
specifically for user-mode, so that a user-mode client can queue notification
requests. Kernel clients would likely just use the direct callback mechanisms,
though they would be free to use the IRP mechanism if they want. These
functions are:
GetCaFatalEvent
GetCaPortEvent
GetCqCompEvent
GetCqAsyncEvent
GetQpAsyncEvent
Since these user-mode calls come down as IRPs (they're processed
asynchronously), it's simpler to have them queue up within the HCA driver than
to have some middleman synchronize callbacks from the HCA driver with requests
from the application.
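On the user-mode side, queuing one of these notification requests is just an
overlapped DeviceIoControl against the proxy device. The control code and
input layout below are placeholders, not the actual proxy IOCTLs:

#include <windows.h>
#include <winioctl.h>

/* Placeholder control code standing in for the proxy's GetCqCompEvent
 * request. */
#define IOCTL_GET_CQ_COMP_EVENT \
    CTL_CODE( FILE_DEVICE_UNKNOWN, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS )

/* Queue a CQ completion event notification.  The call returns immediately
 * with ERROR_IO_PENDING; the IRP it generates stays queued in the HCA
 * driver until the event fires and the overlapped I/O completes. */
BOOL QueueCqNotification( HANDLE hProxy, UINT64 hCq, OVERLAPPED *pOv )
{
    DWORD bytes;

    if( !DeviceIoControl( hProxy, IOCTL_GET_CQ_COMP_EVENT,
            &hCq, sizeof(hCq), NULL, 0, &bytes, pOv ) &&
        GetLastError() != ERROR_IO_PENDING )
    {
        return FALSE;
    }
    return TRUE;
}

The client would open the proxy device with FILE_FLAG_OVERLAPPED and pick up
the completion with GetOverlappedResult or an I/O completion port.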
One of the main reasons I chose IRPs for completion notifications (aside from
similarities with the WSK API) was to provide common handling for kernel and
user-mode requests. The premise was that using IRPs would perform no worse than
the existing blocking implementation, and that the calls involved are not
speed-path calls.
I'm open to looking at alternatives, but want to be careful to not complicate
the implementation to gain a few microseconds in code paths that don't matter.
Having information about command interface latencies would be great. The
Programmer's Reference Manual for the MT25208 chip states that command latency
depends on the command class: A (up to 1 ms), B (up to 10 ms), C (up to
100 ms), and D (beyond 1000 ms). Most commands look to be in class A or B.
If we're looking at command latencies measured in milliseconds, I don't think
we should spend too much time trying to squeeze out a few extra microseconds.
It would be great if you could get and share more detailed command latency
information showing that a few microseconds would make a difference.
Lastly, remember that speed path operations, since they all complete
immediately, don't use IRPs at all.
- Fab