[Openib-windows] [RFC v2] Kernel Async Verbs

Mon Nov 7 10:35:10 PST 2005

> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
> Sent: Monday, November 07, 2005 12:59 AM
> 
> As far as I remember we have agreed that the APIs shouldn't include IRPs, but
> a different completion mechanism.

No, what was discussed was not having the API called via IoCallDriver and not
using IOCTL buffers for the parameters, but instead having a direct-call
interface.  Completion notifications have always been designed to use IRPs.
There's no reason to implement a new and different completion mechanism when
using IRPs will meet every requirement while at the same time simplifying
user-mode support.

> As a result I'm quite confused to see that all your API's do contain IRP's.
> Please remember that using an IRP on each function will:
> 1) Make our code much more complicated.

How does it make the code more complicated?  When a command completes, the
driver calls IoCompleteRequest.  What's so complicated about this?  The IRP is
just used as a mechanism for completing requests.

> 2) Make a considerable performance hit to all our API's.

How so?  The IRPs aren't passed down the driver stack through IoCallDriver.  The
only requirement is for the HCA driver to call IoCallDriver internally so that
it can subsequently call IoCompleteRequest.  Do you really think the call to
IoCallDriver and IoCompleteRequest is going to be slower than putting a thread
into a wait state and then waking it up like the current code does?

The driver would keep a pointer to the IRP to complete when it issues a command.
The driver would call IoCallDriver on itself and queue the IRP.  When the
command completes, the driver dequeues the IRP and calls IoCompleteRequest.

The only requirement on clients is that the IRP have at least one IRP stack
location for the HCA driver to use.  The only requirement on the HCA driver is
to restore any fields in the IRP that were changed to enable queuing.
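
To make the queue-then-complete flow concrete, here is a minimal user-mode
model in C.  It is only a sketch under stated assumptions, not WDK code: the
mock_irp and mock_hca types and the mock_hca_* functions are hypothetical
stand-ins, commands are assumed to complete in the order they were issued, and
there is no locking (a real driver would need a spin lock around the queue).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; these are NOT the WDK definitions. */
typedef long mock_status;
#define MOCK_STATUS_SUCCESS 0L
#define MOCK_STATUS_PENDING 0x103L

typedef struct mock_irp {
    struct mock_irp *next;        /* queue linkage, borrowed by the driver while pending */
    mock_status      status;      /* final status, filled in at completion */
    size_t           information; /* command-specific result, filled in at completion */
    int              completed;   /* nonzero once the driver has completed the request */
    void           (*on_complete)(struct mock_irp *irp); /* client's completion routine, may be NULL */
} mock_irp;

typedef struct mock_hca {
    mock_irp *head; /* oldest pending request */
    mock_irp *tail; /* newest pending request */
} mock_hca;

/* Issue a command: remember the IRP on the driver's queue and report "pending".
 * In a real driver this is where the IRP would be marked pending and queued. */
mock_status mock_hca_issue(mock_hca *hca, mock_irp *irp)
{
    irp->next = NULL;
    irp->completed = 0;
    irp->status = MOCK_STATUS_PENDING;
    if (hca->tail != NULL)
        hca->tail->next = irp;
    else
        hca->head = irp;
    hca->tail = irp;
    return MOCK_STATUS_PENDING;
}

/* A command finished: dequeue the oldest IRP, restore the field used for
 * queuing, and complete it.  Returns the completed IRP, or NULL if the queue
 * was empty.  In a real driver the last step would be IoCompleteRequest. */
mock_irp *mock_hca_command_done(mock_hca *hca, mock_status status, size_t info)
{
    mock_irp *irp = hca->head;
    if (irp == NULL)
        return NULL;
    hca->head = irp->next;
    if (hca->head == NULL)
        hca->tail = NULL;
    irp->next = NULL;            /* restore what was changed to enable queuing */
    irp->status = status;
    irp->information = info;
    irp->completed = 1;
    if (irp->on_complete != NULL)
        irp->on_complete(irp);
    return irp;
}
```

The point of the model is that the client only supplies the request object and
the driver only has to give back the linkage it borrowed; nothing else about
the request changes hands.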

Please explain and quantify the performance hit because I don't see it.

> 3) Will make our code very different from all other Windows software that I
> know of. As an example, look at the ZwWriteFile function. Although this
> function can currently only be called from passive level, its API is prepared
> for async transfer, but still doesn't use IRPs

Take a look at WskSocket, WskSend, etc. in the Windows Sockets in Kernel section
of the Windows Driver Kit documentation.  I've mentioned before that our API is
designed similarly.  The WSK functions were designed to solve a very similar
problem to what we have today - allow dispatch-level socket usage.  What our API
does is allow dispatch-level IB usage using similar mechanisms.

> 4) Will make it impossible to share the same header files from user mode and
> kernel mode (the way that IRP's look from user mode is by an overlapped
> struct). By the way, taking into account the differences between user mode and
> kernel mode, it is not clear if we want to do this in any way.

Sharing headers is not a goal - I want the API to be optimized for the
environment in which it is to run.  We need to move away from sharing headers
between kernel and user-mode to allow using the facilities provided by the OS.
I don't expect that we will want or need to support async verbs in user-mode.  I
do want to change how affiliated, unaffiliated, and completion notifications are
delivered though so that the user can select the completion method that fits
their model best - synchronous, async via GetOverlappedResult, I/O completion
ports, or APCs.  I touched on some of these items in my presentation in August
about the future of user-mode.

My ultimate goal for user-mode is to eliminate all threads from the access
layer, and have an API that enables a purely single threaded application to work
properly.  We need to move away from having our own completion notification
mechanisms as they introduce a lot of complexity to the access layer, both in
user and kernel mode.

I also want to move away in general from duplicating functionality already
provided by the OS.  If the OS provides a mechanism to get something done, I
want to leverage that so that the code I have to maintain and update is reduced.

> 5) IRPs are harder to debug. Since IRPs are maintained in the operating
> system opaque structures, finding where an IRP was lost, is somewhat harder.

I don't buy this one bit.  IRPs are the primary mechanism for user-to-kernel
interaction as well as driver-to-driver communication.  We already have to
debug these, and there are commands in the debugger (!irp, for example) as well
as checks in Driver Verifier that support this.

> I believe that since in any case, we will be using wrapper functions around
> the IRPs based functions, we should probably write the wrapper function first.

Why would we use wrapper functions?  This is a direct call interface, not an
IOCTL driven interface.  I would expect kernel clients to call this interface
directly, especially the proxy for user-mode support.  The end goal is to have
the application's IRP used as the IRP for the API completion, but there are a
few steps to take before we get there.

> In any case, if what I wrote isn't enough, I believe that the best way to
> continue this discussion is to try and take one of these functions, implement
> it in some trivial way and see the results. Once we do this in both ways, we
> can look at the code that is using the functions, measure the performance and
> see how we want to continue from here.

I plan on taking all of these functions and implementing them.  Once the drivers
are structured to handle verbs issued at dispatch, changing the API from taking
an IRP to taking a callback will be straightforward.  If, during implementation,
the IRP completion model proves to be unwieldy, it will change.

Aside from not liking the IRP completion model, are there any issues you see
with the API definitions?  Imagine replacing the IRP input parameter with a
callback and a callback context.  Think of the other ramifications changing the
completion notification mechanism has on the client.  Does the end-to-end
processing of a command become that much simpler?

Here are things off the top of my head that would have to change when removing
the IRP completion mechanism:
- APIs need to take a flag indicating a user-mode vs. kernel-mode call.
- APIs need to take a completion routine like:
	typedef void (*VERB_CB)(
		DEVICE_OBJECT *pDevObj,
		void *Context,
		NTSTATUS Status,
		ULONG_PTR Information );
- APIs need to take a device object and context value to pass to the callback.
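
As a rough illustration of what the three changes listed above add up to, here
is a user-mode model in C.  It is a sketch only: NTSTATUS, ULONG_PTR, and
DEVICE_OBJECT are redefined here as stand-ins so the code compiles outside the
WDK, and mock_verb_request, mock_verb_issue, mock_verb_complete, mock_result,
and mock_record_result are hypothetical names, not part of the proposed API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch builds in user mode; NOT the WDK definitions. */
typedef long      NTSTATUS;
typedef uintptr_t ULONG_PTR;
typedef struct DEVICE_OBJECT { int unused; } DEVICE_OBJECT;

/* Completion routine type, as in the list above. */
typedef void (*VERB_CB)(
    DEVICE_OBJECT *pDevObj,
    void *Context,
    NTSTATUS Status,
    ULONG_PTR Information );

/* Per-command state the driver has to carry once there is no IRP to hold it. */
typedef struct mock_verb_request {
    int            from_user_mode; /* flag: user-mode vs. kernel-mode call */
    VERB_CB        callback;       /* completion routine */
    DEVICE_OBJECT *dev_obj;        /* device object handed back to the callback */
    void          *context;        /* context value handed back to the callback */
    int            pending;        /* nonzero while the command is outstanding */
} mock_verb_request;

/* Issue a verb: record everything needed to notify the caller later. */
void mock_verb_issue(mock_verb_request *req, int from_user_mode,
                     VERB_CB cb, DEVICE_OBJECT *dev, void *ctx)
{
    req->from_user_mode = from_user_mode;
    req->callback = cb;
    req->dev_obj = dev;
    req->context = ctx;
    req->pending = 1;
}

/* The command finished: clear the pending state and invoke the callback. */
void mock_verb_complete(mock_verb_request *req, NTSTATUS status, ULONG_PTR info)
{
    req->pending = 0;
    req->callback(req->dev_obj, req->context, status, info);
}

/* Example client side: a context structure and a callback that fills it in. */
typedef struct mock_result {
    int            called;
    DEVICE_OBJECT *dev_obj;
    NTSTATUS       status;
    ULONG_PTR      information;
} mock_result;

void mock_record_result(DEVICE_OBJECT *pDevObj, void *Context,
                        NTSTATUS Status, ULONG_PTR Information)
{
    mock_result *res = (mock_result *)Context;
    res->called = 1;
    res->dev_obj = pDevObj;
    res->status = Status;
    res->information = Information;
}
```

Note that the driver still has to store per-command state somewhere until the
command completes; the callback model moves that state out of the IRP rather
than eliminating it.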

I don't know if making the above changes really buys us much.  I still think
that the IRP-based calls for getting unaffiliated, affiliated, and completion
events simplify the code base as a whole - they remove the need for the proxy
to queue these IRPs on behalf of the HCA driver.  One way or another, the IRPs
must be queued.  Why not put the queuing in the HCA driver and remove it from
the proxy?

Thanks,

- Fab