[Openib-windows] [RFC v2] Kernel Async Verbs

Tzachi Dar tzachid at mellanox.co.il
Fri Nov 18 05:50:11 PST 2005


Hi Fab,

In order to see the impact of using IRPs versus not using them, I ran a
simple experiment: I measured how long it takes for a call to pass from
user mode to kernel mode, once using an IRP and once without an IRP
(using FastIoDispatch).
Here are the results on a PCI-Express (32-bit) ~3.2 GHz machine:
With IRPs: 321926 operations per second (about 3.1 us per operation).
Without IRPs (FastIoDispatch): 1422475 operations per second (about 0.7 us per operation).
As you can see, each IRP carries a very high overhead (~350%) relative
to the cost of the user-to-kernel transition itself. For a
kernel-to-kernel call (a plain function call) the relative overhead
would be much higher still, since there is no user/kernel transition to
amortize it against.
Please also note that in my experiment I completed the IRPs
immediately. In a real-life scenario I would have to mark the IRP
pending first and complete it later, so the overhead would be even
higher.

As a result, I strongly suggest using a callback function and a context
instead of IRPs.

Do these things sound logical? Do you want to try and repro the
experiment on your machines?

Thanks
Tzachi

>-----Original Message-----
>From: Fab Tillier [mailto:ftillier at silverstorm.com]
>Sent: Monday, November 07, 2005 8:35 PM
>To: 'Tzachi Dar'; openib-windows at openib.org
>Subject: RE: [Openib-windows] [RFC v2] Kernel Async Verbs
>
>> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
>> Sent: Monday, November 07, 2005 12:59 AM
>>
>> As far as I remember we have agreed that the API's shouldn't include
>> IRP's, but a different completion mechanism.
>
>No, what was discussed was not having the API called via IoCallDriver
>and not using IOCTL buffers for the parameters, but instead having a
>direct-call interface.  Completion notifications have always been
>designed to use IRPs.  There's no reason to implement a new and
>different completion mechanism when using IRPs will meet every
>requirement while at the same time simplifying user-mode support.
>
>> As a result I'm quite confused to see that all your API's do contain
>> IRP's.
>> Please remember that using an IRP on each function will:
>> 1) Make our code much more complicated.
>
>How does it make the code more complicated?  When a command completes,
>the driver calls IoCompleteRequest.  What's so complicated about this?
>The IRP is just used as a mechanism for completing requests.
>
>> 2) Make a considerable performance hit to all our API's.
>
>How so?  The IRPs aren't passed down the driver stack through
>IoCallDriver.  The only requirement is for the HCA driver to call
>IoCallDriver internally so that it can subsequently call
>IoCompleteRequest.  Do you really think the call to IoCallDriver and
>IoCompleteRequest is going to be slower than putting a thread into a
>wait state and then waking it up like the current code does?
>
>The driver would keep a pointer to the IRP to complete when it issues a
>command.  The driver would call IoCallDriver on itself and queue the
>IRP.  When the command completes, the driver dequeues the IRP and calls
>IoCompleteRequest.
>
>The only requirement on clients is that the IRP have at least one IRP
>stack location for the HCA driver to use.  The only requirement on the
>HCA driver is to restore any fields in the IRP that were changed to
>enable queuing.
>
>Please explain and quantify the performance hit because I don't see it.
>
>> 3) Will make our code very different from all other Windows software
>> that I know of. As an example, look at the ZwWriteFile function.
>> Although this function can currently only be called from passive
>> level, its API is prepared for async transfer, but still doesn't use
>> IRP's.
>
>Take a look at WSKSocket, WSKSend, etc. in the Windows Sockets in
>Kernel section of the Windows Driver Kit documentation.  I've mentioned
>before that the API is designed similarly.  The WSK functions were
>designed to solve a very similar problem to what we have today - allow
>dispatch-level socket usage.  What our API does is allow dispatch-level
>IB usage using similar mechanisms.
>
>> 4) Will make it impossible to share the same header files from user
>> mode and kernel mode (the way that IRP's look from user mode is by an
>> overlapped struct). By the way, taking into account the differences
>> between user mode and kernel mode, it is not clear if we want to do
>> this in any way.
>
>Sharing headers is not a goal - I want the API to be optimized for the
>environment in which it is to run.  We need to move away from sharing
>headers between kernel and user-mode to allow using the facilities
>provided by the OS.  I don't expect that we will want or need to
>support async verbs in user-mode.  I do want to change how affiliated,
>unaffiliated, and completion notifications are delivered, though, so
>that the user can select the completion method that fits their model
>best - synchronous, async via GetOverlappedResult, I/O completion
>ports, or APCs.  I touched on some of these items in my presentation in
>August about the future of user-mode.
>
>My ultimate goal for user-mode is to eliminate all threads from the
>access layer, and have an API that enables a purely single-threaded
>application to work properly.  We need to move away from having our own
>completion notification mechanisms as they introduce a lot of
>complexity to the access layer, both in user and kernel mode.
>
>I also want to move away in general from duplicating functionality
>already provided by the OS.  If the OS provides a mechanism to get
>something done, I want to leverage that so that the code I have to
>maintain and update is reduced.
>
>> 5) IRP's are harder to debug. Since IRP's are maintained in opaque
>> operating system structures, finding where an IRP was lost is
>> somewhat harder.
>
>I don't buy this one bit.  First, IRPs are the primary mechanism for
>user-to-kernel interaction as well as driver-to-driver communication.
>We already have to debug these, and there are commands in the debugger
>as well as in verifier that support this.
>
>> I believe that since in any case we will be using wrapper functions
>> around the IRP-based functions, we should probably write the wrapper
>> function first.
>
>Why would we use wrapper functions?  This is a direct-call interface,
>not an IOCTL-driven interface.  I would expect kernel clients to call
>this interface directly, especially the proxy for user-mode support.
>The end goal is to have the application's IRP used as the IRP for the
>API completion, but there are a few steps to take before we get there.
>
>> In any case, if what I wrote isn't enough, I believe that the best
>> way to continue this discussion is to try and take one of these
>> functions, implement it in some trivial way and see the results. Once
>> we do this both ways, we can look at the code that is using the
>> functions, measure the performance and see how we want to continue
>> from here.
>
>I plan on taking all of these functions and implementing them.  Once
>the drivers are structured to handle verbs issued at dispatch level,
>changing the API from taking an IRP to taking a callback will be
>straightforward.  If during implementation the IRP completion model
>proves to be unwieldy, it will change.
>
>Aside from not liking the IRP completion model, are there any issues
>you see with the API definitions?  Imagine replacing the IRP input
>parameter with a callback and a callback context.  Think of the other
>ramifications changing the completion notification mechanism has on the
>client.  Does the end-to-end processing of a command become that much
>simpler?
>
>Here are things off the top of my head that would have to change when
>removing the IRP completion mechanism:
>- APIs need to take a flag indicating a user-mode vs. kernel-mode call.
>- APIs need to take a completion routine like:
>	void (*VERB_CB)(
>		DEVICE_OBJECT *pDevObj,
>		void *Context,
>		NTSTATUS Status,
>		ULONG_PTR Information );
>- APIs need to take a device object and context value to pass to the
>callback.
>
>I don't know if making the above changes really buys us much.  I still
>think that the IRP-based calls for getting unaffiliated, affiliated,
>and completion events simplify the code base as a whole - they remove
>the need for the proxy to queue these IRPs on behalf of the HCA driver.
>One way or another, the IRPs must be queued.  Why not put the queuing
>in the HCA driver and remove it from the proxy?
>
>Thanks,
>
>- Fab


