[Openib-windows] [RFC v2] Kernel Async Verbs

Fri Nov 18 09:47:24 PST 2005

Hi Tzachi,

> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
> Sent: Friday, November 18, 2005 5:50 AM
> 
> Hi Fab,
> 
> In order to see the influence of using IRPs and not using
> IRPs I have made a simple experiment: I have checked how long
> it takes for a call from user to kernel, using an IRP to pass
> from user to kernel, and the same call without an IRP (using
> FastIoDispatch).

Thanks for trying this out - very interesting results.  I can't find much
documentation at all for FastIoDispatch.  What I did find was related to
installable file systems.  Can you point me to some docs that explain the APIs
from user-mode as well as implementation requirements for kernel-mode?

> Here are the results on a PCI Express (32-bit) ~3.2 GHz machine.
> With IOCTLs: 321926 operations a second (or 3.1 us per operation).
> Without IOCTLs: 1422475 operations a second (or 0.7 us per operation).

The code today uses IOCTLs from user-mode, and that hasn't been a problem so
far.  From the work I've done tuning Winsock Direct, I know that a SetEvent
operation takes ~2us, and then a few more microseconds are spent waiting for the
thread to get scheduled.  So an IOCTL method should perform as well as the
current model, but provide extra flexibility to the caller.

> As a result, you can see that each IRP has a very high overhead. This
> overhead (~350%) is relative to a call from user to kernel. Relative
> to a call from kernel to kernel (a normal call), the overhead will be
> much higher in percentage terms.
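
As a sanity check on the figures above, here is a minimal C sketch of the
arithmetic.  It converts operations per second into microseconds per operation
and computes the relative overhead of the slow path.  The function names are
illustrative only and do not come from any driver code.

```c
/* Average time per operation, in microseconds, for a measured
 * throughput given in operations per second. */
double us_per_op(double ops_per_sec)
{
    return 1e6 / ops_per_sec;
}

/* Extra cost of the slow path relative to the fast path, in percent. */
double overhead_percent(double slow_us, double fast_us)
{
    return (slow_us - fast_us) / fast_us * 100.0;
}
```

With the quoted measurements, us_per_op(321926.0) is about 3.1 and
us_per_op(1422475.0) is about 0.7, which puts the overhead at roughly 340%,
consistent with the ~350% quoted.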

Note that the IRP is not passed down the driver stack via IoCallDriver - it is
passed as input to the direct call interface.  Only a single driver (the HCA
driver) will invoke IoCallDriver so that IoCompleteRequest can be called.

> Please also note that in my experiment, I was completing the IRPs
> immediately. In a "real life" scenario I would have to mark the IRP
> pending first, and later complete it, so the overhead will be much
> higher.

Does the FastIoDispatch mechanism support asynchronous processing of requests?
How?

> As a result, I strongly suggest using a callback function and a context
> instead of IRPs.
> 
> Do these things sound logical? Do you want to try and repro the
> experiment on your machines?

Yes, this sounds logical.  I want to understand better how to use
FastIoDispatch, but if it will work and get us some latency savings in command
processing it may well be worth the extra complexity in the driver of
implementing our own callback mechanisms.  I'd love to see the code for the
experiment, just so I can understand how the fast I/O dispatch stuff works.
From what I've found searching the web, it doesn't look like it
will work for asynchronous processing, so it will again require blocking in the
caller's thread.  Blocking, if required, would eliminate any latency gains.
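
To make the callback-plus-context idea concrete, here is a minimal sketch in
portable user-mode C.  It is not kernel code and not a proposed API: all of the
names (pending_verb_t, verb_post, verb_complete, record_status) are
hypothetical, and it assumes a single thread, so it has none of the locking a
real driver would need.  The point is only that the caller returns immediately
after posting and is notified later through its callback, without blocking.

```c
#include <stddef.h>

/* Hypothetical completion callback, invoked when the verb finishes. */
typedef void (*verb_cb_t)(void *context, int status);

/* Hypothetical pending request: holds what an IRP would otherwise carry. */
typedef struct pending_verb {
    verb_cb_t cb;       /* caller's completion routine */
    void     *context;  /* caller's opaque context */
    int       in_use;   /* nonzero while a verb is outstanding */
} pending_verb_t;

/* Post a verb without blocking.  Returns 0 if queued, -1 on error
 * (bad arguments or a verb already outstanding on this request). */
int verb_post(pending_verb_t *req, verb_cb_t cb, void *context)
{
    if (req == NULL || cb == NULL || req->in_use)
        return -1;
    req->cb = cb;
    req->context = context;
    req->in_use = 1;
    return 0;
}

/* Complete an outstanding verb.  In a driver this would run when the
 * hardware reports completion.  Returns 0 on success, -1 if nothing
 * was outstanding. */
int verb_complete(pending_verb_t *req, int status)
{
    verb_cb_t cb;
    void *context;

    if (req == NULL || !req->in_use)
        return -1;
    cb = req->cb;
    context = req->context;
    req->in_use = 0;    /* release first so the callback may repost */
    cb(context, status);
    return 0;
}

/* Example callback: stores the status in the int the context points to. */
void record_status(void *context, int status)
{
    *(int *)context = status;
}
```

A caller would zero-initialize a pending_verb_t, call verb_post with
record_status and a pointer to an int, and carry on with other work.  The int
is written only when verb_complete runs.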

Lastly, could you share the latencies of your HCAs for the various commands?  If
they are in the hundreds of microseconds, a few extra microseconds won't matter.
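
To put a rough number on that, here is a small C sketch that expresses a fixed
per-call cost as a percentage of the command latency.  The function name is
illustrative, and the command latencies in the note below are assumed values,
not measurements of any HCA.

```c
/* Fixed per-call cost expressed as a percentage of the command latency.
 * Both arguments are in microseconds. */
double fixed_cost_percent(double extra_us, double command_us)
{
    return extra_us / command_us * 100.0;
}
```

Taking the 2.4 us difference between the two measured paths (3.1 us minus
0.7 us), an assumed 100 us command would see about 2.4% added, while an assumed
10 us command would see about 24%.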

Thanks,

- Fab