[Openib-windows] [RFC v2] Kernel Async Verbs

Tzachi Dar tzachid at mellanox.co.il
Mon Nov 21 12:54:11 PST 2005


Hi Fab,

Please go ahead with the experiment that you have described. I'll try to
find real data about the time it takes to do a command on the command
interface, and once we have the data we will know where we stand.

Thanks
Tzachi

>-----Original Message-----
>From: Fab Tillier [mailto:ftillier at silverstorm.com]
>Sent: Monday, November 21, 2005 8:12 PM
>To: 'Tzachi Dar'; openib-windows at openib.org
>Subject: RE: [Openib-windows] [RFC v2] Kernel Async Verbs
>
>Hi Tzachi,
>
>I'll code up a test of IRP processing time for kernel clients after the
>address translation work is done.  What I'll do is add code in the proxy
>to generate a given number of IRPs for a no-op request and see how long
>it takes.  I expect that kernel-to-kernel calls are going to be faster
>than user-mode to kernel, since for user-mode calls the I/O manager maps
>the input and output user buffers.  For kernel-to-kernel calls, the I/O
>manager doesn't do any such mappings.  If the buffers in the fast I/O
>dispatch calls are indeed raw buffers, as you suspect, that is likely
>where a good portion of the time is saved.  In any case, comparing
>against a code path that doesn't map buffers and doesn't support
>asynchronous use isn't really useful.
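>
>Roughly, the timing harness on the initiator side might look like this
>(untested sketch - IOCTL_NOOP and the target device object are
>placeholders, and error handling is elided):
>
>    #include <ntddk.h>
>
>    /* Hypothetical no-op request; the real control code would come
>     * from the proxy's IOCTL set. */
>    #define IOCTL_NOOP \
>        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)
>
>    NTSTATUS MeasureIrpRoundTrips(PDEVICE_OBJECT Target, ULONG Count,
>                                  LARGE_INTEGER *Elapsed)
>    {
>        LARGE_INTEGER freq, start, end;
>        ULONG i;
>
>        start = KeQueryPerformanceCounter(&freq);
>        for (i = 0; i < Count; i++) {
>            KEVENT event;
>            IO_STATUS_BLOCK iosb;
>            PIRP irp;
>
>            KeInitializeEvent(&event, NotificationEvent, FALSE);
>            irp = IoBuildDeviceIoControlRequest(IOCTL_NOOP, Target,
>                    NULL, 0, NULL, 0, FALSE, &event, &iosb);
>            if (irp == NULL)
>                return STATUS_INSUFFICIENT_RESOURCES;
>
>            /* One IoCallDriver per request; the I/O manager frees the
>             * IRP when the target completes it. */
>            if (IoCallDriver(Target, irp) == STATUS_PENDING)
>                KeWaitForSingleObject(&event, Executive, KernelMode,
>                                      FALSE, NULL);
>        }
>        end = KeQueryPerformanceCounter(NULL);
>        Elapsed->QuadPart = ((end.QuadPart - start.QuadPart) * 1000000) /
>                            freq.QuadPart;  /* total microseconds */
>        return STATUS_SUCCESS;
>    }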
>
>The benchmark code I plan will result in only a single IoCallDriver call
>as well as a single IoCompleteRequest call per request.  It will use an
>I/O completion callback and allocate the IRP for each iteration.  The
>IRP will be queued in a list and a DPC queued to complete it.  I'll then
>compare it with a direct call interface where the request is queued and
>again completed via DPC.  This is the model that is closest to how the
>HCA driver will behave, and will give us the best comparison.  Let me
>know if you think there is a flaw in my plan.
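>
>For reference, the queue-and-DPC model on the target side could be
>sketched as follows (DEVICE_EXT is a hypothetical device extension;
>its setup via InitializeListHead, KeInitializeSpinLock, and
>KeInitializeDpc is elided):
>
>    typedef struct _DEVICE_EXT {     /* hypothetical device extension */
>        LIST_ENTRY  IrpList;
>        KSPIN_LOCK  Lock;
>        KDPC        Dpc;
>    } DEVICE_EXT;
>
>    /* Dispatch: pend the no-op IRP and queue a DPC to complete it,
>     * mimicking how the HCA driver defers completions. */
>    NTSTATUS DispatchNoop(PDEVICE_OBJECT Dev, PIRP Irp)
>    {
>        DEVICE_EXT *ext = (DEVICE_EXT *) Dev->DeviceExtension;
>
>        IoMarkIrpPending(Irp);
>        ExInterlockedInsertTailList(&ext->IrpList,
>            &Irp->Tail.Overlay.ListEntry, &ext->Lock);
>        KeInsertQueueDpc(&ext->Dpc, NULL, NULL);
>        return STATUS_PENDING;
>    }
>
>    /* DPC: drain the list, issuing the single IoCompleteRequest per
>     * request mentioned above. */
>    VOID CompleteNoopDpc(PKDPC Dpc, PVOID Context, PVOID A1, PVOID A2)
>    {
>        DEVICE_EXT *ext = (DEVICE_EXT *) Context;
>        PLIST_ENTRY entry;
>
>        UNREFERENCED_PARAMETER(Dpc);
>        UNREFERENCED_PARAMETER(A1);
>        UNREFERENCED_PARAMETER(A2);
>
>        while ((entry = ExInterlockedRemoveHeadList(&ext->IrpList,
>                                                    &ext->Lock)) != NULL) {
>            PIRP irp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);
>
>            irp->IoStatus.Status = STATUS_SUCCESS;
>            irp->IoStatus.Information = 0;
>            IoCompleteRequest(irp, IO_NO_INCREMENT);
>        }
>    }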
>
>There are some calls in the interface I defined that need to keep the
>IRP as an input parameter - all the event notification requests.  These
>were put in specifically for user-mode, so that a user-mode client can
>queue notification requests.  Kernel clients would likely just use the
>direct callback mechanisms, though they would be free to use the IRP
>mechanism if they want.  These functions are:
>	GetCaFatalEvent
>	GetCaPortEvent
>	GetCqCompEvent
>	GetCqAsyncEvent
>	GetQpAsyncEvent
>
>Since these user-mode calls come down as IRPs (they're processed
>asynchronously), it's simpler to have them queue up within the HCA
>driver rather than have some middle man synchronize callbacks from the
>HCA driver with requests from the application.
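>
>As an illustration, delivering a CA fatal event against such a queue
>might look like this (a sketch only - it assumes the pended
>GetCaFatalEvent IRPs sit in a cancel-safe queue, and HCA_EXT plus the
>IoCsqInitialize setup are hypothetical):
>
>    typedef struct _HCA_EXT {        /* hypothetical per-CA state */
>        IO_CSQ   FatalEventCsq;      /* cancel-safe queue of pended IRPs */
>        BOOLEAN  FatalEventPending;
>    } HCA_EXT;
>
>    /* Pop one pended GetCaFatalEvent IRP, if any, and complete it;
>     * otherwise remember the event for the next request. */
>    VOID FireCaFatalEvent(HCA_EXT *ext)
>    {
>        PIRP irp = IoCsqRemoveNextIrp(&ext->FatalEventCsq, NULL);
>
>        if (irp != NULL) {
>            irp->IoStatus.Status = STATUS_SUCCESS;
>            IoCompleteRequest(irp, IO_NO_INCREMENT);
>        } else {
>            ext->FatalEventPending = TRUE;   /* deliver on next request */
>        }
>    }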
>
>One of the main reasons I chose IRPs for completion notifications (aside
>from similarities with the WSK API) was to provide common handling for
>kernel and user-mode requests.  The premise here was that using IRPs
>would perform no worse than the existing blocking implementation, and
>the calls involved are not speed-path calls.
>
>I'm open to looking at alternatives, but want to be careful not to
>complicate the implementation to gain a few microseconds in code paths
>that don't matter.  Having information about command interface latencies
>would be great.  The Programmer's Reference Manual for the MT25208 chip
>states that command latencies depend on the command class: A (up to
>1ms), B (up to 10ms), C (up to 100ms), and D (beyond 1000ms).  Most
>commands look to be in the A or B class.  If we're looking at command
>latencies that are measured in milliseconds, I don't think we should
>spend too much time trying to squeeze out a few extra microseconds.  It
>would be great if you can get and share more detailed information about
>the command latencies that would show that a few microseconds make a
>difference.
>
>Lastly, remember that speed-path operations, since they all complete
>immediately, don't use IRPs at all.
>
>- Fab


