[Openib-windows] [RFC v2] Kernel Async Verbs

Tzachi Dar tzachid at mellanox.co.il
Mon Nov 21 08:45:46 PST 2005


Hi Fab,

Attached is the code from my example; you need to apply this patch to
the SDP driver that is now in openib.

Since the user-mode code is not there yet, you have to use the
following code fragments from user mode.

To open the device, use:

    m_hKernelLib =
            CreateFileW(
            SDP_WIN32_NAME,
            GENERIC_READ | GENERIC_WRITE,
            FILE_SHARE_READ | FILE_SHARE_WRITE,  // share mode
            NULL,                                // no security attributes
            OPEN_EXISTING,
            FILE_ATTRIBUTE_NORMAL,
            NULL                                 // no template file
            );


To send the IOCTL down, please use:
    WspSocketIn SocketIn;
    WspSocketOut SocketOut;
    DWORD BytesReturned = 0;

    BOOL ret = DeviceIoControl(
                    m_hKernelLib,
                    IOCTL_WSP_SOCKET,
                    &SocketIn,
                    sizeof(SocketIn),
                    &SocketOut,
                    sizeof(SocketOut),
                    &BytesReturned,
                    NULL
                    );
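
To turn these fragments into a throughput measurement like the numbers quoted
further down, a simple user-mode loop around the IOCTL is enough. This is only
a sketch of the idea, not the exact benchmark I ran; the iteration count, the
loop structure and the zeroed in/out structures are assumptions, while
WspSocketIn/WspSocketOut and IOCTL_WSP_SOCKET come from the SDP headers as
above:

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    void measure_ioctl( HANDLE hKernelLib )
    {
        const int       iterations = 1000000;
        WspSocketIn     SocketIn;
        WspSocketOut    SocketOut;
        DWORD           BytesReturned = 0;
        LARGE_INTEGER   freq, start, end;
        double          seconds;
        int             i;

        memset( &SocketIn, 0, sizeof(SocketIn) );
        memset( &SocketOut, 0, sizeof(SocketOut) );

        QueryPerformanceFrequency( &freq );
        QueryPerformanceCounter( &start );
        for( i = 0; i < iterations; i++ )
        {
            /* Each call crosses from user to kernel once. */
            DeviceIoControl( hKernelLib, IOCTL_WSP_SOCKET,
                             &SocketIn, sizeof(SocketIn),
                             &SocketOut, sizeof(SocketOut),
                             &BytesReturned, NULL );
        }
        QueryPerformanceCounter( &end );

        seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
        printf( "%.0f operations/sec, %.2f us per operation\n",
                iterations / seconds, seconds * 1e6 / iterations );
    }
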
There is a lot more data below, but the main point I want to make is
this: IRP processing costs a lot of CPU time, and if we introduce IRPs
now, it will be harder to take them out later. They are (probably) even
more expensive, relatively speaking, when used from kernel to kernel,
because there we could instead create an interface that lets calls into
the driver be normal function calls, which are very cheap (I measured
~1.2 ns for a trivial function).

More data below.

Thanks
Tzachi

>-----Original Message-----
>From: Fab Tillier [mailto:ftillier at silverstorm.com]
>Sent: Friday, November 18, 2005 7:47 PM
>To: 'Tzachi Dar'; openib-windows at openib.org
>Subject: RE: [Openib-windows] [RFC v2] Kernel Async Verbs
>
>Hi Tzachi,
>
>> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
>> Sent: Friday, November 18, 2005 5:50 AM
>>
>> Hi Fab,
>>
>> In order to see the influence of using IRP's and not using
>> IRP's I have made a simple experiment: I have checked how long
>> It takes for a call from user to kernel, using IRP to pass
>> From user to kernel, and the same call without an IRP (using
>> FastIoDispatch).
>
>Thanks for trying this out - very interesting results.  I can't find much
>documentation at all for FastIoDispatch.  What I did find was related to
>installable file systems.  Can you point me to some docs that explain the
>APIs from user-mode as well as implementation requirements for kernel-mode?
I don't know of much documentation besides what Google has to tell us. In
any case, attached is a patch with an example of how to use the code.
In general, what I remember is this: 1) In user mode things stay exactly
the same, no new interface whatsoever. 2) In kernel mode there is a
call to the fast function (without an IRP). If the request can be
completed immediately, then you are on the fast path: no IRP was created
and your performance is relatively good. 3) OSR says that the buffers
should be treated as "raw"; as I understand it, that means we need to
validate the buffers each time before we touch them. (Not done in my
example.)
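
For reference, here is a minimal sketch of how a driver hooks
FastIoDeviceControl. The routine and variable names are mine; the structure
and field names are the standard DDK ones. Returning FALSE makes the I/O
manager fall back to building an IRP and calling the normal
IRP_MJ_DEVICE_CONTROL dispatch routine:

    #include <ntddk.h>

    static BOOLEAN
    my_fast_io_device_control(
        IN PFILE_OBJECT         pFileObject,
        IN BOOLEAN              Wait,
        IN PVOID                pInputBuffer,
        IN ULONG                InputBufferLength,
        OUT PVOID               pOutputBuffer,
        IN ULONG                OutputBufferLength,
        IN ULONG                IoControlCode,
        OUT PIO_STATUS_BLOCK    pIoStatus,
        IN PDEVICE_OBJECT       pDeviceObject )
    {
        /* Treat the buffers as raw: validate pInputBuffer/pOutputBuffer and
         * their lengths before touching them (omitted here). */
        pIoStatus->Status = STATUS_SUCCESS;
        pIoStatus->Information = 0;
        return TRUE;    /* TRUE = handled on the fast path, no IRP built. */
    }

    static FAST_IO_DISPATCH g_fast_io_dispatch;

    /* Called from DriverEntry to register the fast path. */
    static void register_fast_io( PDRIVER_OBJECT pDriverObject )
    {
        RtlZeroMemory( &g_fast_io_dispatch, sizeof(g_fast_io_dispatch) );
        g_fast_io_dispatch.SizeOfFastIoDispatch = sizeof(FAST_IO_DISPATCH);
        g_fast_io_dispatch.FastIoDeviceControl = my_fast_io_device_control;
        pDriverObject->FastIoDispatch = &g_fast_io_dispatch;
    }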

So here are the very few sources that exist:
http://www.cmkrnl.com/arc-fastio.html - this is an old example of such code.
Here is something from MS:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/IFSK_d/hh/IFSK_d/Ch2IntroToFsFilters_50eaf4be-3189-45a8-9ba8-a21b2a6ff0c3.xml.asp

Here is something from osr:
http://www.osronline.com/showThread.cfm?link=75873 


>> Here are the results on a pci-express (32 bit) ~3.2GHz machine.
>> With IOCTLs: 321926 operations a second (or 3.1 us per operation).
>> Without IOCTLs: 1422475 operations a second (or 0.7 us).
>
>The code today uses IOCTLs from user-mode, and that hasn't been a problem so
>far.  From the work I've done tuning Winsock Direct, I know that a SetEvent
>operation takes ~2us, and then there are a few more micros spent waiting for
>the thread to get scheduled.  So an IOCTL method should perform as well as the
>current model, but provide extra flexibility to the caller.
The real saving I'm trying to get is in processing time from kernel
mode. It is possible to create an interface that makes the calls and
callbacks through a completion function and a context; I believe the
overhead of such a mechanism will be ~0.5us. Using IOCTLs, I'm afraid we
are going to spend ~3us, which is a waste of CPU time (to my
understanding this is time in which the CPU is busy doing work we don't
really know or care about). A better experiment that we can and should
do is to measure a kernel-to-kernel call made through an IOCTL. An
excellent example is the function SdpArp::SourcePortGidFromIP in the
file sdparp.cpp. I believe we will see that more than 2us are lost
there, while we could very likely get the same effect in only ~0.2us (a
simple function call). That is the time I'm actually trying to save.
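
To make the intent concrete, here is a minimal sketch of the kind of
kernel-to-kernel interface I have in mind. All names are hypothetical (this is
not the actual IBAL API); the point is only that posting and completing a
request are plain function calls carrying a completion callback and a context,
with no IRP on the path:

    #include <ntddk.h>

    typedef void (*XX_PFN_COMPLETE)( void *context, NTSTATUS status );

    typedef struct _XX_REQUEST
    {
        XX_PFN_COMPLETE pfn_complete;   /* called when the verb finishes    */
        void            *context;       /* caller-owned completion context  */
        /* ... verb-specific parameters ... */
    } XX_REQUEST;

    /* The consumer (e.g. SDP) posts a request with a plain function call. */
    NTSTATUS xx_post_request( XX_REQUEST *p_req );

    /* The provider (HCA driver) completes it later with another plain call;
     * there is no IoMarkIrpPending/IoCompleteRequest on this path. */
    static void xx_complete_request( XX_REQUEST *p_req, NTSTATUS status )
    {
        p_req->pfn_complete( p_req->context, status );
    }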


>> As a result, you can see that each IRP has a very high overhead. This
>> overhead (~350%) is relative to passing from user to kernel; relative
>> to a kernel-to-kernel call (a normal function call) the overhead will
>> be much higher in percentage terms.
>
>Note that the IRP is not passed down the driver stack via IoCallDriver - it is
>passed as input to the direct call interface.  Only a single driver (the HCA
>driver) will invoke IoCallDriver so that IoCompleteRequest can be called.
>
Again, the question is how long it will take us.

>> Please also note that on my experiment, I was completing the IRP's
>> immediately. On a "real life" scenario I would have to mark the IRP
>> pending first, and later complete it, so the overhead will be much
>> higher.
>
>Does the FastIoDispatch mechanism support asynchronous processing of requests?
>How?
The direct answer is no. This means that if the request cannot be
satisfied immediately, an IRP is created. Still, this can be used as a
way to pass from user to kernel (I have worked on a project that used
this technique). One way to get completion semantics is to store the
request details and report whether the request was satisfied; when the
answer arrives later, you pass the data from kernel to user, for example
by signaling an event. Throughput can be increased significantly if you
can combine a request with an answer. But again, I'm not trying to
create a mechanism to pass from user to kernel; I want the kernel code
to be more efficient. Please note that if we start now by requiring an
IRP for each call, it will be harder to remove them later when the
overall performance turns out to be low.
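
To illustrate the request/answer idea above, here is a rough sketch of the
kernel side: it stores the request and later signals an event whose handle
user mode passed in. All structure and routine names are hypothetical; only
the DDK calls (ObReferenceObjectByHandle, KeSetEvent, ObDereferenceObject)
are real:

    #include <ntddk.h>

    typedef struct _PENDING_REQ
    {
        LIST_ENTRY  list_entry;
        PKEVENT     p_user_event;   /* referenced from the user-mode handle */
        /* ... request parameters and space for the answer ... */
    } PENDING_REQ;

    /* Fast path: remember the request and return immediately. */
    static NTSTATUS queue_request( HANDLE h_user_event, PENDING_REQ *p_req )
    {
        NTSTATUS status;

        /* Turn the user-mode event handle into an object we can signal later. */
        status = ObReferenceObjectByHandle( h_user_event, EVENT_MODIFY_STATE,
                                            *ExEventObjectType, UserMode,
                                            (PVOID*)&p_req->p_user_event, NULL );
        if( !NT_SUCCESS( status ) )
            return status;

        /* ... insert p_req into a locked queue and kick off processing ... */
        return STATUS_PENDING;
    }

    /* Called later, when the answer is ready. */
    static void complete_request( PENDING_REQ *p_req )
    {
        /* ... copy the answer somewhere user mode can read it ... */
        KeSetEvent( p_req->p_user_event, IO_NO_INCREMENT, FALSE );
        ObDereferenceObject( p_req->p_user_event );
    }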


>> As a result, I strongly suggest using a callback function and a context
>> instead of IRP's.
>>
>> Do these things sound logical? Do you want to try and repro the
>> experiment on your machines?
>
>Yes, this sounds logical.  I want to understand better how to use
>FastIoDispatch, but if it will work and get us some latency savings in command
>processing it may well be worth the extra complexity in the driver of
>implementing our own callback mechanisms.  I'd love to see the code for the
>experiment, just so I can understand how the fast I/O dispatch stuff works.
>From what I've seen on the web searching for information it doesn't look like
>it will work for asynchronous processing, so will again require blocking in
>the caller's thread.  Blocking, if required, would eliminate any latency gains.
>
>Lastly, could you share the latencies of your HCAs for the various commands?
>If it is in the 100's of microseconds, a few extra microseconds won't matter.
I'm not sure what the HCA numbers are; in any case, I'm trying to
eliminate the extra processing overhead, not to decrease the latency.


>
>Thanks,
>
>- Fab
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fastio.diff
Type: application/octet-stream
Size: 5768 bytes
Desc: Fastio.diff
URL: <http://lists.openfabrics.org/pipermail/ofw/attachments/20051121/7dcd52eb/attachment.obj>

