[Openib-windows] [RFC] IRP-based verbs

Fab Tillier ftillier at silverstorm.com
Fri Sep 9 10:05:23 PDT 2005


> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
> Sent: Friday, September 09, 2005 4:10 AM
>
> >-----Original Message-----
> >From: Fab Tillier [mailto:ftillier at silverstorm.com]
> >Sent: Thursday, September 08, 2005 10:00 PM
> >
> >> From: Tzachi Dar [mailto:tzachid at mellanox.co.il]
> >> Sent: Thursday, September 08, 2005 11:28 AM
> >>
> >> 1) Do you intend for all the kernel mode ULPs to move to an IOCTL-based
> >> model? That is, do you want the IPoIB driver to create an IRP every time it
> >> is going to talk with IBAL?
> >
> >Initially, this would be deployed just between IBAL and the HCA driver.
> >Over time ULPs would be transitioned to it too.
> >
> >Speed-path (post/poll) operations would still be direct-call.  I'd like to
> >eventually have a direct call interface similar to Microsoft's new kernel
> >socket model (WSK), but that would still take as input an IRP for completion
> >notifications.
>
> The IOCTL interface is a very complicated interface that is mainly used for
> communication between user-mode applications and the kernel. Since it is so
> complicated there will probably be wrapper functions around it, so I believe
> that there will be no need to use IOCTLs (at all) for communication within the
> kernel.

It doesn't have to be IOCTLs necessarily - the goal is to use IRPs to leverage
the I/O completion processing provided by the I/O manager, rather than
re-implementing everything internally.  In handling IRPs, there's not much
difference between PnP requests, IOCTLs, and any other type of IRP.

What's so complicated about IOCTL handling?  Direct calls would be fine if they
could work at DISPATCH - the problem is they don't.  The idea is two-fold:
- Anything that currently requires PASSIVE IRQL changes to support calls at
DISPATCH.  This means sync calls become async.
- Anything that currently implements its own callback notifications changes to
use an IRP to manage such notifications.  This means the newly async verb calls
use IRPs to indicate completions.  It also means CQ notifications complete an
IRP.
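
To make the second point concrete, here's a rough sketch of what an
asynchronous verb call could look like from a kernel client at DISPATCH level.
The IOCTL code, parameter structure, and names are made up for illustration -
this is not the actual IBAL/HCA interface:

#include <ntddk.h>

/* Hypothetical IOCTL code and parameter block - illustration only. */
#define IOCTL_HCA_CREATE_QP \
    CTL_CODE( FILE_DEVICE_UNKNOWN, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS )

typedef struct _HCA_CREATE_QP_PARAMS
{
    ULONG   qp_attr_placeholder;    /* real QP attributes would go here */
} HCA_CREATE_QP_PARAMS;

static NTSTATUS
create_qp_complete(
    IN PDEVICE_OBJECT p_dev_obj,
    IN PIRP p_irp,
    IN PVOID context )
{
    UNREFERENCED_PARAMETER( p_dev_obj );
    UNREFERENCED_PARAMETER( context );

    /* Runs at IRQL <= DISPATCH_LEVEL when the HCA driver completes the
     * verb - hand the new QP (or the failure status) back to the ULP here. */

    IoFreeIrp( p_irp );
    /* We allocated this IRP ourselves, so stop I/O manager completion. */
    return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS
issue_create_qp(
    IN PDEVICE_OBJECT p_hca_dev,
    IN HCA_CREATE_QP_PARAMS *p_params )
{
    PIRP                p_irp;
    PIO_STACK_LOCATION  p_io_stack;

    /* IoAllocateIrp can be called at IRQL <= DISPATCH_LEVEL, unlike
     * IoBuildDeviceIoControlRequest. */
    p_irp = IoAllocateIrp( p_hca_dev->StackSize, FALSE );
    if( !p_irp )
        return STATUS_INSUFFICIENT_RESOURCES;

    p_io_stack = IoGetNextIrpStackLocation( p_irp );
    p_io_stack->MajorFunction = IRP_MJ_INTERNAL_DEVICE_CONTROL;
    p_io_stack->Parameters.DeviceIoControl.IoControlCode = IOCTL_HCA_CREATE_QP;
    /* METHOD_NEITHER: pass the verb parameters by pointer. */
    p_io_stack->Parameters.DeviceIoControl.Type3InputBuffer = p_params;
    p_io_stack->Parameters.DeviceIoControl.InputBufferLength = sizeof(*p_params);

    IoSetCompletionRoutine( p_irp, create_qp_complete, NULL, TRUE, TRUE, TRUE );

    /* Returns immediately; the verb completes through create_qp_complete. */
    return IoCallDriver( p_hca_dev, p_irp );
}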

IRPs are the fundamental I/O building block in Windows.  From the DDK
documentation "An IRP is the basic I/O manager structure used to communicate
with drivers and to allow drivers to communicate with each other."  Is Microsoft
incorrect in making that claim?  Sure, there are mechanisms for direct call
interfaces and those will still exist, at a minimum for the speed path
operations.  Notice that the direct call interfaces do support being called at
dispatch level - the BUS_INTERFACE_STANDARD is actually designed to eliminate
PASSIVE IRQL requirements imposed by IoGetDmaAdapter, IRP_MN_READ_CONFIG,
IRP_MN_WRITE_CONFIG, etc.
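
As a small illustration of how those direct-call interfaces work: once a driver
has obtained BUS_INTERFACE_STANDARD with IRP_MN_QUERY_INTERFACE at start-device
time, its entry points can be invoked at DISPATCH level (a sketch, not taken
from our code):

#include <ntddk.h>

/* GetBusData is documented as callable at IRQL <= DISPATCH_LEVEL once the
 * interface has been queried (the query itself is a PnP IRP at PASSIVE). */
USHORT
read_pci_vendor_id(
    IN BUS_INTERFACE_STANDARD *p_bus_if )
{
    USHORT vendor_id = 0;

    p_bus_if->GetBusData( p_bus_if->Context,
                          PCI_WHICHSPACE_CONFIG,
                          &vendor_id,
                          FIELD_OFFSET( PCI_COMMON_CONFIG, VendorID ),
                          sizeof(vendor_id) );
    return vendor_id;
}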

> As for user mode, we can create a different library that will be used to
> communicate with the kernel; it will use IOCTLs, but these IOCTLs will be
> translated to the regular "kernel" interface at a high level. Please note
> that if we anticipate stress on the calls from user to kernel, then IOCTLs are
> not the best solution even when passing from user to kernel. We can of course
> extend this talk if we see that there is a need.

How would you use something other than IOCTLs (or IRPs) to communicate between
user-mode and kernel-mode?  And how would it be simpler than calling
DeviceIoControl from user mode?
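
A user-mode wrapper over DeviceIoControl really is thin - something along these
lines (the IOCTL code and names are invented for the example):

#include <windows.h>
#include <winioctl.h>

/* Hypothetical IOCTL code - illustration only. */
#define IOCTL_IB_CREATE_QP \
    CTL_CODE( FILE_DEVICE_UNKNOWN, 0x801, METHOD_BUFFERED, FILE_ANY_ACCESS )

/* h_dev is a handle opened on the IBAL control device; completion is
 * delivered through the normal Win32 overlapped I/O model (events, APCs,
 * or an I/O completion port). */
BOOL
ual_create_qp(
    HANDLE h_dev,
    void *p_in, DWORD in_len,
    void *p_out, DWORD out_len,
    OVERLAPPED *p_ov )
{
    DWORD bytes_ret;
    return DeviceIoControl( h_dev, IOCTL_IB_CREATE_QP,
        p_in, in_len, p_out, out_len, &bytes_ret, p_ov );
}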

> >> Does it mean that the entire interface will change?
> >
> >All verbs that result in command interface calls would be issued via IOCTL.
> >It would require the ULPs to change to allocate, format, and issue the
> >IRPs. It allows using the I/O completion callback processing provided by
> >the OS rather than implementing custom callback mechanisms.  Think of it
> >as evolving the stack to be designed for Windows.
>
> If you look at the Windows components that are "available" to the public, you
> will find out that they don't use IOCTLs to communicate with each other.

Really?  Why is there a major IRP code of IRP_MJ_INTERNAL_DEVICE_CONTROL?  This
IRP code is specifically for kernel drivers to communicate with one another.
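
On the receiving side it is nothing exotic either - just another dispatch
routine hooked into the driver's MajorFunction table.  A rough sketch, with the
same made-up IOCTL code as the earlier example:

#include <ntddk.h>

/* Same hypothetical IOCTL code as in the earlier sketch. */
#define IOCTL_HCA_CREATE_QP \
    CTL_CODE( FILE_DEVICE_UNKNOWN, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS )

/* Hooked up in DriverEntry:
 *   p_drv_obj->MajorFunction[IRP_MJ_INTERNAL_DEVICE_CONTROL] = hca_internal_ioctl;
 */
NTSTATUS
hca_internal_ioctl(
    IN PDEVICE_OBJECT p_dev_obj,
    IN PIRP p_irp )
{
    PIO_STACK_LOCATION  p_io_stack = IoGetCurrentIrpStackLocation( p_irp );

    UNREFERENCED_PARAMETER( p_dev_obj );

    switch( p_io_stack->Parameters.DeviceIoControl.IoControlCode )
    {
    case IOCTL_HCA_CREATE_QP:
        /* Queue the command to the HCA and return immediately.  The IRP is
         * completed with IoCompleteRequest when the command interface
         * reports completion, which in turn fires the caller's completion
         * routine. */
        IoMarkIrpPending( p_irp );
        /* ... post or queue the command here ... */
        return STATUS_PENDING;

    default:
        p_irp->IoStatus.Status = STATUS_INVALID_DEVICE_REQUEST;
        IoCompleteRequest( p_irp, IO_NO_INCREMENT );
        return STATUS_INVALID_DEVICE_REQUEST;
    }
}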

> >> 2) Currently the hardware driver doesn't support operations at dispatch
> >> level (that is, Create-QP will block below the IBAL library). As a result
> >> I don't understand how you are going to achieve the goal of allowing
> >> all operations to work from dispatch level.
> >
> >The HCA driver will need to be fixed.  Requiring all verbs to be issued at
> >passive, especially considering the command interface is asynchronous, is a
> >lousy design and imposes all sorts of restrictions on kernel clients.  This
> >design flaw requires ULPs to have passive level threads to perform work -
> >work which may be delayed by I/O completion processing from other clients.
> >It will also enable new functionality in ULPs that do not have a flexible
> >way to get into a passive level thread context.
>
> The current version of the driver doesn't support this. The next version of
> the driver will also not support this (at least this is the plan for now).
> How are you going to work around this problem?

It is quite disappointing that the new HCA driver perpetuates this flaw.  Before
Leonid started work on the memfree driver, I raised the issue (both with him and
Gilad) of the need to support verb calls at dispatch.  I even sent Leonid some
code that I had started to work on to do exactly this.

Requiring verbs to be done at passive level imposes a policy on kernel clients
that their driver model may not allow.  This is definitely true of SRP, and to a
lesser extent IPoIB.  I've been complaining about this limitation for a very
long time, and apparently my recommendations have fallen on deaf ears.  Just
because it requires some work is not a reason not to do it.  We're here to do
the right thing, and the current HCA driver is not it.
 
This is a software problem, and we're here to make this software fit into the
Windows driver environment as much as possible.  In Windows, drivers spend quite
a bit of time working at dispatch level.

If you look at how kernel clients use sockets today, they are limited to
creating sockets at passive level only.  This is a big enough problem that
Microsoft developed WSK (Winsock Kernel), and one of the big
advantages of WSK is that it allows sockets to be created and manipulated at
dispatch level.  Why should we make a similar mistake with verbs, and not take a
hint from Microsoft's own development efforts?

> One more issue to notice is that although the command interface allows for
> calls to be issued at dispatch level, it doesn't allow an unlimited number of
> simultaneous requests.  This requires locks, queues and so on, which will move
> the complication to other places.

You already have locks, queues, and so forth.  Whether the command interface is
synchronous or not, you *must* be able to queue up requests from multiple
clients so that you don't exceed the maximum number of commands outstanding.

Just looking at main_cmd_flow in
hw\mt23108\vapi\Hca\hcahal\tavor\cmdif\cmdif.c (line 953) and following the code
path for event-driven commands, there are accesses to:
- 2 spinlocks (each acquired/released multiple times)
- 2 semaphores (to control queue depth)
- 1 mutex
- 1 event (command completion)

Changing to asynchronous usage will eliminate the semaphores and events and
replace them with a doubly linked list.  How hard are doubly linked lists to
use?
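
For example, the whole flow-control problem could reduce to something like the
following sketch (the names are illustrative, not the actual cmdif code):

#include <ntddk.h>

typedef struct _cmd_queue
{
    KSPIN_LOCK  lock;
    LIST_ENTRY  pending;     /* commands waiting for a free command slot */
    ULONG       free_slots;  /* HCA's outstanding-command budget */
} cmd_queue_t;

/* Called by clients at IRQL <= DISPATCH_LEVEL. */
VOID
cmd_post(
    IN cmd_queue_t *p_q,
    IN LIST_ENTRY *p_cmd )
{
    KIRQL   irql;
    BOOLEAN start = FALSE;

    KeAcquireSpinLock( &p_q->lock, &irql );
    if( p_q->free_slots )
    {
        p_q->free_slots--;
        start = TRUE;
    }
    else
    {
        /* No slot free - just park the request.  No semaphore, no event. */
        InsertTailList( &p_q->pending, p_cmd );
    }
    KeReleaseSpinLock( &p_q->lock, irql );

    if( start )
    {
        /* ... write the command to the HCA command interface ... */
    }
}

/* Called from the command-completion EQ handler at DISPATCH_LEVEL. */
VOID
cmd_complete(
    IN cmd_queue_t *p_q )
{
    LIST_ENTRY  *p_next = NULL;
    KIRQL       irql;

    KeAcquireSpinLock( &p_q->lock, &irql );
    if( !IsListEmpty( &p_q->pending ) )
        p_next = RemoveHeadList( &p_q->pending );
    else
        p_q->free_slots++;
    KeReleaseSpinLock( &p_q->lock, irql );

    if( p_next )
    {
        /* ... write the next queued command to the HCA; the finished
         * command's IRP is completed by the caller. */
    }
}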

Queuing IRPs is a well-documented and well-understood part of writing Windows
drivers - the DDK even provides cancel-safe IRP queue routines (IoCsqInitialize
and friends).  Adding IRP queuing to the HCA driver should be straightforward.

> >> 3) What is the impact on the client that you see from this change? Will
> >> it bring higher BW? Lower latency? Increase of connection rate?
> >
> >This will simplify the code base considerably by eliminating context
> >switches currently required to account for a bad HCA driver model (and
> >the reference counting those context switches require).  It also creates
> >a common path for kernel and user-mode verb processing, and lets clients
> >decide how they want to process their verbs.  In effect, it removes
> >policy imposed by the HCA driver (not the HW!).
> >
> >For IBAL as a client, it allows elimination of the destroy thread, the
> >local mad processing thread, passive threads for verb processing, and the
> >asynchronous destroy callback mechanism to name a few (I'm probably
> >missing others).
>
> I believe that although the new model will allow removing some of the threads
> that you have mentioned, there will probably be others instead.

This is only true if you make the minimal changes to the HCA driver - instead of
IBAL providing threads and all the overhead of managing IRQL limitations, that
functionality would move into the HCA driver (where it really belongs, since
it's an HCA driver limitation).  If the HCA driver is written properly, then you
actually *can* eliminate threads (and the context switches they require).

> I also
> believe that using reference counts on the objects will help make the
> destruction of objects simpler.  We have to make sure that we have a complete
> design before we start coding.

Yes, reference counting will still be required.  It will be less complicated.
Less complicated is always a good thing IMO.

- Fab



