[ofw] RE: NetworkDirect over WinVerbs

Wed Feb 11 11:35:23 PST 2009

>> ND::Disconnect is always called in the context of the user's thread. In
>> any case, it doesn't matter - the output buffer will always be provided
>> in the context of the user's thread, but there's no requirement for the
>> modification to be done in the context of the user's thread.  I'll even
>> assert that requiring QP modification to be done in the user's thread
>> context limits scalability and is a lousy design.
>  The OFED stack has blocking calls, QP modify is done entirely in the
> user's thread, every modify operation requires an IOCTL, and it still
> scales to hundreds of thousands of connections.  Plus the synchronous
> operation makes it easier to program to.  I'll assert that your
> assertion is bogus. :b

The app should be responsible for deciding whether it wants the simplicity of synchronous operations or the flexibility of asynchronous operations.  The fact that the driver imposes a blocking model is a flaw in the driver model, especially since the driver internally processes things asynchronously.

> Deferring the modify call to an arbitrary thread greatly complicates the
> locking, destruction handling, and device removal.  You're overlooking
> some vital implementation details that not only affect scalability, but
> interface viability.  The modify call is initiated in the user's thread
> - it could complete at a later time if anything supported a non-blocking
> modify call, but nothing does, and I don't see any indications that that
> will change.

You have to handle that locking anyway for the case where a multi-threaded app misbehaves (a misbehaving app can't be allowed to crash the system).  I sincerely hope that someday the HCA driver will have a better threading model.  It's taken over 4 years for WinVerbs to happen (probably only because NetworkDirect came along).  Is it that outlandish that maybe in another 4 years the HCA driver would be designed and implemented with Windows in mind?

You know what, I'll put together a patch that will move the QP modification to a work item context and prove that it doesn't complicate locking, destruction handling, or device removal.

> The alternative is to block the CM thread while modify
> completes, which has a direct effect on the ability to process
> connections in a timely manner.

You're just moving the blocking around - you're serializing connection establishment in the application's thread.  Blocking the app's thread 'has a direct effect on the ability to process connections in a timely manner' too.

>> You don't need them to reference one another at all.  In fact you
>> should avoid it.  But there's no reason both the QP handle and the CEP
>> handle can't be provided in the same IOCTL, that first performs the CEP
>> operation, and when that completes, performs the QP operation.  The two
>> objects are still independent, the IOCTL just has multiple phases, with
>> each phase operating on a different object.
>  Notification of the CEP operation completing means that winverbs is now
> in an arbitrary thread context.  The QP could have been destroyed or be
> in the process of being destroyed.

The application context is alive and well as long as an IRP is outstanding.  There's nothing that prevents you from doing a handle lookup from an arbitrary thread if you already have the application context.  The only thing you need to be in the app's thread context for is the initial lookup of the application context.  Once you have that, and the IRP is outstanding, the app context can't go away: the thread won't exit while there are IRPs outstanding, and the file object can't be closed while IRPs are outstanding.  So if your app context gets cleaned up when the file gets closed, then it's already handled for you by the system.

>  The device could be going away.

That's true of any IOCTL call.

> Handling this is non-trivial.

You already have to handle it.

>  The locking now has to change to
> accommodate the callback thread context.

Why?  The QP modify IOCTL does:
1. a lookup of the application context
2. a handle check on the QP ID
3. if the handle check succeeds, it performs the modify.

You already have to handle the device going away between steps 2 and 3.
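
To make those three steps concrete, here is a minimal user-mode model in C.  It is not the WinVerbs code: every name in it (app_ctx, request_get_ctx, qp_lookup, qp_modify_ioctl) is hypothetical, and a real driver would use WDF objects and proper locking.  The point is only that the handle check and the device-removal check sit in the same path no matter which thread runs it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not the WinVerbs code: names and layout are illustrative. */
enum { MAX_QPS = 8 };
enum status { ST_OK = 0, ST_INVALID_HANDLE = -1, ST_DEVICE_REMOVED = -2 };

struct qp { bool in_use; int state; };

struct app_ctx {
    bool device_removed;        /* set by the device removal path */
    struct qp qps[MAX_QPS];     /* per-application QP table, indexed by QP ID */
};

/* Stands in for a request whose file object leads to the app context. */
struct request { struct app_ctx *ctx; };

/* Step 1: look up the application context from the request. */
struct app_ctx *request_get_ctx(const struct request *req)
{
    assert(req != NULL);
    return req->ctx;
}

/* Step 2: validate the user-supplied QP ID against the app's table. */
struct qp *qp_lookup(struct app_ctx *ctx, size_t id)
{
    if (id >= MAX_QPS || !ctx->qps[id].in_use)
        return NULL;
    return &ctx->qps[id];
}

int qp_modify_ioctl(const struct request *req, size_t qp_id, int new_state)
{
    struct app_ctx *ctx = request_get_ctx(req);
    struct qp *qp = qp_lookup(ctx, qp_id);

    if (qp == NULL)
        return ST_INVALID_HANDLE;
    /* The device may have gone away between the handle check and the modify. */
    if (ctx->device_removed)
        return ST_DEVICE_REMOVED;

    qp->state = new_state;      /* Step 3: perform the modify. */
    return ST_OK;
}
```

Nothing in qp_modify_ioctl depends on the identity of the calling thread; it only needs the request.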

In the CM callback you have a WDFREQUEST, so you can do these exact steps.  The only issue is related to the CM callback being invoked at DISPATCH_LEVEL, and the QP modify being a synchronous call.  That problem would go away if the HCA driver model didn't suck.  In the meantime you can allocate a work item and queue it for processing at passive level.  You still have the IRP outstanding, so the app context won't go away.
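
A rough sketch of the work item idea, again as a user-mode C model with hypothetical names (work_queue, cm_callback_defer, run_work_items).  In a real driver the queue would be a kernel or WDF work item and the worker would run at PASSIVE_LEVEL; here a plain array stands in for both.  The request stays outstanding until the worker completes it, which is what keeps the app context alive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode model of "queue a work item from the CM callback". */
enum { MAX_WORK = 16 };

struct request {
    bool completed;   /* an outstanding (not completed) request pins the app context */
    int  status;
    int  qp_state;    /* stands in for the QP the request names */
};

struct work_item { struct request *req; int new_state; };

struct work_queue {
    struct work_item items[MAX_WORK];
    size_t head;      /* next item to run */
    size_t tail;      /* next free slot */
};

/* Called from the CM callback (DISPATCH_LEVEL in the real driver).
 * It must not block, so it only records what to do and returns. */
bool cm_callback_defer(struct work_queue *q, struct request *req, int new_state)
{
    assert(!req->completed);
    if (q->tail == MAX_WORK)
        return false;           /* queue full; caller must fail the request */
    q->items[q->tail].req = req;
    q->items[q->tail].new_state = new_state;
    q->tail++;
    return true;
}

/* Runs later in a worker thread (passive level), where a blocking modify is fine.
 * Returns the number of work items processed. */
size_t run_work_items(struct work_queue *q)
{
    size_t n = 0;
    while (q->head < q->tail) {
        struct work_item *w = &q->items[q->head++];
        w->req->qp_state = w->new_state;  /* the synchronous QP modify */
        w->req->status = 0;
        w->req->completed = true;         /* only now may the context go away */
        n++;
    }
    return n;
}
```

The CM thread never blocks on the modify, and the app's thread doesn't have to either.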

You don't need to store the QP ID in the CM ID, or vice versa - both IDs are provided by the user and can be treated independently.  You can look the application context up at any time given an arbitrary WDFREQUEST object in an arbitrary thread context - there's nothing in the code that requires being in the app's thread context.

>  Device removal now has to
> worry that a CM thread may be running when it's trying to release
> hardware resources associated with a QP that may or may not be
> associated with a CEP.

The QP association with a CEP is irrelevant.  You have a request to modify the QP while the device is getting removed.  This can happen anyway.

> The only issue with winverbs is that it doesn't invoke the IB protocol
> in the way that ND is trying to mandate.

Yes, WinVerbs does not support NetworkDirect nicely.  It's very close, but not quite there yet.  It would be really nice to have a clean implementation of NetworkDirect over WinVerbs.  What's wrong with adding a method to disconnect and flush?  There's precedent for methods of IWVConnectEndpoint taking a QP as input and performing QP state changes: Connect and Accept.  Add:

STDMETHOD(DisconnectAndFlush)(
        THIS_
        __in IWVConnectQueuePair* pQp,
        __in_opt OVERLAPPED* pOverlapped
        ) PURE;

There's no more dependency between the two objects in the kernel code than exists today with Connect and Accept.
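
As an illustration of what "multiple phases, each operating on a different object" could look like, here is a small C model.  It is a sketch under assumptions, not a proposed kernel implementation: the structs and function names are hypothetical, and the flush is modeled as moving the QP to the error state, which is what flushes outstanding work requests in IB.  Note that neither the cep nor the qp struct points at the other; only the request names both.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a two-phase DisconnectAndFlush request. */
enum cep_state { CEP_CONNECTED, CEP_DISCONNECTING, CEP_DISCONNECTED };
enum qp_state  { QP_RTS, QP_ERROR };

struct cep { enum cep_state state; };                 /* no reference to a QP */
struct qp  { enum qp_state state; int posted; int flushed; }; /* no reference to a CEP */

/* Both handles arrive in the same request; that is the only link between them. */
struct disconnect_request {
    struct cep *cep;
    struct qp  *qp;
    bool completed;
};

/* Phase 1: start the CEP disconnect.  The QP is not touched yet. */
void disconnect_and_flush_start(struct disconnect_request *r)
{
    assert(!r->completed);
    r->cep->state = CEP_DISCONNECTING;
}

/* Phase 2: runs when the CEP operation completes, possibly in another thread.
 * Moves the QP to the error state, flushing whatever was still posted. */
void disconnect_and_flush_on_cep_done(struct disconnect_request *r)
{
    r->cep->state = CEP_DISCONNECTED;
    r->qp->state = QP_ERROR;
    r->qp->flushed += r->qp->posted;
    r->qp->posted = 0;
    r->completed = true;      /* the overlapped operation completes here */
}
```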

> What would be nice is a way to direct an operation on a file to use or
> not use an I/O completion port on a per operation basis, or allow a file
> to support multiple I/O completion ports.

You can control whether I/O requests complete to the IOCP or not by setting the lowest bit of the event handle in the OVERLAPPED structure that you specify in your overlapped operation.  From the GetQueuedCompletionStatus docs:

"Even if you have passed the function a file handle associated with a completion port and a valid OVERLAPPED structure, an application can prevent completion port notification. This is done by specifying a valid event handle for the hEvent member of the OVERLAPPED structure, and setting its low-order bit. A valid event handle whose low-order bit is set keeps I/O completion from being queued to the completion port."
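
In code, the trick is just an OR with 1 on the handle value.  The helpers below are a portable sketch with made-up names (handle_value, suppress_iocp_notification, original_event); on Windows you would apply the same cast to a real HANDLE, as shown in the comment.

```c
#include <assert.h>
#include <stdint.h>

/* Stands in for a Win32 HANDLE so the sketch compiles anywhere. */
typedef void *handle_value;

/* Returns the event handle with its low-order bit set.  Stored in
 * OVERLAPPED.hEvent, this keeps the completion off the completion port. */
handle_value suppress_iocp_notification(handle_value event)
{
    return (handle_value)((uintptr_t)event | (uintptr_t)1);
}

/* Recovers the untagged handle value. */
handle_value original_event(handle_value tagged)
{
    return (handle_value)((uintptr_t)tagged & ~(uintptr_t)1);
}

/*
 * Windows usage, assuming hEvent came from CreateEvent:
 *
 *     OVERLAPPED ov = {0};
 *     ov.hEvent = (HANDLE)((ULONG_PTR)hEvent | 1);
 *     // issue the overlapped call with &ov, then wait on hEvent
 */
```

Operations issued with a plain OVERLAPPED on the same file still complete to the IOCP, so the choice is per operation.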

You can't have a file bound to multiple IOCPs.  But you can have multiple files open to the same device, and the kernel driver can associate multiple files with the same application context if it so desires.

-Fab