[ofw] RE: [PATCH] Support for ND over WV (incomplete)

Fab Tillier ftillier at windows.microsoft.com
Tue May 12 15:36:01 PDT 2009


>> Here's the prototype for the DisconnectAndFlush method that would be
>> needed for an ND provider over WV.
>> 
>> I don't have any code to implement that, but for now it could be
>> implemented as returning an error status.  This mainly gets the WV
>> interfaces in place for the 2.1 release.
>> 
>> Hopefully not too many things pop up when we get to the actual
>> implementation.
>> 
>> I changed the interface GUID since the interface changed.
>> 
>> Thoughts/comments?
> 
> I believe these are the issues that need to be dealt with:
> 
> - We can safely modify the QP state to error when:
> 	a) a DREQ has been received
> 	b) a DREP has been received
> - CM callbacks are at dispatch.
> - For a) we can modify to error in the downcall from userspace
>   We just need to know if a DREQ has been received
>   (Note that all other QP modify calls are done in downcalls.)

Just to clarify, you would do the QP modify only if the DREQ has been received, yes?  If you modify before the DREQ (or DREP) has been received from the other side, you risk a retry timeout error on the last send at the remote peer.  You cannot issue the DREQ and immediately move the QP to error.  Tried it, it doesn't work.
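
To be concrete, I mean a guard like this in the downcall path (just a sketch - the flag and helper names are invented for illustration, not actual WinVerbs fields):

#include <ntddk.h>

/* Hypothetical endpoint state flags: set when the CM callback sees the
 * remote DREQ or DREP. */
#define WV_EP_DREQ_RECEIVED  0x1
#define WV_EP_DREP_RECEIVED  0x2

/* Only allow the modify-to-error once the peer's DREQ or DREP has
 * arrived; flushing earlier makes the peer's last send fail with a
 * retry timeout. */
NTSTATUS WvEpCheckFlushAllowed(ULONG epFlags)
{
    if ((epFlags & (WV_EP_DREQ_RECEIVED | WV_EP_DREP_RECEIVED)) == 0)
        return STATUS_CONNECTION_ACTIVE;    /* too early to flush */

    return STATUS_SUCCESS;                  /* safe to move the QP to error */
}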

> - For b) we cannot call modify in the CM callback
> 
> The last bullet is what needs to be dealt with.  The kernel could store
> the QpId with the WV_ENDPOINT structure when EP:Accept() is called, but
> it still needs to queue the modify call to another thread.

Yep, this is what's done in the ND proxy code - queue an I/O work item to do the QP modification.  It's easy to allocate the work item when the IOCTL first comes down, to eliminate the risk of allocation failure in the CM callback, if desired.
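Something along these lines - a sketch only, with the endpoint structure and Wv* names made up for illustration, though IoAllocateWorkItem is the real DDK call:

#include <ntddk.h>

/* Minimal stand-in for the endpoint state; the real WV_ENDPOINT has
 * much more.  FlushWorkItem is the one extra field this scheme needs. */
typedef struct _WV_EP_SKETCH {
    PVOID        Qp;             /* whatever handle the QP modify needs */
    PIO_WORKITEM FlushWorkItem;  /* pre-allocated so the CM callback
                                  * never allocates at dispatch */
} WV_EP_SKETCH;

/* IOCTL handler (hypothetical name): we're at PASSIVE_LEVEL here, so an
 * allocation failure can still be returned to the caller cleanly. */
NTSTATUS WvEpDisconnectAndFlushIoctl(PDEVICE_OBJECT devObj, WV_EP_SKETCH *ep)
{
    ep->FlushWorkItem = IoAllocateWorkItem(devObj);
    if (ep->FlushWorkItem == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* ...issue the DREQ, pend the IRP, etc.... */
    return STATUS_PENDING;
}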

Did I ever send my patch to change WvQpModify to be async?  It used an I/O work item to delay the QP modification.  A similar technique could be used in the DisconnectAndFlush case.
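Roughly the shape of the deferred piece, using the sketch endpoint from above (again, the Wv* names are placeholders; the IoQueueWorkItem pattern is the real mechanism):

NTSTATUS WvQpModifyToError(PVOID qp);    /* hypothetical modify-to-error helper */

/* Work routine runs at PASSIVE_LEVEL on a system worker thread, where
 * the HCA driver's QP modify is safe to call today. */
VOID WvEpFlushWorkRoutine(PDEVICE_OBJECT devObj, PVOID context)
{
    WV_EP_SKETCH *ep = context;

    UNREFERENCED_PARAMETER(devObj);
    WvQpModifyToError(ep->Qp);
    IoFreeWorkItem(ep->FlushWorkItem);
    ep->FlushWorkItem = NULL;
    /* ...complete the pended DisconnectAndFlush IRP here... */
}

/* CM callback runs at DISPATCH_LEVEL, so it just queues the
 * pre-allocated work item - no allocation and no QP modify here. */
VOID WvEpCmDisconnectHandler(WV_EP_SKETCH *ep)
{
    IoQueueWorkItem(ep->FlushWorkItem, WvEpFlushWorkRoutine,
                    DelayedWorkQueue, ep);
}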

> The WV ND provider could probably handle this by allocating a thread and
> separating the ND:Disconnect call into two WV calls, Disconnect and
> Modify.  It should be possible to handle this in the kernel as well, but
> it still requires a thread switch.  I think we need to look at the
> advantages of these two approaches, unless someone can come up with a
> third approach.

I think handling it in the kernel is preferable to having an extra thread per process in the ND provider just to handle disconnection.  The I/O work items get queued to the system thread pool, so they end up sharing resources that already exist.  As the number of cores per system goes up, having a bunch of extra threads just to handle this is a pain.  Plus, setting it up to be async in the kernel lays the groundwork for streamlining things once the HCA driver supports QP modify at dispatch.

> The only advantage I can think of for handling this in the kernel
> depends on having a QP modify call that can occur at dispatch.

Even without the HCA driver supporting QP modify at dispatch (have I mentioned it should?), there's added flexibility in having this handled asynchronously in the kernel, as I demonstrated with my prototype work on the QP Modify calls in WinVerbs.
 
> Regardless of which approach is used, I think we can handle it without
> changing the API or IOCTL interfaces.

If you don't want threads in the ND provider, you have to do this via an IOCTL/API change.  The completion of the disconnect gets reported to the user, so there's no opportunity to intercept it and flush the QP.

The third way would be to do a synchronous disconnect followed by an asynchronous QP modify (though that too is handled synchronously in the existing code), but this would have an effect on process cleanup and overall runtimes.  Basically, it won't scale well due to serialization if the client is single threaded (which is the case the majority of the time).

-Fab




