[ofw] winverbs ND provider

Thu Jan 14 09:52:14 PST 2010

Hi Tzachi

Tzachi Dar wrote on Thu, 14 Jan 2010 at 08:38:56
> The good old ibal has managed to do arm without going from user to kernel.

Actually, the arm operation itself doesn't go to the kernel. If you are using callbacks, though, the internal IBAL threads will send an IOCTL down to the kernel for notification. If you are using events, the kernel will signal the event that was specified when the CQ was created.

For the callback case, you still have a kernel transition - it's just in a different thread context, and it happens once per CQ notification.

>> How was IBAL able to eliminate kernel transitions for CQ
>> notifications?
> 
> One of the mechanisms that IBAL was using was events that were shared
> between the user and kernel.
> When the kernel has to signal something it signals the event, which is
> much more efficient.
> On the other hand it is not as general as the overlapped mechanism.

Correct, and the event mechanism is limited by WaitForMultipleObjects, which can wait on at most MAXIMUM_WAIT_OBJECTS (64) handles per call. The event mechanism works great if you have a thread per CQ, but then you pay for thread context switches. As soon as you try to do more than one thing with your thread, it becomes a burden.
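
To put a number on that limit, here is a minimal sketch in C of how many waiter threads the event model needs once a process has more CQs than a single wait call can cover. WAIT_LIMIT mirrors the Windows MAXIMUM_WAIT_OBJECTS value (64), and both helper names are made up for illustration; none of this is IBAL or WinVerbs code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the Windows MAXIMUM_WAIT_OBJECTS constant (64); redefined here
 * so the sketch builds without <windows.h>. */
#define WAIT_LIMIT 64u

/* Hypothetical helper: number of waiter threads needed to cover cq_count
 * CQ events when each thread blocks in one WaitForMultipleObjects call. */
static size_t waiter_threads_needed(size_t cq_count)
{
    return (cq_count + WAIT_LIMIT - 1) / WAIT_LIMIT;  /* ceiling division */
}

/* Hypothetical helper: how many events waiter thread `index` would pass
 * to its wait call. Requires cq_count > 0 and a valid thread index. */
static size_t events_for_waiter(size_t cq_count, size_t index)
{
    size_t threads = waiter_threads_needed(cq_count);

    assert(index < threads);
    if (index + 1 < threads)
        return WAIT_LIMIT;                 /* every thread but the last is full */
    return cq_count - index * WAIT_LIMIT;  /* the last thread takes the remainder */
}
```

With 200 CQs this comes to four waiter threads (64, 64, 64 and 8 events), and each of those threads can do nothing else while it is blocked in the wait.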

>>> There is another issue that one should think of: Using fast dispatch
>>> routines instead of the normal dispatch routines. This technique
>>> decreases the time it takes to move from user to kernel dramatically.
>>> Since the nd_conn test is moving many times from user to kernel it
>>> will improve things.
>> 
>> The IBAL ND provider doesn't do this, though, so the perf
>> difference can't be related to fast dispatch.  Further, the
>> operations that are performed by the async operations
>> (Connect, CompleteConnect, Accept, Disconnect) are all
>> asynchronous.  The latter three perform QP state
>> modifications, which could thus never be handled in a fast
>> dispatch routine.  The only call that might benefit from a
>> fast dispatch handler is GetConnectionRequest, since there
>> may be a request there already.  I don't know if the fast
>> dispatch routines are beneficial when using I/O completion
>> ports, though.
> 
> It is possible to design a completely different mechanism for doing
> operations between the kernel and the user. It should probably save a
> few us for each operation. Since we are currently dealing with around
> 800us I guess that there are many other issues to solve before we get
> there.

Right, I would much rather see a better-designed HCA driver before we go playing around with fast dispatch routines. Most of the operations that go to the kernel are expected to be asynchronous, so fast dispatch wouldn't work for them anyway.
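
As a back-of-the-envelope check on that priority, here is a small sketch in C that computes what share of the total time a per-operation saving would recover. The 800 us figure is the one quoted above; the 4 us saving used in the note below is only an assumed stand-in for "a few us", and the function name is made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: fraction of total_us recovered by saving saved_us.
 * Both arguments are in microseconds. */
static double fraction_saved(double saved_us, double total_us)
{
    assert(total_us > 0.0);
    return saved_us / total_us;
}
```

Under that assumption, fraction_saved(4.0, 800.0) works out to 4/800, or half a percent for one operation, which is why the driver itself looks like the better place to spend effort first.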

-Fab