[ofw] RE: [PATCHv2] WinVerbs: Make QP modification asynchronous

Thu Feb 12 18:53:04 PST 2009

>> The app isn't multithreaded so I ran two apps side by side:
>>
>> Async: 3562
>> Sync: 2705
>
> Any theories on why the async case increased so much versus your
> previous run?

Sync went from 1801 to 2705, a 50% increase.
Async went from 2634 to 3562, a 35% increase.

In both cases, the rate went up by about 900.  I'd attribute that to QP creation/destruction being done two by two now, rather than one by one.
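
For anyone checking the arithmetic, here is a minimal sketch that recomputes the deltas from the four numbers quoted above. The helper name `change` is made up for illustration and is not part of the test app.

```python
def change(before, after):
    """Return (absolute increase, percent increase) between two reported rates."""
    delta = after - before
    return delta, 100.0 * delta / before

# Numbers taken from the previous run and the side-by-side run in this thread.
sync_delta, sync_pct = change(1801, 2705)
async_delta, async_pct = change(2634, 3562)

print(f"sync:  +{sync_delta} ({sync_pct:.0f}%)")   # sync:  +904 (50%)
print(f"async: +{async_delta} ({async_pct:.0f}%)")  # async: +928 (35%)
```

The absolute gain is nearly the same in both cases (904 vs. 928), which is why the relative gain looks larger for the slower sync case.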

> Note that the 2-'threaded' sync case now outperforms the
> 1-thread async case with work items.  I'm guessing this is the cost of
> the context switch.

The 2-threaded sync case should outperform the 1-thread async case, because you can create and destroy two QPs at the same time, which you can't in the single-threaded case.  Remember my anecdote about not releasing the QPs and how that gave a ~400 boost to the results?

So I modified the test to create the QPs outside of the timed loop, and likewise destroy them after the timed loop.  So the test now times only the QP transitions to INIT, RTR, RTS, and ERROR.
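
Roughly, the restructured test has the shape below. This is only a sketch of the loop structure, not the actual test code: `FakeQp` and `timed_transitions` are hypothetical stand-ins, and the real test calls into WinVerbs rather than appending to a list.

```python
import time

# The four transitions that remain inside the timed region.
TRANSITIONS = ("INIT", "RTR", "RTS", "ERROR")

class FakeQp:
    """Hypothetical stand-in for a QP; it only records requested states."""
    def __init__(self):
        self.history = []

    def modify(self, state):
        self.history.append(state)

def timed_transitions(qp_count):
    # Creation happens before the clock starts.
    qps = [FakeQp() for _ in range(qp_count)]

    start = time.perf_counter()
    for qp in qps:
        for state in TRANSITIONS:
            qp.modify(state)
    elapsed = time.perf_counter() - start

    # Destruction would happen here, after the clock has stopped.
    return qps, elapsed

qps, elapsed = timed_transitions(1000)
print(f"{len(qps)} QPs moved through {len(TRANSITIONS)} states in {elapsed:.6f}s")
```

With creation and destruction outside the timed region, the only cost being measured is the modify path itself.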

User/Kernel     1 process    2 processes
Sync/Async      3012         4400
Async/Async     7168         7050
Sync/Sync       3363         4902
Async/Sync      3289         4764

For the Async/Async test I changed the test to use 20K QPs...  This is the last test I ran so I didn't feel like rerunning the others.

>> The CPU wasn't pegged in either of these tests, so I ran again with 4
>> apps side by side...
>>
>> Async: 3933
>> Sync: 3869
>
> I'm surprised by this.  I would have expected the async to do worse at
> some point.

Why?  The system won't create a thread per request, so work items can queue up on the system threads and get executed serially, but without the transitions between user and kernel mode (those end up overlapping the modify).
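
As a user-mode analogy only (this is not the kernel's work item mechanism), a fixed-size thread pool shows the same behavior: many requests, few threads, and the excess simply waits in the queue. `POOL_SIZE` and `modify_work_item` are assumptions made up for the sketch; the real number of system worker threads isn't something I've stated here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2    # assumed value for illustration
REQUESTS = 16

def modify_work_item(request_id):
    # Placeholder for the QP modify performed by a worker thread.
    return request_id, threading.get_ident()

with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    # Requests beyond POOL_SIZE queue up and run as workers become free.
    results = list(pool.map(modify_work_item, range(REQUESTS)))

worker_ids = {tid for _, tid in results}
print(len(results), "requests handled by", len(worker_ids), "worker thread(s)")
```

No matter how many requests are submitted, the number of distinct worker threads never exceeds the pool size.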

>> Still not CPU bound, but close and I think the part that isn't CPU
>> bound is due to QP creation and destruction.  It's hard to get all 4 to
>> start at the same time, too, so there's a little skew between them.
>
> As long as the test run is sufficiently long, the skew shouldn't really
> matter.

I tried to keep the tests running more than a second.  My original loop of 8 test runs didn't work for the async/async test above: the skew in process startup meant that by the time the test ended, the processes each reported results over 6000.  I had to run them one at a time.

> I was hoping this would be an easy change to your test app.
> (E.g. pass in NULL for overlap when modifying the QP, and winverbs will
> do the operation synchronously.)

Oh, I forgot about this...  Wasn't as simple of a change as I'd hoped, but not hard...

Sync: 1806
Async: 1672

That's a lot more significant than I would have expected...

I suppose we could tell the kernel driver whether the call is intended to be synchronous or not so that the sync behavior would skip the work item, but still get the extra performance for the async case.
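
To make the idea concrete, here is a sketch of that dispatch decision, modeled in user mode with a thread pool standing in for the work item queue. Nothing here is the driver's actual interface: `modify_qp`, `do_modify`, and the `synchronous` flag are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def do_modify(qp, state):
    # Placeholder for the actual QP state change.
    qp["state"] = state
    return state

def modify_qp(pool, qp, state, synchronous):
    """Run the modify inline for sync callers, queue it for async callers."""
    if synchronous:
        # Sync caller: skip the queue and do the work in the caller's context.
        return do_modify(qp, state)
    # Async caller: hand the work to a worker and return immediately.
    return pool.submit(do_modify, qp, state)

with ThreadPoolExecutor(max_workers=1) as pool:
    qp = {}
    print(modify_qp(pool, qp, "INIT", synchronous=True))   # INIT
    pending = modify_qp(pool, qp, "RTR", synchronous=False)
    print(pending.result())                                 # RTR
```

The sync path avoids the hand-off to another thread entirely, while the async path keeps the queued behavior that produced the better async numbers.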

-Fab