[nvmewin] Question on Processor to MSI vector mapping

Greg de Valois Greg.deValois at SanDisk.com
Wed Feb 15 14:31:15 PST 2012


Paul:

Ok, what I wasn't getting is that you're allocating the queue pairs using SpecifyCache with the proc number. Got it.
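
(For my own notes, that's presumably StorPortAllocateContiguousMemorySpecifyCacheNode, along the lines of the sketch below; the node-lookup helper and variable names are mine, not the driver's.)

    /* Minimal sketch of per-core NUMA-local SQ allocation, assuming the
     * core's NUMA node is already known. NodeNumberForCore() is a
     * hypothetical helper, not a Storport call. */
    PHYSICAL_ADDRESS low = {0}, high, boundary = {0};
    PVOID sqBuffer = NULL;
    ULONG status;

    high.QuadPart = (LONGLONG)-1;            /* no upper physical limit */

    status = StorPortAllocateContiguousMemorySpecifyCacheNode(
                 pHwDevExt,                  /* miniport device extension */
                 sqEntries * sqEntrySize,    /* SQ ring size in bytes */
                 low, high, boundary,
                 MmCached,
                 NodeNumberForCore(coreNum), /* hypothetical core->node map */
                 &sqBuffer);
    if (status != STOR_STATUS_SUCCESS || sqBuffer == NULL) {
        /* fall back to an allocation on the default node */
    }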

But if you think that having the SQ memory itself allocated close to the associated core doesn't matter much (I would think it wouldn't, since the driver is accessing other memory that is not associated with that core either), then I don't know why you wouldn't just use the StorPort API.

Also, I'm not sure what you mean by the following:

I would still avoid the storport API because I don't think it adds value over the learned method, and it requires us to use the DPC steering option, which I've witnessed to have unpredictable side effects.

By "DPC steering" I assume you mean enabling the STOR_PERF_DPC_REDIRECTION flag for StorPortInitializePerfOpts, which I see the driver is doing, or are you referring to something else?

But I understand what you're doing now with the core-vector mapping. Thanks for the explanation.

Greg

________________________________
From: Luse, Paul E [paul.e.luse at intel.com]
Sent: Wednesday, February 15, 2012 1:20 PM
To: Luse, Paul E; Greg de Valois; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping

Ugh, small but important correction below

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E
Sent: Wednesday, February 15, 2012 2:16 PM
To: Greg de Valois; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Question on Processor to MSI vector mapping

The only difference between what we're doing and what you get with either the storport API or the 'learning method' I mention is the locality of the SQ memory when the IO is submitted. With what we're doing now, we know that the submission core is local to the SQ memory. We don't have that guarantee otherwise. I believe the difference is negligible, which is why I'm proposing we move to the learning mode (a very small code change; I've made it and tested it). I prefer learning mode over the storport API because it works regardless of OS dependencies, APIC modes, etc.

As an example of the tiny SQ memory difference, take this scenario:

Core 0:  SQ0 and CQ0 are allocated local to core 0.
Core 1:  SQ1 and CQ1 are allocated local to core 1.
Core 2:  SQ2 and CQ2 are allocated local to core 2.

In our current implementation, if we get an IO coming in on Core 1:

We submit it to SQ1 (local memory). When we created the CQ we specified the vector, so we also know that this IO will complete on CQ1.

Using the storport API, if we get an IO coming in on Core 1:

There's no assurance that we'll be using SQ1. If the API tells us that the vector is 2, for example, we'd then choose to submit on *SQ2*, which is local to core 2, not core 1. On the completion side everything is just as optimal: the IO completes on core 2, and both the SQ and CQ for core 2 are local to core 2.
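
In rough C (the queue table and names are made up for illustration, not our actual structures):

    /* Current implementation: index the SQ by the submitting core, so
     * the SQ memory is always local to the core doing the writes. */
    SUB_QUEUE *pSQ;                             /* illustrative queue type */
    PROCESSOR_NUMBER procNum;
    StorPortGetCurrentProcessorNumber(pHwDevExt, &procNum);
    pSQ = &pDevExt->SubQueues[procNum.Number];  /* assumes one CPU group */

    /* Storport-API alternative: ask which MSI message this IO will
     * complete on and submit to that vector's queue pair, which may be
     * local to a different core (e.g. message 2 -> SQ2 on core 1). */
    STARTIO_PERFORMANCE_PARAMETERS perfParams = {0};
    perfParams.Size    = sizeof(STARTIO_PERFORMANCE_PARAMETERS);
    perfParams.Version = STOR_PERF_VERSION;  /* assumption: shared version
                                                constant in storport.h */
    if (StorPortGetStartIoPerfParams(pDevExt, pSrb, &perfParams) ==
            STOR_STATUS_SUCCESS) {
        pSQ = &pDevExt->SubQueues[perfParams.MessageNumber];
    }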

Does that help?

Thx
Paul


From: Greg de Valois [mailto:Greg.deValois at sandisk.com]
Sent: Wednesday, February 15, 2012 2:03 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping

I'm not understanding why it matters which core has been assigned which vector: an IO is associated with a vector, put onto the queue associated with that vector, and the controller completes the IO using that vector.

Why should the driver and controller care what the mapping of vectors to processors is?

Thanks, Greg

________________________________
From: Luse, Paul E [paul.e.luse at intel.com]
Sent: Wednesday, February 15, 2012 12:46 PM
To: Greg de Valois; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping
I think the point you're missing (or I'm not understanding) is that when the driver creates the CQ, it tells the device which MSI-X vector to use for that CQ. We don't know the vector-to-core mapping at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers, but we don't know which one is associated with which core. The API only provides that info on a per-IO basis. So everything you describe is correct; the downside (not that big a deal, but as compared to what we're doing now) is that because we don't know the mapping when we create the SQ/CQ pair, we may be placing the IO on a different core than the SQ is associated with, because the API told us about a vector that is associated with a different queue pair than the one we set up for the current core.
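
To make it concrete: the binding happens in the Create I/O Completion Queue command itself, roughly like this (field layout per the NVMe spec; variable names illustrative):

    /* The CQ-to-vector binding is fixed at creation time: the Create I/O
     * Completion Queue command carries the interrupt vector in CDW11,
     * long before the first IO arrives. */
    cmd.CDW10 = ((qSize - 1) << 16) | cqId;  /* queue size (0-based), CQ ID */
    cmd.CDW11 = (msixVector << 16)           /* IV: MSI-X vector for this CQ */
              | (1 << 1)                     /* IEN: interrupts enabled */
              | (1 << 0);                    /* PC: physically contiguous */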

Feel free to call me if that would be easier – 480 554 3688

Thx
Paul

From: Greg de Valois [mailto:Greg.deValois at sandisk.com]
Sent: Wednesday, February 15, 2012 1:40 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping

My understanding is that Windows assigns the number of MSI vectors requested by the INF (if possible) to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes in, it is put on the queue associated with the vector returned from the API call. This vector is one that has been assigned to the currently executing core. As long as that API works correctly, the driver doesn't have to do anything more to ensure that the IO is submitted and completed on that vector.

Is there some part of those statements that isn't correct?

I'm a little confused here.

Greg

________________________________
From: Luse, Paul E [paul.e.luse at intel.com]
Sent: Wednesday, February 15, 2012 11:14 AM
To: Greg de Valois; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping
Because we create the SQ and CQ before the first IO, and we have to provide the CQ's vector when it is created. So, for example, let's say we have 3 cores and an SQ/CQ pair, all numbered the same as the core #. When we create them, we arbitrarily give each CQ a vector to complete on. When the first IO comes, let's say it's on core 1 and the storport API tells us the vector we should expect it on is 3; we'd need to submit on SQ3, but we're on core 1. There are ways around this, of course: we could do some creative things in passiveInit wrt the creation of the CQs (submitting test IOs, re-creating, etc.), but I don't think those are worth the complexity.

Make sense?

Thx
Paul

From: Greg de Valois [mailto:Greg.deValois at sandisk.com]
Sent: Wednesday, February 15, 2012 12:06 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping

Paul:

Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than you're completing on? The core you're executing on is the one assigned the vector that the API gives you; you put the IO on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well.

What am I missing here?

Greg

________________________________
From: Luse, Paul E [paul.e.luse at intel.com]
Sent: Wednesday, February 15, 2012 10:57 AM
To: Greg de Valois; nvmewin at lists.openfabrics.org
Subject: RE: Question on Processor to MSI vector mapping
Hi Greg-

Thanks for the question, it's a good one! The reasoning was mainly that it was fairly straightforward and had one benefit over the method you mention: we could ensure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways one could approach this problem; we discussed a few as part of the development of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies.

If one uses the method you mention below, we'd create our SQ/CQ pairs per core, NUMA optimized, and then be submitting on a different core than we complete on; however, we'd still be optimizing the completion side. The more I thought about this, the more I realized it's actually not buying us much of anything over using the Msft API, due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, though I can't say from experience that I've seen this.

That said, I will likely be proposing an alternate method in the near future, so I'll go ahead and propose it now since you brought up the subject:

Proposal:  No longer decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side (a rough sketch follows after the pro/con below). I would still avoid the storport API because I don't think it adds value over the learned method, and it requires us to use the DPC steering option, which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations work within storport, so we can better gauge whether they're a good fit for us or not.

Pro:  This will *always* work, whereas the method we have now does not work in APIC logical mode. I prefer a simple "one size fits all" solution every time over a two-path solution, even if one path is slightly optimized. It makes the driver more maintainable and gives us fewer variables in debug (right now we don't even store whether we found the APIC in physical or logical mode, so during debug you don't really know).

Con:  The SQ memory may be on a different core than the submitting thread, but I don't believe this is a measurable issue. We can certainly perform some experiments to check, though.
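
Roughly, the learned mapping would look like this (sketch only; the table and names are illustrative, not the final code):

    /* 'Learn and update': start with a 1:1 core-to-queue-pair guess,
     * then correct the table on the completion side once we observe
     * which core each vector actually fires on. */

    /* At passive init: assume queue pair i serves core i. */
    ULONG i;
    for (i = 0; i < numCores; i++)
        pDevExt->QueuePairForCore[i] = i;

    /* In the per-message ISR/DPC, MessageId = the MSI-X vector that
     * fired: */
    PROCESSOR_NUMBER procNum;
    StorPortGetCurrentProcessorNumber(pHwDevExt, &procNum);
    if (pDevExt->QueuePairForCore[procNum.Number] != MessageId) {
        /* Completions for MessageId land on this core, so IOs from this
         * core should be submitted on that queue pair from now on. */
        pDevExt->QueuePairForCore[procNum.Number] = MessageId;
    }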

Other thoughts?

Thx
Paul

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois
Sent: Tuesday, February 14, 2012 5:02 PM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] Question on Processor to MSI vector mapping

All:

I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly?
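
(For context, my rough understanding of the translation in question is sketched below; the per-vector table and the APIC-ID-to-core helper are illustrative, not the driver's actual code.)

    /* The driver appears to derive each vector's target core by
     * decomposing the MSI message address: on x86 the APIC Destination
     * ID sits in bits 19:12 of the address, which in physical APIC mode
     * is the target core's local APIC ID. */
    MESSAGE_INTERRUPT_INFORMATION msiInfo;
    ULONG msgId;

    for (msgId = 0; msgId < numVectors; msgId++) {
        if (StorPortGetMSIInfo(pHwDevExt, msgId, &msiInfo) ==
                STOR_STATUS_SUCCESS) {
            ULONG apicId = (msiInfo.MessageAddress.LowPart >> 12) & 0xFF;
            /* Valid in physical APIC mode only; in logical mode this
             * field is a logical destination mask, not an APIC ID. */
            coreForVector[msgId] = CoreFromApicId(apicId); /* hypothetical */
        }
    }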

Thanks,

Greg de Valois
SanDisk
