From Alex.Chang at idt.com Wed Feb 1 20:54:44 2012 From: Alex.Chang at idt.com (Chang, Alex) Date: Thu, 2 Feb 2012 04:54:44 +0000 Subject: [nvmewin] Patch Review Request In-Reply-To: <45C2596E6A608A46B0CA10A43A91FE1602FC6068@CORPEXCH1.na.ads.idt.com> References: <82C9F782B054C94B9FC04A331649C77A033980@FMSMSX106.amr.corp.intel.com> <45C2596E6A608A46B0CA10A43A91FE1602FC6068@CORPEXCH1.na.ads.idt.com> Message-ID: <548C5470AAD9DA4A85D259B663190D367EEF@corpmail1.na.ads.idt.com> Hi all, The updated patch is attached. The changes were made per the feedback from the code review last Tuesday. I have run all the tests to ensure the Format NVM command is working correctly. Please review it and let me know if you have any questions. Thanks, Alex ________________________________ From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Chang, Alex Sent: Friday, January 27, 2012 4:03 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Patch Review Request Hi all, I am re-sending the request. Please disregard the previous one I sent last week. Attached please find the implementation of handling the Format NVM command and Namespace Hot Add/Remove via IOCTL Pass Through: - nvmeinit.c: Removed redundant code in MSI enumeration. - nvmestd.c : Main implementation of Format NVM - nvmesnti.c : a. Includes logic to block IO to the target namespace while it is being formatted, plus the namespace add-back logic b. Removed support for Read/Write(16) due to a bugcheck (0x19) when running SCSI Compliance 2.0 - nvmestd.h : Added definition of the FORMAT_NVM_INFO structure. - nvmeioctl.h: a. Added three more error codes, associated with Format NVM, that can be returned in the ReturnCode field of SRB_IO_CONTROL. b. Added two IOCTL codes, NVME_HOT_ADD_NAMESPACE and NVME_HOT_REMOVE_NAMESPACE. Now, Format NVM can be done in two ways: 1. A single Format NVM command via an IOCTL Pass Through request. The steps the driver takes to complete the request are: a. Removes the target namespace(s) first b. Issues the Format NVM command c. Re-fetches the Identify Controller structure d. Re-fetches the Identify Namespace structure(s) e. Adds back the formatted namespace(s) 2. With NVME_HOT_ADD/REMOVE_NAMESPACE IOCTL calls and a Format NVM command via an IOCTL Pass Through request: a. Issue an NVME_HOT_REMOVE_NAMESPACE IOCTL request to remove the target namespace(s) b. Issue the Format NVM command via an IOCTL Pass Through request, which covers Steps b, c and d of the first method. c. Issue an NVME_HOT_ADD_NAMESPACE IOCTL request to add back the formatted namespace(s) Thanks, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: FormatNVM_new.zip Type: application/x-zip-compressed Size: 92960 bytes Desc: FormatNVM_new.zip URL:
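For readers who want to see what the pass-through path Alex describes looks like from the application side, here is a minimal user-mode sketch. DeviceIoControl, IOCTL_SCSI_MINIPORT and the SRB_IO_CONTROL header are standard Windows interfaces, and the Format NVM opcode/CDW10 layout comes from the NVMe spec; the pass-through structure layout, the "NvmeMini" signature and the control-code value are placeholders only, since the real definitions live in the driver's nvmeIoctl.h.

    #include <windows.h>
    #include <ntddscsi.h>   /* SRB_IO_CONTROL, IOCTL_SCSI_MINIPORT */
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical pass-through layout; the real one is defined in nvmeIoctl.h. */
    typedef struct _HYPOTHETICAL_NVME_PASS_THROUGH {
        SRB_IO_CONTROL Header;       /* standard miniport IOCTL header          */
        ULONG          NVMeCmd[16];  /* 64-byte NVMe command (Format NVM here)  */
        ULONG          CplEntry[4];  /* completion entry returned by the driver */
    } HYPOTHETICAL_NVME_PASS_THROUGH;

    int IssueFormatNvm(HANDLE hDevice, ULONG nsid)
    {
        HYPOTHETICAL_NVME_PASS_THROUGH pt;
        DWORD returned = 0;

        memset(&pt, 0, sizeof(pt));
        pt.Header.HeaderLength = sizeof(SRB_IO_CONTROL);
        memcpy(pt.Header.Signature, "NvmeMini", 8);   /* assumed signature        */
        pt.Header.Timeout      = 120;                 /* formats can take a while */
        pt.Header.ControlCode  = 0xE0000000;          /* placeholder value        */
        pt.Header.Length       = sizeof(pt) - sizeof(SRB_IO_CONTROL);

        /* Format NVM is admin opcode 0x80 (NVMe 1.0). CDW10 = 0 selects LBA
         * format 0, no protection information, no secure erase. */
        pt.NVMeCmd[0]  = 0x80;        /* CDW0: opcode                            */
        pt.NVMeCmd[1]  = nsid;        /* CDW1: namespace ID (0xFFFFFFFF = all)   */
        pt.NVMeCmd[10] = 0;           /* CDW10: LBAF/MSET/PI/PIL/SES             */

        if (!DeviceIoControl(hDevice, IOCTL_SCSI_MINIPORT, &pt, sizeof(pt),
                             &pt, sizeof(pt), &returned, NULL)) {
            printf("DeviceIoControl failed: %lu\n", GetLastError());
            return -1;
        }
        /* Per the description above, the driver removes the namespace, formats,
         * re-fetches the Identify data and adds the namespace back before this
         * request completes, or reports one of the new error codes in ReturnCode. */
        return (int)pt.Header.ReturnCode;
    }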
From paul.e.luse at intel.com Thu Feb 2 11:29:50 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Thu, 2 Feb 2012 19:29:50 +0000 Subject: [nvmewin] first release thoughts Message-ID: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> I believe we agreed on 2 patches prior to first release. 1) Alex's Format PT IOCTL: I understand the review went well, so I suspect it will go in as soon as Rick & Ray take a look at the final patch 2) My performance & stability patch: I'll rebase and re-test once Alex's goes in and then send it out for review Once mine goes in, I wanted to level set real quick that we would have at least (more are welcome) IDT, LSI and Intel run the following in their environments and/or on QEMU: - Iometer script - BusTRACE SCSI check and busTRACE data integrity (for those who have it) - Msft SCSI compliance - Msft sdstress All of these will be run in the same manner as we ran them before, and we'll document what that means for everyone else before the release and post notes with the release. I don't want to post the tools though; folks can grab those on their own if they'd like. I suspect this will put our first release in mid to late March. I'll probably schedule a short call around then so we can all confirm that we're ready and review what it is that we're posting for our very first binary release Thanks! Paul ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.c.robles at intel.com Thu Feb 2 16:28:54 2012 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Fri, 3 Feb 2012 00:28:54 +0000 Subject: [nvmewin] nvmewin DB is locked - Merging Format NVM PT IOCTL changes Message-ID: <49158E750348AA499168FD41D889836004B585@FMSMSX105.amr.corp.intel.com> All, I'm merging IDT's patch for the Format NVM PT IOCTL implementation. The database is locked. Thanks, Ray Raymond C. Robles PCG Server Software Engineering Data Center and Connected Systems Group | Intel Corporation Office - 480.554.2600 | Mobile - 480.399.0645 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 1756 bytes Desc: image001.png URL: From raymond.c.robles at intel.com Thu Feb 2 17:50:45 2012 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Fri, 3 Feb 2012 01:50:45 +0000 Subject: [nvmewin] nvmewin merge completed - Format NVM PT IOCTL changes Message-ID: <49158E750348AA499168FD41D889836004B641@FMSMSX105.amr.corp.intel.com> All, I've completed the merge of the Format NVM PT IOCTL changes from IDT (with much help from Jo Delsey on the SVN stuff). Thanks Jo! Please update and rebase your current views. As you'll notice, Jo and I created 3 new directories - trunk, branches, tags. The trunk directory now contains the docs and source directories and will always contain the latest code (our main trunk). The branches directory will contain our release branches (currently this directory is empty, as we have not had an official release yet). The tags directory will contain copies of the trunk when a particular patch is applied. For this first patch, we created a tag, and that created a directory within the tags directory. I would propose that we use tags when applying patches going forward, for maintainability. Anyone currently working on a fix (which I think is only Paul right now) will just need to back up your changes and do an "Update" and you'll see the new directories... then just merge in your changes. Paul, you are up next in the patch queue.
Thanks, Ray Raymond C. Robles PCG Server Software Engineering Data Center and Connected Systems Group | Intel Corporation Office - 480.554.2600 | Mobile - 480.399.0645 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 1756 bytes Desc: image001.png URL: From paul.e.luse at intel.com Fri Feb 3 10:18:45 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Fri, 3 Feb 2012 18:18:45 +0000 Subject: [nvmewin] ***UNCHECKED*** REVIEW REQUEST: patch for perf fix and other misc updates Message-ID: <82C9F782B054C94B9FC04A331649C77A03A87B@FMSMSX106.amr.corp.intel.com> Password for zip is nvme1234 Main changes: (based on tags\format_nvm_pt_ioctl) 1) Fix DPC watchdog timeout under heavy IO by adding a call to set the per-LUN storport queue depth following INQUIRY 2) Added support for optional (compile switch) DPC or ISR completions. Defaulting to DPC as this is the 'standard' recommended method 3) Updated mode block descriptor creation to return all F's for # blocks if the namespace is too big to fit in the field (per SPC) 4) Changed logical mode to 1:1 map cores to MSIX vectors; not optimal for vector matching but better than sending all IO through one core, and we're covered in any scenario wrt protection on submit/complete 5) Pile of CHATHAM-only changes 6) Changed passiveInit to wait for the state machine to complete, based on lots of issues with us missing enumeration because we weren't ready and storport doesn't retry the early enum commands. Ran into this at Msft as well as UNH when using the Chatham in various platforms. Ray also got it with QEMU on his HW (different speed than mine) Tested (2008-R2 with Chatham and Win7-64 with QEMU, with and without driver verifier): - Sdstress - SCSI compliance (write 10 fails, not clear why as the trace shows no issue. Fails with baseline code also, not related to these changes) - BusTRACE scsi compliance - BusTRACE data integrity - Iometer all access specs, Q depth 32, 8 workers Changes: Nvme.inf: - Updated version Nvmeinit.c - Misc asserts added, some braces added here and there for readability - NVMeMsixMapCores(): changes to support logical mode using all cores/all vectors 1:1 mapped - Misc Chatham changes - Compile switch for DPC or ISR Nvmeio.c - New assert nvmePwrMgmt.c - Chatham-only changes nvmeSnti.c - SntiTranslateCommand() added adapter extension parm for use by the API to set per-LUN Q depth, also set Q depth post INQ - Bunch of Chatham changes - SntiCreateModeParameterDescBlock() added code to correctly fill in # blocks when we overflow nvmeSnti.h - Defines used by Q depth setting, function proto changes nvmeStd.c - DPC vs ISR compile switches - PassiveInit waits on init state machine now - Removed storport perf opt, has no effect based on our mapping - Changed assert checking on vector/proc mapping so it doesn't affect the admin queue, is ignored for QEMU and for logical mode - NVMeIsrMsix: fixed issue where shared mode would cause BSOD - Added ISR completion support - Chatham changes nvmeStd.h - Supporting struct changes Sources - New compile switches for ISR vs DPC and for QEMU - ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 -------------- next part -------------- An HTML attachment was scrubbed...
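A note on change 1 in the list above (the DPC watchdog fix): the per-LUN depth is set with the documented storport call StorPortSetDeviceQueueDepth after the INQUIRY for that LUN has been translated. A rough sketch follows, with invented names for everything except the storport API itself.

    /* Sketch only; QUEUE_DEPTH_PER_LUN and the function name are made up. */
    #include <storport.h>

    #define QUEUE_DEPTH_PER_LUN 64

    VOID SetLunQueueDepthAfterInquiry(
        PVOID DeviceExtension,        /* miniport HW device extension */
        PSCSI_REQUEST_BLOCK Srb)      /* the INQUIRY SRB just handled */
    {
        /* Capping the number of requests storport keeps outstanding per LUN
         * keeps a single completion DPC from running long enough to trip the
         * DPC watchdog under heavy IO; the change above applies it once the
         * INQUIRY has completed for the LUN. */
        StorPortSetDeviceQueueDepth(DeviceExtension,
                                    Srb->PathId,
                                    Srb->TargetId,
                                    Srb->Lun,
                                    QUEUE_DEPTH_PER_LUN);
    }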
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: source.zip Type: application/x-zip-compressed Size: 154806 bytes Desc: source.zip URL: From Alex.Chang at idt.com Tue Feb 7 08:46:50 2012 From: Alex.Chang at idt.com (Chang, Alex) Date: Tue, 7 Feb 2012 16:46:50 +0000 Subject: [nvmewin] first release thoughts In-Reply-To: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> Message-ID: <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> Hi Paul, For the first binary release, does openfabircs.org have its own certificate, etc. to sign it? Thanks, Alex ________________________________ From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Thursday, February 02, 2012 11:30 AM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] first release thoughts I believe we agreed on 2 patches prior to first release. 1) Alex's Format PT IOCTL: I understand the review went well so suspect as soon as Rick & Ray take a look at the final patch that will go in 2) My performance & stability patch: I'll rebase and re-test once Alex's goes in and then send out for eview Once mine goes in, wanted to level set real quick that we would have at least(more are welcome) IDT, LSI and Intel run the following in their environments and/or on QEMU: - Iometer script - BusTRACE SCSI check and busTRACE data integrity (for those who have it) - Msft SCSI compliance - Msft sdstress All of these will be run in the same manner as we ran them before and we'll document what that means for everyone else before the release and post notes with the release. I don't want to post the tools though, folks can grab those on their own if they'd like. I suspect this will put our first release in mid to late Mar. I'll probably schedule a short call around then so we can all confirm that we're ready and review what it is that we're posting for our very first binary release Thanks! Paul ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Tue Feb 7 11:14:24 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Tue, 7 Feb 2012 19:14:24 +0000 Subject: [nvmewin] first release thoughts In-Reply-To: <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com>, <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> Message-ID: No. We can't sign it right now. I'll check with OFA to see if there's any precedent though Sent from my iPhone On Feb 7, 2012, at 9:49 AM, "Chang, Alex" > wrote: Hi Paul, For the first binary release, does openfabircs.org have its own certificate, etc. to sign it? Thanks, Alex ________________________________ From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Thursday, February 02, 2012 11:30 AM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] first release thoughts I believe we agreed on 2 patches prior to first release. 
1) Alex’s Format PT IOCTL: I understand the review went well so suspect as soon as Rick & Ray take a look at the final patch that will go in 2) My performance & stability patch: I’ll rebase and re-test once Alex’s goes in and then send out for eview Once mine goes in, wanted to level set real quick that we would have at least(more are welcome) IDT, LSI and Intel run the following in their environments and/or on QEMU: - Iometer script - BusTRACE SCSI check and busTRACE data integrity (for those who have it) - Msft SCSI compliance - Msft sdstress All of these will be run in the same manner as we ran them before and we’ll document what that means for everyone else before the release and post notes with the release. I don’t want to post the tools though, folks can grab those on their own if they’d like. I suspect this will put our first release in mid to late Mar. I’ll probably schedule a short call around then so we can all confirm that we’re ready and review what it is that we’re posting for our very first binary release Thanks! Paul ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 From paul.e.luse at intel.com Tue Feb 14 09:19:23 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Tue, 14 Feb 2012 17:19:23 +0000 Subject: [nvmewin] patch reminder Message-ID: <82C9F782B054C94B9FC04A331649C77A04D485@ORSMSX152.amr.corp.intel.com> All- Although we're not on a strict timeline, I'd like to make sure patches don't sit for too long. This one is close to two weeks old, IDT & LSI can you guys take a few minutes to review and let Ray know if its good to go or comment otherwise? Note that its been running now on a test machine (full speed with Chatham hw and busTRACE 32 thread data integrity) for over a week now w/no issues. Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Friday, February 03, 2012 11:19 AM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] ***UNCHECKED*** REVIEW REQUEST: patch for perf fix and other misc updates Password for zip is nvme1234 Main changes: (based on tags\format_nvm_pt_ioctl) 1) Fix DPC watchdog timeout under heavy IO by adding call to ste per lun storport queue depth following INQ 2) Added support for optional (compile switch) DPC or ISR completions. Defaulting to DPC as this is 'standard' recommended method 3) Updated mode block descriptor creation to return all F's for # blocks if namespace is too big to fit in field (per SPC) 4) Changed logical mode to 1:1 map cores to MSIX vectors, not optimal for vector matching but better than sending all IO through one core and we're covered in any scenario wrt protection on submit/complete 5) Pile of CHATHAM only changes 6) Changed passiveInit to wait for state machine to complete based on lots of issues with us missing enum because we weren't ready and storport doesn't retry the early enum commands. Ran into this at Msft as well as UNH when using the chatham in various platforms. Ray also got it with QEMU on his HW (different speed than mine) Tested (2008-R2 with Chatham and Win7-64 with QEMU, with and without driver verifier): - Sdstress - SCSI compliance (write 10 fails, not clear why as trace shows no issue. 
Fails with baseline code also, note related to these changes) - BusTRACE scsi compliance - BusTRACE data integrity - Iometer all access specs, Q depth 32 8 workers Changes: Nvme.inf: - Updated version Nvmeinit.c - Misc asserts added, some braces added here and there for readability - NVMeMsixMapCores(): changes to support logical mode using all cores/all vectors 1:1 mapped - Misc chatham changes - Compile switch for DPC or ISR Nvmeio.c - New assert nvmePwrMgmt.c - Chatham only changes nvmeSnti.c - SntiTranslateCommand() added adapter ext parm for use by API to set per lun Q depth, also set Q depth post INQ - Bunch of chatham changes - SntiCreateModeParameterDescBlock() added code to correctly fill in # blocks when we overflow nvmeSnti.h - Defines used by Q depth setting, function proto changes nvmeStd.c - DPC vs ISR compile switches - PassiveInit waits on init state machine now - Removed storport perf opt, has on effect based on our mapping - Changed assert checking on vector/proc mapping so it doesn't affect admin queue, is ignored for QEMU and for logical mode - NVMeIsrMsix: fixed issue where shared mode would cause BSOD - Added ISR completion support - Chatham changes nvmeStd.h - Supporting sturct changes Sources - New compile switches for ISR vs DPC and for QEMU - ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.deValois at sandisk.com Tue Feb 14 16:01:36 2012 From: Greg.deValois at sandisk.com (Greg de Valois) Date: Tue, 14 Feb 2012 16:01:36 -0800 Subject: [nvmewin] Question on Processor to MSI vector mapping Message-ID: All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Wed Feb 15 10:57:30 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Wed, 15 Feb 2012 18:57:30 +0000 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: References: Message-ID: <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. 
There are many, many ways that one could approach this problem; we discussed a few as part of the development of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core, NUMA optimized, and then be submitting on a different core than we complete on; however, we'd still be optimizing the completion side. The more I thought about this, the more I realize it's actually not buying us much of anything over using the Msft API, due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time; I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future, so I'll go ahead and propose it now since you brought up the subject: Proposal: no longer decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. I would still avoid the storport API because I don't think it adds value over the learned method, and it requires us to use the DPC steering option, which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport work, so we can better gauge whether they're a good fit for us or not. Pro: This will *always* work, whereas the method we have now does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2-path solution, even if one path is slightly optimized. It makes the driver more maintainable and gives us fewer variables in debug (right now we don't even store whether we found the APIC in physical or logical mode, so during debug you don't really know). Con: SQ memory will be on a different core than the submitting thread, but I don't believe this is a measurable issue. We can certainly perform some experiments to check, though. Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed...
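To make the 'learn and update' proposal above a bit more concrete, here is a rough sketch of what the completion-side learning could look like. KeGetCurrentProcessorNumberEx is the documented kernel routine for the current processor index; the table, sizes and field names are invented for illustration and are not the driver's actual structures.

    #define MAX_CORES 64                /* illustration only                    */

    typedef struct _CORE_VECTOR_MAP {
        ULONG MsixVector;               /* vector observed to fire on this core */
        ULONG Learned;                  /* 0 = still on the initial 1:1 guess   */
    } CORE_VECTOR_MAP;

    CORE_VECTOR_MAP CoreMap[MAX_CORES]; /* init: CoreMap[i].MsixVector = i      */

    /* Called from the MSI-X interrupt (or its DPC) with the message ID storport
     * passed in. Whatever core this runs on is, by definition, the core the OS
     * routed that vector to, so record the pairing the first time it is seen. */
    VOID LearnVectorMapping(ULONG MsgId)
    {
        ULONG core = KeGetCurrentProcessorNumberEx(NULL);

        if (core < MAX_CORES && CoreMap[core].Learned == 0) {
            CoreMap[core].MsixVector = MsgId;
            CoreMap[core].Learned    = 1;
        }
    }

The submission path then just indexes the table by the current core; until an entry is learned it falls back to the 1:1 assumption, which is why this works regardless of whether the APIC is in physical or logical mode.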
URL: From Greg.deValois at sandisk.com Wed Feb 15 11:06:24 2012 From: Greg.deValois at sandisk.com (Greg de Valois) Date: Wed, 15 Feb 2012 11:06:24 -0800 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> Message-ID: Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it’s a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention – we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we’d create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we’d still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn’t find the API below to be accurate all of the time, I can’t say from experience that I’ve seen this. That said, I will likely be proposing an alternate method in the near future so I’ll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and ‘learn and update’ the mapping table on the completion side. Would still avoid the storport API because I don’t think it adds value over the learned method and requires us to use the DPC steering option which I’ve witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they’re a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple “one size fits all” solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don’t even store whether we found the APIC in phy or logical mode so during debug you don’t really know). Con: SQ mem will be on a different core than the submitting thread but I don’t believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? 
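For reference, the per-IO approach being discussed looks roughly like the sketch below. StorPortGetStartIoPerfParams and STARTIO_PERFORMANCE_PARAMETERS (with its MessageNumber field) are the documented storport interfaces; the lookup helpers are hypothetical, and a valid MessageNumber also depends on the perf options (DPC redirection) negotiated via StorPortInitializePerfOpts at init time, which is the dependency Paul alludes to above.

    /* Sketch: pick the SQ whose paired CQ uses the vector storport predicts
     * for this request. LookupQueueByVector/LookupQueueByCurrentCore are
     * hypothetical helpers; exact Version/Size initialization should be
     * checked against storport.h. */
    ULONG LookupQueueByVector(PVOID DevExt, ULONG Vector);
    ULONG LookupQueueByCurrentCore(PVOID DevExt);

    ULONG PickSubmissionQueueForSrb(PVOID DevExt, PSCSI_REQUEST_BLOCK Srb)
    {
        STARTIO_PERFORMANCE_PARAMETERS perf;

        RtlZeroMemory(&perf, sizeof(perf));
        perf.Size    = sizeof(STARTIO_PERFORMANCE_PARAMETERS);
        perf.Version = STOR_PERF_VERSION;

        if (StorPortGetStartIoPerfParams(DevExt, Srb, &perf) == STOR_STATUS_SUCCESS &&
            perf.MessageNumber != 0) {      /* assume vector 0 is the admin queue */
            return LookupQueueByVector(DevExt, perf.MessageNumber);
        }
        return LookupQueueByCurrentCore(DevExt);   /* fall back to the core table */
    }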
Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Wed Feb 15 11:14:31 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Wed, 15 Feb 2012 19:14:31 +0000 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we'd need to submit on SQ3 but we're on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ's and submitting test IOs, re-creating, etc., but those things I don't think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. 
There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we'd still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future so I'll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they're a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don't even store whether we found the APIC in phy or logical mode so during debug you don't really know). Con: SQ mem will be on a different core than the submitting thread but I don't believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... 
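Paul's point that the completion vector is fixed when the queue is created comes straight from the NVMe spec: the Create I/O Completion Queue admin command (opcode 0x05) carries the interrupt vector in CDW11, so the CQ-to-vector binding is decided long before any IO arrives. A sketch of building that command follows; the struct and field names are illustrative, not the driver's actual definitions.

    /* Create I/O Completion Queue (admin opcode 0x05, NVMe 1.0):
     *   CDW10: queue size (bits 31:16, 0-based) and queue ID (bits 15:0)
     *   CDW11: interrupt vector (bits 31:16), IEN (bit 1), PC (bit 0)
     * Struct and field names below are illustrative only. */
    typedef struct _CREATE_IO_CQ_CMD {
        ULONG     Cdw0;          /* opcode 0x05 plus command ID           */
        ULONG     Nsid;          /* unused for queue creation             */
        ULONG     Rsvd[4];       /* CDW2-3 reserved, CDW4-5 metadata ptr  */
        ULONGLONG Prp1;          /* physical address of the CQ memory     */
        ULONGLONG Prp2;
        ULONG     Cdw10;
        ULONG     Cdw11;
        ULONG     Cdw12_15[4];
    } CREATE_IO_CQ_CMD;

    VOID BuildCreateIoCq(CREATE_IO_CQ_CMD *cmd, USHORT qid, USHORT qsize,
                         USHORT msixVector, ULONGLONG cqPhysAddr)
    {
        RtlZeroMemory(cmd, sizeof(*cmd));
        cmd->Cdw0  = 0x05;                               /* Create I/O CQ     */
        cmd->Prp1  = cqPhysAddr;                         /* contiguous CQ     */
        cmd->Cdw10 = ((ULONG)(qsize - 1) << 16) | qid;
        cmd->Cdw11 = ((ULONG)msixVector << 16)           /* IV fixed here     */
                   | (1u << 1)                           /* IEN: enable ints  */
                   | (1u << 0);                          /* PC: phys contig   */
        /* From this point on, every completion posted to this CQ interrupts on
         * msixVector; there is no per-IO choice on the completion side. */
    }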
URL: From Greg.deValois at sandisk.com Wed Feb 15 12:39:30 2012 From: Greg.deValois at sandisk.com (Greg de Valois) Date: Wed, 15 Feb 2012 12:39:30 -0800 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> Message-ID: My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the availalbe processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we’d need to submit on SQ3 but we’re on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ’s and submitting test IOs, re-creating, etc., but those things I don’t think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it’s a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention – we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. 
If one uses the method you mention below, we’d create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we’d still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn’t find the API below to be accurate all of the time, I can’t say from experience that I’ve seen this. That said, I will likely be proposing an alternate method in the near future so I’ll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and ‘learn and update’ the mapping table on the completion side. Would still avoid the storport API because I don’t think it adds value over the learned method and requires us to use the DPC steering option which I’ve witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they’re a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple “one size fits all” solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don’t even store whether we found the APIC in phy or logical mode so during debug you don’t really know). Con: SQ mem will be on a different core than the submitting thread but I don’t believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... 
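For contrast with the per-IO storport call, the driver's current scheme (as described in this thread) boils down to indexing a per-core table that was filled in at init time, when the queues were allocated NUMA-local and each CQ was created with its vector. The names below (CORE_TABLE_ENTRY, SubQueueId) are invented to show the shape of the idea, not the actual nvmeStd.h structures.

    /* Hypothetical per-core table, one entry per core, filled in at init when
     * the queues are allocated NUMA-local and each CQ is created with its
     * MSI-X vector. */
    typedef struct _CORE_TABLE_ENTRY {
        USHORT SubQueueId;       /* SQ allocated local to this core        */
        USHORT CplQueueId;       /* paired CQ                              */
        USHORT MsixVector;       /* vector the paired CQ was created with  */
    } CORE_TABLE_ENTRY;

    /* Submission path: the only input needed at StartIo time is the current
     * core; everything else was decided when the queues were built. */
    USHORT PickSubmissionQueue(CORE_TABLE_ENTRY *coreTable)
    {
        ULONG core = KeGetCurrentProcessorNumberEx(NULL);
        return coreTable[core].SubQueueId;   /* SQ memory is local to 'core' */
    }

The tradeoff being debated sits exactly here: with the table, the SQ written at submission time is always local memory; with the per-IO perf-params call, the IO can land on a queue whose memory lives on another core, while the completion side stays vector-matched either way.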
URL: From paul.e.luse at intel.com Wed Feb 15 12:46:49 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Wed, 15 Feb 2012 20:46:49 +0000 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> I think the point you're missing (or I'm not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don't know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don't know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we're doing now) that because we don't know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. Feel free to call me if that would be easier - 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we'd need to submit on SQ3 but we're on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ's and submitting test IOs, re-creating, etc., but those things I don't think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? 
The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we'd still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future so I'll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they're a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don't even store whether we found the APIC in phy or logical mode so during debug you don't really know). Con: SQ mem will be on a different core than the submitting thread but I don't believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? 
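As background for the "decompose the MSI address" method referred to earlier in the thread: per the x86 MSI format, the low message address is 0xFEExxxxx with the Destination ID in bits 19:12 and the destination mode (physical vs. logical) in bit 2, which is why the core mapping can be derived in physical mode but not in APIC logical mode. The snippet below is illustrative only and is not the actual NVMeMsixMapCores() code.

    /* Roughly what decomposing the MSI address means (illustration only). */
    #define MSI_ADDR_DEST_ID(addr)   (((addr) >> 12) & 0xFF)
    #define MSI_ADDR_DEST_MODE(addr) (((addr) >> 2)  & 0x1)   /* 0=phys, 1=logical */

    typedef struct _MSIX_ENTRY_INFO {   /* hypothetical copy of one MSI-X table entry */
        ULONG MessageAddressLow;
        ULONG MessageData;
    } MSIX_ENTRY_INFO;

    /* Returns the APIC ID the vector is steered to, or -1 if the mapping cannot
     * be derived (APIC logical mode), which is the failure case described above. */
    LONG ApicIdForVector(const MSIX_ENTRY_INFO *entry)
    {
        if (MSI_ADDR_DEST_MODE(entry->MessageAddressLow) != 0) {
            return -1;   /* logical mode: the 8 bits are a group mask, not a core */
        }
        return (LONG)MSI_ADDR_DEST_ID(entry->MessageAddressLow);
    }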
Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.deValois at sandisk.com Wed Feb 15 13:02:34 2012 From: Greg.deValois at sandisk.com (Greg de Valois) Date: Wed, 15 Feb 2012 13:02:34 -0800 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> Message-ID: I'm not understanding why it matters which core has been assigned which vector, as we have IO associated with a vector, put onto a queue associated with that vector, and the controller completes the IO using that vector. Why should the driver and controller care what the mapping of vectors to processors is? Thanks, Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 12:46 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I think the point you’re missing (or I’m not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don’t know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don’t know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we’re doing now) that because we don’t know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. Feel free to call me if that would be easier – 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? 
I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we’d need to submit on SQ3 but we’re on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ’s and submitting test IOs, re-creating, etc., but those things I don’t think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it’s a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention – we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we’d create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we’d still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn’t find the API below to be accurate all of the time, I can’t say from experience that I’ve seen this. That said, I will likely be proposing an alternate method in the near future so I’ll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and ‘learn and update’ the mapping table on the completion side. Would still avoid the storport API because I don’t think it adds value over the learned method and requires us to use the DPC steering option which I’ve witnessed to have unpredictable side effects. 
I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they’re a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple “one size fits all” solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don’t even store whether we found the APIC in phy or logical mode so during debug you don’t really know). Con: SQ mem will be on a different core than the submitting thread but I don’t believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Wed Feb 15 13:15:38 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Wed, 15 Feb 2012 21:15:38 +0000 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A056419@FMSMSX106.amr.corp.intel.com> The only difference between what we're doing and what you get with either the storport API or the 'learning method' I mention is the locality of the SQ memory when the IO is submitted. With what we're doing now, we know that the submission core is local to SQ mem. We don't have that guarantee otherwise. Its negligible I believe which is why I'm proposing we move to the learning mode (very small cod change, 've made it and tested). I prefer learning mode over the storport API because it works regardless of OS dependencies, APIC modes, etc. As an example of the tiny SQ mem difference, take this scenario Core 0: SQ0 and CQ0 are allocated local to core 0. Core 1: SQ1 and CQ1 are allocated local to core 1. Core 2: SQ2 and CQ2 are allocated local to core 2. 
In our current implementation if we get an IO coming in on Core 1: We submit it to SQ1 (local mem), when we created CQ we knew the vector so we also know that this IO will complete on CQ1 Using the storport API, if we get an IO coming in Core 1: There's no assurance that we'll be using SQ1. If the API tells us that the vector is 2, for example, we'd then choose to submit on SQ1 which is local to core 2, not core 1. On the completion side, everything is just as optimal. The IO completes on core 2 which both SQ/CQ for core2 are local to core 2 Does that help? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 2:03 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I'm not understanding why it matters which core has been assigned which vector, as we have IO associated with a vector, put onto a queue associated with that vector, and the controller completes the IO using that vector. Why should the driver and controller care what the mapping of vectors to processors is? Thanks, Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 12:46 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I think the point you're missing (or I'm not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don't know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don't know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we're doing now) that because we don't know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. Feel free to call me if that would be easier - 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. 
When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we'd need to submit on SQ3 but we're on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ's and submitting test IOs, re-creating, etc., but those things I don't think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we'd still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future so I'll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they're a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2 path solution even if one path is slightly optimized. 
It makes the driver more maintainable and gives us less variables in debug (right now we don't even store whether we found the APIC in phy or logical mode so during debug you don't really know). Con: SQ mem will be on a different core than the submitting thread but I don't believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Wed Feb 15 13:20:09 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Wed, 15 Feb 2012 21:20:09 +0000 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: <82C9F782B054C94B9FC04A331649C77A056419@FMSMSX106.amr.corp.intel.com> References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> <82C9F782B054C94B9FC04A331649C77A056419@FMSMSX106.amr.corp.intel.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A056466@FMSMSX106.amr.corp.intel.com> Ugh, small but important correction below From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Wednesday, February 15, 2012 2:16 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: Re: [nvmewin] Question on Processor to MSI vector mapping The only difference between what we're doing and what you get with either the storport API or the 'learning method' I mention is the locality of the SQ memory when the IO is submitted. With what we're doing now, we know that the submission core is local to SQ mem. We don't have that guarantee otherwise. Its negligible I believe which is why I'm proposing we move to the learning mode (very small cod change, 've made it and tested). I prefer learning mode over the storport API because it works regardless of OS dependencies, APIC modes, etc. As an example of the tiny SQ mem difference, take this scenario Core 0: SQ0 and CQ0 are allocated local to core 0. Core 1: SQ1 and CQ1 are allocated local to core 1. Core 2: SQ2 and CQ2 are allocated local to core 2. 
In our current implementation if we get an IO coming in on Core 1: We submit it to SQ1 (local mem), when we created CQ we knew the vector so we also know that this IO will complete on CQ1 Using the storport API, if we get an IO coming in Core 1: There's no assurance that we'll be using SQ1. If the API tells us that the vector is 2, for example, we'd then choose to submit on *SQ2* which is local to core 2, not core 1. On the completion side, everything is just as optimal. The IO completes on core 2 which both SQ/CQ for core2 are local to core 2 Does that help? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 2:03 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I'm not understanding why it matters which core has been assigned which vector, as we have IO associated with a vector, put onto a queue associated with that vector, and the controller completes the IO using that vector. Why should the driver and controller care what the mapping of vectors to processors is? Thanks, Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 12:46 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I think the point you're missing (or I'm not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don't know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don't know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we're doing now) that because we don't know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. Feel free to call me if that would be easier - 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. 
When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we'd need to submit on SQ3 but we're on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ's and submitting test IOs, re-creating, etc., but those things I don't think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we'd still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future so I'll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they're a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2 path solution even if one path is slightly optimized. 
It makes the driver more maintainable and gives us less variables in debug (right now we don't even store whether we found the APIC in phy or logical mode so during debug you don't really know). Con: SQ mem will be on a different core than the submitting thread but I don't believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.deValois at SanDisk.com Wed Feb 15 14:31:15 2012 From: Greg.deValois at SanDisk.com (Greg de Valois) Date: Wed, 15 Feb 2012 14:31:15 -0800 Subject: [nvmewin] Question on Processor to MSI vector mapping In-Reply-To: <82C9F782B054C94B9FC04A331649C77A056466@FMSMSX106.amr.corp.intel.com> References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> <82C9F782B054C94B9FC04A331649C77A056419@FMSMSX106.amr.corp.intel.com>, <82C9F782B054C94B9FC04A331649C77A056466@FMSMSX106.amr.corp.intel.com> Message-ID: Paul: Ok, what I wasn't getting is that you're allocating the queue pairs using SpecifyCache with the proc number. Got it. But if you think that having the SQ queue memory itself allocated close to the associated core doesn't matter so much (I would think it wouldn't, since the driver is accessing other memory that is not associated with that core either), now I don't know why you wouldn't then just use the StorPort API. Also, I'm not sure what you mean by the following: Would still avoid the storport API because I don’t think it adds value over the learned method and requires us to use the DPC steering option which I’ve witnessed to have unpredictable side effects. By "DPC steering" I assume you mean enabling the STOR_PERF_DPC_REDIRECTION flag for StorPortInitializePerfOpts, which I see the driver is doing, or are you referring to something else? But I understand what you're doing now with the core-vector mapping. Thanks for the explanation. 
Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 1:20 PM To: Luse, Paul E; Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Ugh, small but important correction below From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Wednesday, February 15, 2012 2:16 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: Re: [nvmewin] Question on Processor to MSI vector mapping The only difference between what we’re doing and what you get with either the storport API or the ‘learning method’ I mention is the locality of the SQ memory when the IO is submitted. With what we’re doing now, we know that the submission core is local to SQ mem. We don’t have that guarantee otherwise. Its negligible I believe which is why I’m proposing we move to the learning mode (very small cod change, ‘ve made it and tested). I prefer learning mode over the storport API because it works regardless of OS dependencies, APIC modes, etc. As an example of the tiny SQ mem difference, take this scenario Core 0: SQ0 and CQ0 are allocated local to core 0. Core 1: SQ1 and CQ1 are allocated local to core 1. Core 2: SQ2 and CQ2 are allocated local to core 2. In our current implementation if we get an IO coming in on Core 1: We submit it to SQ1 (local mem), when we created CQ we knew the vector so we also know that this IO will complete on CQ1 Using the storport API, if we get an IO coming in Core 1: There’s no assurance that we’ll be using SQ1. If the API tells us that the vector is 2, for example, we’d then choose to submit on *SQ2* which is local to core 2, not core 1. On the completion side, everything is just as optimal. The IO completes on core 2 which both SQ/CQ for core2 are local to core 2 Does that help? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 2:03 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I'm not understanding why it matters which core has been assigned which vector, as we have IO associated with a vector, put onto a queue associated with that vector, and the controller completes the IO using that vector. Why should the driver and controller care what the mapping of vectors to processors is? Thanks, Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 12:46 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I think the point you’re missing (or I’m not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don’t know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don’t know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we’re doing now) that because we don’t know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. 
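For readers who have not used it, this is roughly what the per-IO query under discussion looks like. It is a sketch based on the documented STARTIO_PERFORMANCE_PARAMETERS structure rather than on this driver's source, and the exact Version/Size initialization should be checked against storport.h.

    #include <storport.h>

    /* Ask storport which MSI message the completion for this request is
     * expected on. This is the per-IO information discussed above: it is
     * only available in HwStorStartIo, after the queue pairs (and their
     * CQ-to-vector assignments) have already been created. */
    static ULONG GetExpectedMessageNumber(PVOID HwDeviceExtension,
                                          PSCSI_REQUEST_BLOCK Srb)
    {
        STARTIO_PERFORMANCE_PARAMETERS perfParams;

        RtlZeroMemory(&perfParams, sizeof(perfParams));
        perfParams.Size = sizeof(STARTIO_PERFORMANCE_PARAMETERS);
        perfParams.Version = STOR_PERF_VERSION;  /* check storport.h for the value expected */

        if (StorPortGetStartIoPerfParams(HwDeviceExtension, Srb, &perfParams) ==
            STOR_STATUS_SUCCESS) {
            return perfParams.MessageNumber;  /* vector this IO should complete on */
        }

        return 0;  /* fall back to the driver's own core-to-queue mapping */
    }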
Feel free to call me if that would be easier – 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we’d need to submit on SQ3 but we’re on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ’s and submitting test IOs, re-creating, etc., but those things I don’t think are worth the complexity. Make sense? Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it’s a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention – we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we’d create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we’d still be optimizing the completion side. 
The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn’t find the API below to be accurate all of the time, I can’t say from experience that I’ve seen this. That said, I will likely be proposing an alternate method in the near future so I’ll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and ‘learn and update’ the mapping table on the completion side. Would still avoid the storport API because I don’t think it adds value over the learned method and requires us to use the DPC steering option which I’ve witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they’re a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple “one size fits all” solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don’t even store whether we found the APIC in phy or logical mode so during debug you don’t really know). Con: SQ mem will be on a different core than the submitting thread but I don’t believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... 
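For context, the DPC steering option referred to in this thread is the STOR_PERF_DPC_REDIRECTION flag negotiated through StorPortInitializePerfOpts. Below is a minimal sketch of the usual query-then-set pattern, shown only to make the discussion concrete; it is not the driver's FindAdapter code and it exercises only the one flag under discussion.

    #include <storport.h>

    /* Query the perf options storport supports, then opt in to DPC
     * redirection only if it is offered. Modeled on the documented
     * query-then-set usage of StorPortInitializePerfOpts. */
    static VOID ConfigurePerfOpts(PVOID HwDeviceExtension)
    {
        PERF_CONFIGURATION_DATA perfData;
        ULONG status;

        RtlZeroMemory(&perfData, sizeof(perfData));
        perfData.Version = STOR_PERF_VERSION;
        perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

        /* Query = TRUE: ask storport which flags it supports. */
        status = StorPortInitializePerfOpts(HwDeviceExtension, TRUE, &perfData);
        if (status != STOR_STATUS_SUCCESS) {
            return;  /* run without perf optimizations */
        }

        if (perfData.Flags & STOR_PERF_DPC_REDIRECTION) {
            /* Keep only the flag under discussion for this sketch; a real
             * driver would pick the full set of supported flags it wants. */
            perfData.Flags = STOR_PERF_DPC_REDIRECTION;
            (VOID)StorPortInitializePerfOpts(HwDeviceExtension, FALSE, &perfData);
        }
    }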
URL:

From Alex.Chang at idt.com Wed Feb 15 15:00:16 2012
From: Alex.Chang at idt.com (Chang, Alex)
Date: Wed, 15 Feb 2012 23:00:16 +0000
Subject: [nvmewin] patch reminder
In-Reply-To: <82C9F782B054C94B9FC04A331649C77A04D485@ORSMSX152.amr.corp.intel.com>
References: <82C9F782B054C94B9FC04A331649C77A04D485@ORSMSX152.amr.corp.intel.com>
Message-ID: <548C5470AAD9DA4A85D259B663190D3690F1@corpmail1.na.ads.idt.com>

Hi Paul and Ray,
I had merged the patch and have tested it for a week as well: drive formatting, busTrace, IOMeters, SCSI_Compliance and SDStress, etc. It works well. The only place I did not test is when the interrupt routing is in logical mode. I wonder if Paul has tested it or not.
Thanks,
Alex

________________________________
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E
Sent: Tuesday, February 14, 2012 9:19 AM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] patch reminder

All-
Although we're not on a strict timeline, I'd like to make sure patches don't sit for too long. This one is close to two weeks old; IDT & LSI, can you guys take a few minutes to review and let Ray know if it's good to go, or comment otherwise? Note that it's been running on a test machine (full speed with Chatham hw and busTRACE 32 thread data integrity) for over a week now w/no issues.
Thx
Paul

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E
Sent: Friday, February 03, 2012 11:19 AM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] ***UNCHECKED*** REVIEW REQUEST: patch for perf fix and other misc updates

Password for zip is nvme1234

Main changes: (based on tags\format_nvm_pt_ioctl)
1) Fix DPC watchdog timeout under heavy IO by adding a call to set the per-LUN storport queue depth following INQ
2) Added support for optional (compile switch) DPC or ISR completions. Defaulting to DPC as this is the 'standard' recommended method
3) Updated mode block descriptor creation to return all F's for # blocks if the namespace is too big to fit in the field (per SPC)
4) Changed logical mode to 1:1 map cores to MSIX vectors; not optimal for vector matching but better than sending all IO through one core, and we're covered in any scenario wrt protection on submit/complete
5) Pile of CHATHAM-only changes
6) Changed passiveInit to wait for the state machine to complete, based on lots of issues with us missing enum because we weren't ready and storport doesn't retry the early enum commands. Ran into this at Msft as well as UNH when using the Chatham in various platforms. Ray also got it with QEMU on his HW (different speed than mine)

Tested (2008-R2 with Chatham and Win7-64 with QEMU, with and without driver verifier):
- Sdstress
- SCSI compliance (write 10 fails, not clear why as trace shows no issue.
Fails with baseline code also; not related to these changes)
- BusTRACE scsi compliance
- BusTRACE data integrity
- Iometer all access specs, Q depth 32, 8 workers

Changes:
Nvme.inf
- Updated version
Nvmeinit.c
- Misc asserts added, some braces added here and there for readability
- NVMeMsixMapCores(): changes to support logical mode using all cores/all vectors 1:1 mapped
- Misc Chatham changes
- Compile switch for DPC or ISR
Nvmeio.c
- New assert
nvmePwrMgmt.c
- Chatham-only changes
nvmeSnti.c
- SntiTranslateCommand(): added adapter ext parm for use by API to set per-LUN Q depth, also set Q depth post INQ
- Bunch of Chatham changes
- SntiCreateModeParameterDescBlock(): added code to correctly fill in # blocks when we overflow
nvmeSnti.h
- Defines used by Q depth setting, function proto changes
nvmeStd.c
- DPC vs ISR compile switches
- PassiveInit waits on init state machine now
- Removed storport perf opt, has no effect based on our mapping
- Changed assert checking on vector/proc mapping so it doesn't affect the admin queue, is ignored for QEMU and for logical mode
- NVMeIsrMsix: fixed issue where shared mode would cause BSOD
- Added ISR completion support
- Chatham changes
nvmeStd.h
- Supporting struct changes
Sources
- New compile switches for ISR vs DPC and for QEMU

____________________________________
Paul Luse
Sr. Staff Engineer
PCG Server Software Engineering
Desk: 480.554.3688, Mobile: 480.334.4630

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From paul.e.luse at intel.com Wed Feb 15 15:33:25 2012
From: paul.e.luse at intel.com (Luse, Paul E)
Date: Wed, 15 Feb 2012 23:33:25 +0000
Subject: [nvmewin] Question on Processor to MSI vector mapping
In-Reply-To:
References: , <82C9F782B054C94B9FC04A331649C77A056026@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A05605E@FMSMSX106.amr.corp.intel.com> , <82C9F782B054C94B9FC04A331649C77A0563AF@FMSMSX106.amr.corp.intel.com> <82C9F782B054C94B9FC04A331649C77A056419@FMSMSX106.amr.corp.intel.com>, <82C9F782B054C94B9FC04A331649C77A056466@FMSMSX106.amr.corp.intel.com>
Message-ID: <82C9F782B054C94B9FC04A331649C77A056663@FMSMSX106.amr.corp.intel.com>

Cool... the reason I don't think having the SQ close to the submission core matters is that we write to that address (not read) and I believe those are posted writes. The real benefit from having our queue on the same core is on the completion side where we have to read the CQEs.

On the 2nd thing, yes I do mean that DPC flag but there's a patch pending to turn it off. It's not buying us anything regardless of whether we complete in the DPC or in an ISR; however, I did find that with it on, and with our per-LUN queue depths set incorrectly (fix is also part of a pending patch), our performance goes haywire under heavy load as somehow we're confusing storport to the point that only one core gets heavily used. The problem goes away when we properly set our queue depths, but it's just an example of storport interaction that we don't need if we use the learning method. I like being independent from as many other elements as I can wrt this vector matching thing; a new storport or a new OS method of programming the APIC, whatever, can't affect us with the learning method.
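For background on the queue-depth fix referenced above (item 1 of the patch notes: set the per-LUN storport queue depth after INQUIRY), a hedged sketch of the kind of call involved follows; the depth value and the helper name are placeholders, not values taken from the driver.

    #include <storport.h>

    /* After INQUIRY data has been returned for a namespace, tell storport
     * how many requests this LUN can really take so it throttles instead of
     * flooding the miniport (the flood is what produced the DPC watchdog
     * timeout mentioned in the patch notes). The depth is a placeholder. */
    static VOID SetLunQueueDepth(PVOID HwDeviceExtension,
                                 UCHAR PathId, UCHAR TargetId, UCHAR Lun)
    {
        ULONG depth = 64;  /* placeholder; a real driver derives this from its queue sizes */

        (VOID)StorPortSetDeviceQueueDepth(HwDeviceExtension,
                                          PathId, TargetId, Lun, depth);
    }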
One other thing I didn't mention was how vectors are mapped with multiple CPU packages, I've heard there are some issues with the API when you have several processors - they can be worked around but again the learning method won't be affected, no matter how the vectors are assigned to CPUs we'll get the mapping correct after 1 IO per CPU has completed and we can do it with very little overhead (likely less than incurred by using the storport APIs I would guess). Thx Paul From: Greg de Valois [mailto:Greg.deValois at SanDisk.com] Sent: Wednesday, February 15, 2012 3:31 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Ok, what I wasn't getting is that you're allocating the queue pairs using SpecifyCache with the proc number. Got it. But if you think that having the SQ queue memory itself allocated close to the associated core doesn't matter so much (I would think it wouldn't, since the driver is accessing other memory that is not associated with that core either), now I don't know why you wouldn't then just use the StorPort API. Also, I'm not sure what you mean by the following: Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. By "DPC steering" I assume you mean enabling the STOR_PERF_DPC_REDIRECTION flag for StorPortInitializePerfOpts, which I see the driver is doing, or are you referring to something else? But I understand what you're doing now with the core-vector mapping. Thanks for the explanation. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 1:20 PM To: Luse, Paul E; Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Ugh, small but important correction below From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Wednesday, February 15, 2012 2:16 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: Re: [nvmewin] Question on Processor to MSI vector mapping The only difference between what we're doing and what you get with either the storport API or the 'learning method' I mention is the locality of the SQ memory when the IO is submitted. With what we're doing now, we know that the submission core is local to SQ mem. We don't have that guarantee otherwise. Its negligible I believe which is why I'm proposing we move to the learning mode (very small cod change, 've made it and tested). I prefer learning mode over the storport API because it works regardless of OS dependencies, APIC modes, etc. As an example of the tiny SQ mem difference, take this scenario Core 0: SQ0 and CQ0 are allocated local to core 0. Core 1: SQ1 and CQ1 are allocated local to core 1. Core 2: SQ2 and CQ2 are allocated local to core 2. In our current implementation if we get an IO coming in on Core 1: We submit it to SQ1 (local mem), when we created CQ we knew the vector so we also know that this IO will complete on CQ1 Using the storport API, if we get an IO coming in Core 1: There's no assurance that we'll be using SQ1. If the API tells us that the vector is 2, for example, we'd then choose to submit on *SQ2* which is local to core 2, not core 1. On the completion side, everything is just as optimal. The IO completes on core 2 which both SQ/CQ for core2 are local to core 2 Does that help? 
Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 2:03 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I'm not understanding why it matters which core has been assigned which vector, as we have IO associated with a vector, put onto a queue associated with that vector, and the controller completes the IO using that vector. Why should the driver and controller care what the mapping of vectors to processors is? Thanks, Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 12:46 PM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping I think the point you're missing (or I'm not understanding) is when the driver creates the CQ it tells the device which MSIX vector to use for that CQ. We don't know that at the time we create the CQ. All we know when we create the CQ is the number of vectors and the vector numbers but we don't know which one is associated with which core. The API only provides that info on a per IO basis. So, everything you describe is correct however the downside is (not that big of a deal but as compared to what we're doing now) that because we don't know the mapping when we create the SQ/CQ pair we may be placing the IO on a different core than the SQ is associated with because the API told us about a vector that is associated with a different Q pair than that we setup for the current core. Feel free to call me if that would be easier - 480 554 3688 Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 1:40 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping My understanding is that Windows assigns the number of MSI vectors requested (if possible) by the INF to an adapter, each of which is associated with one of the available processors. The driver sets up a queue pair for each of those vectors, and when an IO comes, it is put on the queue associated with the vector that is returned from the API call. This vector is one that has been assigned to the current executing core. As long as that API works correctly, the driver doesn't have to do anything more to assure that the IO is submitted and completed on that vector. Is there some part of those statements that isn't correct? I'm a little confused here. Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 11:14 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Because we create the SQ and CQ before the first IO and we have to provide the CQ vector when created. So, for example lets say we have 3 cores and an SQ/CQ pair all numbered the same as the core #. When we create them we arbitrarily give the CQ a vector to complete on. When the first OI comes lets say its on core 1 and the stoport PA tells us the vector we should expect it on is 3, we'd need to submit on SQ3 but we're on core 1. There are ways around this of course, we could do some creative things in passiveInit wrt the creation of the CQ's and submitting test IOs, re-creating, etc., but those things I don't think are worth the complexity. Make sense? 
Thx Paul From: Greg de Valois [mailto:Greg.deValois at sandisk.com] Sent: Wednesday, February 15, 2012 12:06 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Paul: Thanks for the reply. I'm afraid I'm not quite following you: why does using the API provided by Storport to get the MSI vector associated with the current request imply that you're submitting on a different core than completing? The core you're executing on is the one assigned the vector that the API gives you, you put it on the submission queue that has been assigned to that vector, and it's completed on the completion queue for that vector as well. What am I missing here? Greg ________________________________ From: Luse, Paul E [paul.e.luse at intel.com] Sent: Wednesday, February 15, 2012 10:57 AM To: Greg de Valois; nvmewin at lists.openfabrics.org Subject: RE: Question on Processor to MSI vector mapping Hi Greg- Thanks for the question, it's a good one! The reasoning was mainly because it was fairly straightforward and had one benefit over the method you mention - we could assure NUMA optimization and vector matching for both SQ and CQ. There are many, many ways that one could approach this problem and we discussed a few as part of the dev of this driver and then individually discussed experiences with various methods with other dev teams at our respective companies. If one uses the method you mention below, we'd create our SQ/CQ per core NUMA optimized and then be submitting on a different core than we complete on however we'd still be optimizing the completion side. The more I thought about this the more I realize its actually not buying us much of anything over using the Msft API due to the effects of CPU cache and the fact that the SQ access on submission is a write and not a read. I also heard from various other folks that they didn't find the API below to be accurate all of the time, I can't say from experience that I've seen this. That said, I will likely be proposing an alternate method in the near future so I'll go ahead and propose it now since you brought up the subject: Proposal: no long decompose the MSI address to populate the mapping table. Instead, start off with a 1:1 mapping and 'learn and update' the mapping table on the completion side. Would still avoid the storport API because I don't think it adds value over the learned method and requires us to use the DPC steering option which I've witnessed to have unpredictable side effects. I do plan on following up with Msft (and have already had several internal discussions at Intel with other storport devs) on exactly how some of these optimizations within storport so we can better gauge whether they're a good fit for us or now. Pro: This will *always* work whereas the method we have no does not work for APIC logical mode. I prefer a simple "one size fits all" solution every time over a 2 path solution even if one path is slightly optimized. It makes the driver more maintainable and gives us less variables in debug (right now we don't even store whether we found the APIC in phy or logical mode so during debug you don't really know). Con: SQ mem will be on a different core than the submitting thread but I don't believe this is a measurable issue. Certainly can perform some experiments to check though Other thoughts? 
Thx Paul From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Greg de Valois Sent: Tuesday, February 14, 2012 5:02 PM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] Question on Processor to MSI vector mapping All: I'm wondering if anyone can explain to me the reasoning behind the processor to MSI vector translation that is being done by the driver, instead of using the vector returned from StorPortGetStartIoPerfParams for each IO? Are there cases where this doesn't work properly? Thanks, Greg de Valois SanDisk ________________________________ PLEASE NOTE: The information contained in this electronic mail message is intended only for the use of the designated recipient(s) named above. If the reader of this message is not the intended recipient, you are hereby notified that you have received this message in error and that any review, dissemination, distribution, or copying of this message is strictly prohibited. If you have received this communication in error, please notify the sender by telephone or e-mail (as shown above) immediately and destroy any and all copies of this message in your possession (whether hard copies or electronically stored copies). -------------- next part -------------- An HTML attachment was scrubbed... URL: From raymond.c.robles at intel.com Thu Feb 16 14:38:53 2012 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Thu, 16 Feb 2012 22:38:53 +0000 Subject: [nvmewin] nvmewin DB is locked - Merging performance fixes and misc. bug fix changes Message-ID: <49158E750348AA499168FD41D88983600676BB@FMSMSX106.amr.corp.intel.com> [Description: Description: Description: Description: cid:image001.png at 01CB3870.4BB88E70] Raymond C. Robles PSG Server Software Engineering Datacenter and Connected Systems Group Intel Corporation Office - 480.554.2600 | Mobile - 480.399.0645 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 1756 bytes Desc: image001.png URL: From raymond.c.robles at intel.com Thu Feb 16 16:46:41 2012 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Fri, 17 Feb 2012 00:46:41 +0000 Subject: [nvmewin] nvmewin DB is unlocked - Merging performance fixes and misc. bug fix changes Message-ID: <49158E750348AA499168FD41D88983600678E8@FMSMSX106.amr.corp.intel.com> The latest patch for the performance fix and misc. bug fixes has been applied to the trunk. There is also a new tag for this patch named "performance_fixes" under the tags directory. And as per our process, the latest and greatest remains on the trunk. Please update your source to reflect that latest changes to the baseline. As always, if you have any questions, please feel free to ask. Thanks, Ray [Description: Description: Description: Description: cid:image001.png at 01CB3870.4BB88E70] Raymond C. Robles PSG Server Software Engineering Datacenter and Connected Systems Group Intel Corporation Office - 480.554.2600 | Mobile - 480.399.0645 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.png Type: image/png Size: 1756 bytes Desc: image001.png URL: From paul.e.luse at intel.com Fri Feb 17 08:09:28 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Fri, 17 Feb 2012 16:09:28 +0000 Subject: [nvmewin] first release thoughts In-Reply-To: <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com> All- Both patches are now in and the Intel testing is done (as part of testing the last patch submission). The details of our last testing were provided previously but as a refresher: Tested (2008-R2 with Chatham and Win7-64 with QEMU, with and without driver verifier): - Sdstress - SCSI compliance (write 10 fails, not clear why as trace shows no issue. Fails with baseline code also, note related to these changes) - BusTRACE scsi compliance - BusTRACE data integrity - Iometer all access specs, Q depth 32 8 workers And since I have done some basic testing with Server8 8220 as well. Would like at a minimum to have LSI and IDT chime in with favorable testing on the latest plus, of course, any additional input either way from anyone. Once we have those inputs we'll do a build and post it. Alex had asked about signing the driver, at this point in time I think our best plan is to not sign the driver and basically provide the binary as a testing vehicle (and to help IHVs who may have issues testing from a driver they built, they can test with our binary as well). IHV's will need to build and sign the driver for their product. Let me know if anyone has any issues with that. I don't plan to call a meeting to bless this release but will set one up for later in Mar to discuss plans for our next release (what, when, why). Thanks Paul ________________________________ From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E Sent: Thursday, February 02, 2012 11:30 AM To: nvmewin at lists.openfabrics.org Subject: [nvmewin] first release thoughts I believe we agreed on 2 patches prior to first release. 1) Alex's Format PT IOCTL: I understand the review went well so suspect as soon as Rick & Ray take a look at the final patch that will go in 2) My performance & stability patch: I'll rebase and re-test once Alex's goes in and then send out for eview Once mine goes in, wanted to level set real quick that we would have at least(more are welcome) IDT, LSI and Intel run the following in their environments and/or on QEMU: - Iometer script - BusTRACE SCSI check and busTRACE data integrity (for those who have it) - Msft SCSI compliance - Msft sdstress All of these will be run in the same manner as we ran them before and we'll document what that means for everyone else before the release and post notes with the release. I don't want to post the tools though, folks can grab those on their own if they'd like. I suspect this will put our first release in mid to late Mar. I'll probably schedule a short call around then so we can all confirm that we're ready and review what it is that we're posting for our very first binary release Thanks! Paul ____________________________________ Paul Luse Sr. Staff Engineer PCG Server Software Engineering Desk: 480.554.3688, Mobile: 480.334.4630 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Alex.Chang at idt.com Mon Feb 27 11:02:23 2012 From: Alex.Chang at idt.com (Chang, Alex) Date: Mon, 27 Feb 2012 19:02:23 +0000 Subject: [nvmewin] INTx In-Reply-To: <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com> Message-ID: <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com> Hi Paul, I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead? Thanks, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From paul.e.luse at intel.com Mon Feb 27 11:29:31 2012 From: paul.e.luse at intel.com (Luse, Paul E) Date: Mon, 27 Feb 2012 19:29:31 +0000 Subject: [nvmewin] INTx In-Reply-To: <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com> Message-ID: <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com> Hi Alex, Yep, you need to grab/release the DpcLock instead of the per message lock when in INTx mode - good catch. Thx Paul From: Chang, Alex [mailto:Alex.Chang at idt.com] Sent: Monday, February 27, 2012 12:02 PM To: Luse, Paul E; nvmewin at lists.openfabrics.org Subject: INTx Hi Paul, I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead? Thanks, Alex -------------- next part -------------- An HTML attachment was scrubbed... URL: From Alex.Chang at idt.com Mon Feb 27 15:43:16 2012 From: Alex.Chang at idt.com (Chang, Alex) Date: Mon, 27 Feb 2012 23:43:16 +0000 Subject: [nvmewin] INTx In-Reply-To: <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com> References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com> <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com> <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com> <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com> Message-ID: <548C5470AAD9DA4A85D259B663190D36CF06@corpmail1.na.ads.idt.com> There is another bug in IoCompletionDpcRoutine. The Boolean variable "InterruptClaimed" needs to be reset to FALSE when checking next completion queues as below. Otherwise, it might write some bogus values to the Head Pointer of the queues. 
if (InterruptClaimed == TRUE) {
    /* Now update the Completion Head Pointer via Doorbell register */
    StorPortWriteRegisterUlong(pAE, pCQI->pCplHDBL, (ULONG)pCQI->CplQHeadPtr);
    InterruptClaimed = FALSE;
}

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Monday, February 27, 2012 11:30 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

Hi Alex,

Yep, you need to grab/release the DpcLock instead of the per message lock when in INTx mode - good catch.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 12:02 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: INTx

Hi Paul,

I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead?

Thanks,
Alex

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From paul.e.luse at intel.com  Tue Feb 28 05:25:19 2012
From: paul.e.luse at intel.com (Luse, Paul E)
Date: Tue, 28 Feb 2012 13:25:19 +0000
Subject: [nvmewin] INTx
In-Reply-To: <548C5470AAD9DA4A85D259B663190D36CF06@corpmail1.na.ads.idt.com>
References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CF06@corpmail1.na.ads.idt.com>
Message-ID: <82C9F782B054C94B9FC04A331649C77A05FE44@FMSMSX106.amr.corp.intel.com>

I agree. If you are preparing a patch for the INTx lock change, you can put this in there as well.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 4:43 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: INTx

There is another bug in IoCompletionDpcRoutine. The Boolean variable "InterruptClaimed" needs to be reset to FALSE when checking next completion queues as below. Otherwise, it might write some bogus values to the Head Pointer of the queues.

if (InterruptClaimed == TRUE) { /* Now update the Completion Head Pointer via Doorbell register */ StorPortWriteRegisterUlong(pAE, pCQI->pCplHDBL, (ULONG)pCQI->CplQHeadPtr); InterruptClaimed = FALSE; }

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Monday, February 27, 2012 11:30 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

Hi Alex,

Yep, you need to grab/release the DpcLock instead of the per message lock when in INTx mode - good catch.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 12:02 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: INTx

Hi Paul,

I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead?

Thanks,
Alex

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
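A minimal sketch of the lock handling discussed in this thread is shown below. It is not the nvmewin driver's actual code: the DEV_EXT structure, its MsiEnabled field, the routine name, and the use of SystemArgument1 as the message ID are placeholder assumptions for illustration. Only the StorPort calls themselves (StorPortAcquireMSISpinLock/StorPortReleaseMSISpinLock for MSI, StorPortAcquireSpinLock/StorPortReleaseSpinLock with DpcLock for INTx) are the real APIs referenced above.

#include <storport.h>

/*
 * Illustrative sketch only -- DEV_EXT, MsiEnabled and DpcLockSketch are
 * placeholders, not the nvmewin driver's definitions. The point is the lock
 * choice: the per-message MSI spin lock is only valid when MSI/MSI-X was
 * granted; in INTx mode the StorPort DpcLock must be used instead.
 */
typedef struct _DEV_EXT {
    BOOLEAN MsiEnabled;    /* TRUE when MSI/MSI-X was granted, FALSE for INTx */
} DEV_EXT, *PDEV_EXT;

VOID DpcLockSketch(PSTOR_DPC pDpc, PVOID pHwDeviceExtension,
                   PVOID SystemArgument1, PVOID SystemArgument2)
{
    PDEV_EXT pAE = (PDEV_EXT)pHwDeviceExtension;
    ULONG MsgID = (ULONG)(ULONG_PTR)SystemArgument1;   /* message ID handed to the DPC */
    STOR_LOCK_HANDLE LockHandle = { 0 };
    ULONG OldIrql = 0;

    UNREFERENCED_PARAMETER(SystemArgument2);

    if (pAE->MsiEnabled == TRUE) {
        /* MSI/MSI-X: serialize with the ISR for this message/vector only */
        StorPortAcquireMSISpinLock(pAE, MsgID, &OldIrql);
    } else {
        /* INTx: no per-message lock exists; use the DpcLock instead */
        StorPortAcquireSpinLock(pAE, DpcLock, pDpc, &LockHandle);
    }

    /* ... process the completion queue(s) owned by this DPC here ... */

    if (pAE->MsiEnabled == TRUE) {
        StorPortReleaseMSISpinLock(pAE, MsgID, OldIrql);
    } else {
        StorPortReleaseSpinLock(pAE, &LockHandle);
    }
}

With INTx there is only a single interrupt line, so all completion queues end up being drained under the one DpcLock; with MSI/MSI-X each message has its own lock, which is why the MsgID matters in the first branch.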
From Alex.Chang at idt.com  Tue Feb 28 09:07:28 2012
From: Alex.Chang at idt.com (Chang, Alex)
Date: Tue, 28 Feb 2012 17:07:28 +0000
Subject: [nvmewin] INTx
In-Reply-To: <82C9F782B054C94B9FC04A331649C77A05FE44@FMSMSX106.amr.corp.intel.com>
References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CF06@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05FE44@FMSMSX106.amr.corp.intel.com>
Message-ID: <548C5470AAD9DA4A85D259B663190D36CF76@corpmail1.na.ads.idt.com>

Thanks, Paul. I will do more testing before requesting a patch that includes both changes.

Alex

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Tuesday, February 28, 2012 5:25 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

I agree, if you are preparing a patch for the INTx lock change you can put this in there as well.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 4:43 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: INTx

There is another bug in IoCompletionDpcRoutine. The Boolean variable "InterruptClaimed" needs to be reset to FALSE when checking next completion queues as below. Otherwise, it might write some bogus values to the Head Pointer of the queues.

if (InterruptClaimed == TRUE) { /* Now update the Completion Head Pointer via Doorbell register */ StorPortWriteRegisterUlong(pAE, pCQI->pCplHDBL, (ULONG)pCQI->CplQHeadPtr); InterruptClaimed = FALSE; }

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Monday, February 27, 2012 11:30 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

Hi Alex,

Yep, you need to grab/release the DpcLock instead of the per message lock when in INTx mode - good catch.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 12:02 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: INTx

Hi Paul,

I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead?

Thanks,
Alex

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From raymond.c.robles at intel.com  Tue Feb 28 09:26:59 2012
From: raymond.c.robles at intel.com (Robles, Raymond C)
Date: Tue, 28 Feb 2012 17:26:59 +0000
Subject: [nvmewin] INTx
In-Reply-To: <548C5470AAD9DA4A85D259B663190D36CF76@corpmail1.na.ads.idt.com>
References: <82C9F782B054C94B9FC04A331649C77A039820@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D3689ED@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05854F@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CE9B@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05F5CF@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CF06@corpmail1.na.ads.idt.com>
    <82C9F782B054C94B9FC04A331649C77A05FE44@FMSMSX106.amr.corp.intel.com>
    <548C5470AAD9DA4A85D259B663190D36CF76@corpmail1.na.ads.idt.com>
Message-ID: <49158E750348AA499168FD41D889836007645A@FMSMSX105.amr.corp.intel.com>

On the topic of the next release, did we have a tentative date in mind?

Thanks,
Ray

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Chang, Alex
Sent: Tuesday, February 28, 2012 10:07 AM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] INTx

Thanks, Paul. I will do more testing before requesting a patch that includes both changes.

Alex

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Tuesday, February 28, 2012 5:25 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

I agree, if you are preparing a patch for the INTx lock change you can put this in there as well.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 4:43 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: INTx

There is another bug in IoCompletionDpcRoutine. The Boolean variable "InterruptClaimed" needs to be reset to FALSE when checking next completion queues as below. Otherwise, it might write some bogus values to the Head Pointer of the queues.

if (InterruptClaimed == TRUE) { /* Now update the Completion Head Pointer via Doorbell register */ StorPortWriteRegisterUlong(pAE, pCQI->pCplHDBL, (ULONG)pCQI->CplQHeadPtr); InterruptClaimed = FALSE; }

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Monday, February 27, 2012 11:30 AM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: INTx

Hi Alex,

Yep, you need to grab/release the DpcLock instead of the per message lock when in INTx mode - good catch.

Thx
Paul

From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, February 27, 2012 12:02 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: INTx

Hi Paul,

I believe the newly added DPC scheme crashes the system when using INTx. In IoCompletionDpcRoutine, the driver calls StorPortAcquireMSISpinLock with a specific MsgID causes blue screen. I should have tested it earlier. Should we call StorPortAcquireSpinLock instead?

Thanks,
Alex

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
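To put the InterruptClaimed fix quoted in this thread into context, a simplified sketch of the per-queue loop follows. Only pCplHDBL, CplQHeadPtr and the StorPortWriteRegisterUlong call come from the snippet above; CPL_QUEUE_INFO, its other fields, and the ProcessCplQueueEntries helper are hypothetical stand-ins for the driver's real structures, not the actual nvmewin code.

#include <storport.h>

/* Hypothetical queue bookkeeping -- only pCplHDBL and CplQHeadPtr mirror the
 * snippet quoted in the thread; the rest is a stand-in for illustration. */
typedef struct _CPL_QUEUE_INFO {
    PULONG pCplHDBL;                  /* completion queue head doorbell register */
    ULONG  CplQHeadPtr;               /* software copy of the CQ head pointer */
    struct _CPL_QUEUE_INFO *pNext;    /* next completion queue to examine */
} CPL_QUEUE_INFO, *PCPL_QUEUE_INFO;

/* Stub for illustration: the real routine consumes new completion entries,
 * advances pCQI->CplQHeadPtr and returns TRUE if any entries were found. */
static BOOLEAN ProcessCplQueueEntries(PVOID pAE, PCPL_QUEUE_INFO pCQI)
{
    UNREFERENCED_PARAMETER(pAE);
    UNREFERENCED_PARAMETER(pCQI);
    return FALSE;
}

static VOID DrainCompletionQueues(PVOID pAE, PCPL_QUEUE_INFO pFirstCQI)
{
    PCPL_QUEUE_INFO pCQI;
    BOOLEAN InterruptClaimed = FALSE;

    for (pCQI = pFirstCQI; pCQI != NULL; pCQI = pCQI->pNext) {
        if (ProcessCplQueueEntries(pAE, pCQI) == TRUE) {
            InterruptClaimed = TRUE;
        }

        if (InterruptClaimed == TRUE) {
            /* Now update the Completion Head Pointer via Doorbell register */
            StorPortWriteRegisterUlong(pAE, pCQI->pCplHDBL, (ULONG)pCQI->CplQHeadPtr);
            /* Reset before moving on: without this, the flag stays TRUE and the
             * next queue's doorbell would be written even if that queue had no
             * new completions -- the bug described in the thread. */
            InterruptClaimed = FALSE;
        }
    }
}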