[nvmewin] review comments for patch from Google

Tue Jun 14 13:21:27 PDT 2016

Crash dump mode is single threaded and no interrupts, so synchronization between submission and completion side is not necessary. We poll for completions in the crash dump mode.

From: Iuliu Rus [mailto:iuliur at google.com]
Sent: Tuesday, June 14, 2016 12:13 PM
To: Thomas Freeman <thomas.freeman at hgst.com>
Cc: Robles, Raymond C <raymond.c.robles at intel.com>; nvmewin <nvmewin at lists.openfabrics.org>
Subject: Re: [nvmewin] review comments for patch from Google

Actually looking at it some more, it looks like that's needed in the dump mode.

On Tue, Jun 14, 2016 at 12:01 PM, Thomas Freeman <thomas.freeman at hgst.com<mailto:thomas.freeman at hgst.com>> wrote:
I agree.
Thanks,
Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e.  Thomas.freeman at hgst.com<mailto:Thomas.freeman at hgst.com>
o.  +1-507-322-2311<tel:%2B1-507-322-2311>

[cid:image001.jpg at 01D1C63F.A6983FD0]

From: Iuliu Rus [mailto:iuliur at google.com<mailto:iuliur at google.com>]
Sent: Tuesday, June 14, 2016 1:25 PM

To: Thomas Freeman <thomas.freeman at hgst.com<mailto:thomas.freeman at hgst.com>>
Cc: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>; nvmewin <nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>>
Subject: Re: [nvmewin] review comments for patch from Google

I don't think it's needed. The IOCompletionRoutine always taking the dpc lock should be enough. Unless i'm missing something... If you agree I'll add these to the patch.

On Tue, Jun 14, 2016 at 11:08 AM, Thomas Freeman <thomas.freeman at hgst.com<mailto:thomas.freeman at hgst.com>> wrote:
I think that is a good option.

I wonder at this point, is there any need for the StartIo lock?
Along with RecoveryDpcRoutine, I see the Startio lock is acquired in ProcessIo (when Acquire lock flag is set) and IoCompletionRoutine (when MultipleCoresToSingleQueueFlag is set).
In this case, multiple IoCompletionRoutine DPC’s will be forced to operate serially when IoCompletionRoutine is acquiring the StartIo lock.

Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e.  Thomas.freeman at hgst.com<mailto:Thomas.freeman at hgst.com>
o.  +1-507-322-2311<tel:%2B1-507-322-2311>

[cid:image002.jpg at 01D1C63F.A6983FD0]

From: Iuliu Rus [mailto:iuliur at google.com<mailto:iuliur at google.com>]
Sent: Tuesday, June 14, 2016 12:12 PM
To: Thomas Freeman <thomas.freeman at hgst.com<mailto:thomas.freeman at hgst.com>>
Cc: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>; nvmewin <nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>>

Subject: Re: [nvmewin] review comments for patch from Google

I think we can fix the recovery dpc routine by instead taking every dpc lock in addition or instead the startio lock. Do you see any problems with that?

On Tue, Jun 14, 2016 at 7:41 AM, Thomas Freeman <thomas.freeman at hgst.com<mailto:thomas.freeman at hgst.com>> wrote:
Hi Iuliu,
While looking at this change, it caused me to review other code paths that utilize the StartIo lock.

That brought up 2 concerns
1. The routine recoveryDpcRoutine gets the Startio lock to ensure no new sub/cpl queue entries are processed on any CPU. With the changes for concurrent channels, that synchronization no longer occurs.

2. The second concern relates to an issue that existed before your change - There is a path for the driver to deadlock on attempting to get the StartIo lock.

An example of the deadlock occurs when a command is issued to delete an attached namespace AND MultipleCoresToSingleQueueFlag is set to TRUE.
In this example, when the namespace delete is issued, here are the steps leading up to the deadlock:
-Driver issues command Namespace Attachment (SEL = 1, controller detach)
-Command completion is processed by IoCompletionRoutine. In that method the driver acquires the StartIo lock (if MultipleCoresToSingleQueueFlag is set)
-IoCompletionRoutine calls the completion method - pNvmeCompletionRoutine. That eventually ends up in NVMeCompletionNSAttachment.
-NVMeCompletionNSAttachment proceeds to the next step, building the Namespace management command (SEL = 1, delete).
-NVMeCompletionNSAttachment calls ProcessIo (with AcquireLock = TRUE).
-The first step in ProcessIo is to acquire the StartIo lock, which is already held by IoCompletionRoutine.

I'm not sure what value is provided by having ProcessIo getting the StartIo lock. Removing that step in ProcessIo would certainly solve this deadlock. I don't see it causing other problems. Maybe some others get express an opinion on this.

Regards,
Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e.  Thomas.freeman at hgst.com<mailto:Thomas.freeman at hgst.com>
o.  +1-507-322-2311<tel:%2B1-507-322-2311>

[cid:image003.jpg at 01D1C63F.A6983FD0]

From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org<mailto:nvmewin-bounces at lists.openfabrics.org>] On Behalf Of Iuliu Rus
Sent: Friday, June 10, 2016 2:31 PM
To: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>
Cc: nvmewin <nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>>

Subject: Re: [nvmewin] review comments for patch from Google

Our tests are still running and i'll be in meetings for the rest of the day, so i'm sending you my patch now so you can evaluate and review.

On Fri, Jun 10, 2016 at 10:24 AM, Iuliu Rus <iuliur at google.com<mailto:iuliur at google.com>> wrote:
I'm just running through the tests and will send it to you as soon as they pass.

On Fri, Jun 10, 2016 at 10:22 AM, Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>> wrote:
Hi Iuliu,

If you send out the updated patch with the protection in place for both list entries and doorbells, I can run it on an Intel NVMe SSD (P3700) and see if there is any performance impact (up or down).

Thank you…
Ray

From: Iuliu Rus [mailto:iuliur at google.com<mailto:iuliur at google.com>]
Sent: Friday, June 10, 2016 10:17 AM
To: suman.p at samsung.com<mailto:suman.p at samsung.com>
Cc: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>; Judy Brock <judy.brock at samsung.com<mailto:judy.brock at samsung.com>>; nvmewin <nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>>
Subject: Re: RE: review comments for patch from Google

we are running this in a virtual machine, and we do see a perf difference with concurrent channels. We did not observe any performance drop with synchronized access to the list entry.

On Fri, Jun 10, 2016 at 7:15 AM, SUMAN PRAKASH B <suman.p at samsung.com<mailto:suman.p at samsung.com>> wrote:

Hi Iuliu,

As you have the implementation in place, can you let us know if there is any performance improvement when STOR_PERF_CONCURRENT_CHANNELS is set in case of #core and #queues not matching.

Also, if there is any drop in performance if the list entry and doorbell is protected after setting STOR_PERF_CONCURRENT_CHANNELS.

Perhaps a bit off topic:

Even in case when #cores and #queues are same and when STOR_PERF_CONCURRENT_CHANNELS is set, we did not observe any performance gain. Maybe we are hitting the max performance from device even without setting STOR_PERF_CONCURRENT_CHANNELS. Can you please let us know your observation?

Thanks,

Suman

------- Original Message -------

Sender : Robles, Raymond C<raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>

Date : Jun 10, 2016 08:30 (GMT+05:30)

Title : RE: review comments for patch from Google

Hi Judy,

Thanks for the additional info. I agree with the assessment. My statement was directed at the general direction of the solution… and the fact that I think option 1 is not as robust as option 2.

Thank you…
Ray

From: Judy Brock [mailto:judy.brock at samsung.com<mailto:judy.brock at samsung.com>]
Sent: Thursday, June 09, 2016 5:14 PM
To: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>; Iuliu Rus <iuliur at google.com<mailto:iuliur at google.com>>; suman.p at samsung.com<mailto:suman.p at samsung.com>
Cc: nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>
Subject: RE: review comments for patch from Google

Hi Ray et al,
[Suman wrote] >> Protect the listentry and doorbell before access/update
                             >>Do not set STOR_PERF_CONCURRENT_CHANNELS when number of cores and queues are not same.

Ray wrote] >>In the case Suman points out, concurrent channels should still be enabled even if the number of cores and queues do not match. This is especially true for large core count systems (>128 logical cpu cores) but devices may only support 32 queues or less

I think Suman was saying, we have two options:

1.       either don’t enable it at all if # cores != #queues or

2.       protect both the list entries and doorbell writes before any accesses/updates if #cores != #queues.

So we will need to check to see if we need to use locks or not in the performance path based on whether #cores and # queues DO match or not if we want to enable concurrent channels no matter what.   It’s probably better than option 1) above if this is a configuration that can come up frequently.

Thanks,
Judy

From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Robles, Raymond C
Sent: Thursday, June 09, 2016 2:32 PM
To: Iuliu Rus; suman.p at samsung.com<mailto:suman.p at samsung.com>
Cc: nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>
Subject: Re: [nvmewin] review comments for patch from Google

Hi Iuliu/Suman,

I’m of the opinion that any solution provided should be as robust as possible. In the case Suman points out, concurrent channels should still be enabled even if the number of cores and queues do not match. This is especially true for large core count systems (>128 logical cpu cores) but devices may only support 32 queues or less.

Thank you…
Ray

From: Iuliu Rus [mailto:iuliur at google.com]<mailto:[mailto:iuliur at google.com]>
Sent: Thursday, June 09, 2016 2:09 PM
To: suman.p at samsung.com<mailto:suman.p at samsung.com>
Cc: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>; nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>
Subject: Re: review comments for patch from Google

Which one do you prefer? Our google driver has the synchronization in place already for a different reason. Or, I can follow your second suggestion.

On Thu, Jun 9, 2016 at 7:39 AM, SUMAN PRAKASH B <suman.p at samsung.com<mailto:suman.p at samsung.com>> wrote:

Hi Iuliu,

Please find my review comments below:

1. If the number of Cores and Queues supported by device are not same, then setting STOR_PERF_CONCURRENT_CHANNELS will corrupt the submission queue. For example, if device has 4 queues and connected to target PC which has 8 logical processors, and if STOR_PERF_CONCURRENT_CHANNELS is set, all the 8 logical processors will get startIo calls concurrently. As per OFA driver NUMA design, the IO Queue 1 will be shared by core 0 and core 4. There is possibility that core 0 and core 4 will get startio() calls concurrently and will access/update the listenty and queue 1 doorbell register at same time, which will lead to listentry and doorbell register corruption.
This problem can be address by following ways:
a. Protect the listentry and doorbell before access/update.
b. Do not set STOR_PERF_CONCURRENT_CHANNELS when number of cores and queues are not same.

2. We can use pRMT->NumActiveCores which is updated in NVMeEnumNumaCores() instead of using KeQueryActiveProcessorCountEx/KeQueryActiveProcessorCount.

Please let us know your opinion.

Thanks,
Suman

------- Original Message -------

Sender : Robles, Raymond C<raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>

Date : Jun 09, 2016 03:56 (GMT+05:30)

Title : Re: [nvmewin] Happy New Year... and status update

Thank you Iuliu!

Reviewing companies, please try to target the review completion for 2 weeks from today (6/22/16).

Thank you!
Ray

From: Iuliu Rus [mailto:iuliur at google.com<mailto:iuliur at google.com>]
Sent: Wednesday, June 08, 2016 12:26 PM
To: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>
Cc: nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>
Subject: Re: Happy New Year... and status update

I attached our patch.

On Tue, Jun 7, 2016 at 2:37 PM, Iuliu Rus <iuliur at google.com<mailto:iuliur at google.com>> wrote:
Let me see if i can get it done by tomorrow eod...

On Tue, Jun 7, 2016 at 2:31 PM, Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>> wrote:
Hi Iuliu,

That’s great… thank you. Yes, the process is still the same (using SVN patches).

Please let me know if you have any questions/comments. My only question for you is an ETA for the patch to be submitted.

Thank you…
Ray

From: Iuliu Rus [mailto:iuliur at google.com<mailto:iuliur at google.com>]
Sent: Tuesday, June 07, 2016 2:28 PM
To: Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>>
Cc: nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>
Subject: Re: Happy New Year... and status update

Yes we can send that to you. Is the process still the same?

On Tue, Jun 7, 2016 at 1:25 PM, Robles, Raymond C <raymond.c.robles at intel.com<mailto:raymond.c.robles at intel.com>> wrote:
Hello Iuliu,

As the OFA community is closing on its mid-year release, we have only one more patch that was targeted… concurrent channels. Will Google still be submitting the concurrent channels patch?

Thank you…
Ray

From: nvmewin-bounces at lists.openfabrics.org<mailto:nvmewin-bounces at lists.openfabrics.org> [mailto:nvmewin-bounces at lists.openfabrics.org<mailto:nvmewin-bounces at lists.openfabrics.org>] On Behalf Of Robles, Raymond C
Sent: Thursday, January 14, 2016 4:55 PM
To: 'nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>' <nvmewin at lists.openfabrics.org<mailto:nvmewin at lists.openfabrics.org>>
Subject: [nvmewin] Happy New Year... and status update

Hello and Happy New Year,

I hope everyone had a great holiday season and is off to a great start to the new year!

As communicated last month, the patch from HGST for the SCSI multi-initiator changes has been approved and pushed. The holidays slowed down much of the progress on the OFA driver and there were several patches that did not get pushed prior to the end of the year. The list of patches remained to be pushed are as follows…

•         Namespace Management (Intel)

•         Perf Opts (Samsung)

•         Win 8.1 Timers (Samsung)

•         EOL Read Only (Samsung)

•         Concurrent channels (Google)

The namespace management patch will be sent out for review tomorrow (look for the patch email from Carolyn)… stay tuned!!!

However, once this patch is resolved, we as a community will have to make a decision on an official release strategy. The patch process and cadence was significantly slower in 2015 which leaves us with a few options.

1.       Release what is in the trunk today (or after the namespace management patch)… and call that the 2015 release (albeit later than expected)

2.       In lieu of an official 2015 release, we push the remaining patches listed above… and then release in ~Q2 of 2015. Basically skip a 2015 release and go right into the mid-2016 release.

3.       Remove the concept of “official releases” from the OFA Windows NVMe driver and just allow people, companies, and users to pull from the OFA trunk as necessary.

For #3 above, my thoughts are that because the OFA driver is not a production driver… but a reference and enabling driver, it should/could be managed as just that… a reference baseline driver that any potential user can go and grab the source, or contribute a patch. Nothing more… nothing less.

For the release decision, I’ll be happy to call a meeting… but we can also handle it via email as well… just let me know. Feedback from all is welcome… but I would request mandatory feedback form the 4 reviewing companies: Samsung, HGST, PMC-Sierra, and Intel. How would you like to proceed?

Thanks,
Ray

[cid:image001.png at 01CB3870.4BB88E70]
Raymond C. Robles
Non-Volatile Memory Solutions Group
Intel Corporation
Office: 480-554-2600<tel:480-554-2600>
Mobile: 480-399-0645<tel:480-399-0645>

[cid:image005.gif at 01D1C63F.A6983FD0]

[Image removed by sender.]

[cid:image005.gif at 01D1C63F.A6983FD0]

[Image removed by sender.]

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:
This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.jpg
Type: image/jpeg
Size: 2934 bytes
Desc: image001.jpg
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.jpg
Type: image/jpeg
Size: 2937 bytes
Desc: image002.jpg
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment-0001.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.jpg
Type: image/jpeg
Size: 2938 bytes
Desc: image003.jpg
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment-0002.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 1756 bytes
Desc: image004.png
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image005.gif
Type: image/gif
Size: 13168 bytes
Desc: image005.gif
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment.gif>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image006.jpg
Type: image/jpeg
Size: 823 bytes
Desc: image006.jpg
URL: <http://lists.openfabrics.org/pipermail/nvmewin/attachments/20160614/ddb13771/attachment-0003.jpg>