[nvmewin] Patch with changes for Optimizing disk initialization performance

Foster, Carolyn D carolyn.d.foster at intel.com
Tue Apr 12 16:09:26 PDT 2016


Hi Suman,
Intel approves the changes as well.

Thank you,
Carolyn

From: Thomas Freeman [mailto:thomas.freeman at hgst.com]
Sent: Tuesday, April 12, 2016 3:18 PM
To: suman.p at samsung.com; Foster, Carolyn D <carolyn.d.foster at intel.com>; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; ANSHUL SHARMA <anshul at samsung.com>; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com; prakash.v at samsung.com
Subject: RE: Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi Suman,
HGST approves these changes.
Thank you,

Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e.  Thomas.freeman at hgst.com
o.  +1-507-322-2311


From: SUMAN PRAKASH B [mailto:suman.p at samsung.com]
Sent: Tuesday, April 12, 2016 8:24 AM
To: Foster, Carolyn D <carolyn.d.foster at intel.com>; Thomas Freeman <thomas.freeman at hgst.com>; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; ANSHUL SHARMA <anshul at samsung.com>; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com; prakash.v at samsung.com
Subject: Re: Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi All,



I am sending the updated patch incorporating feedback from Carolyn. The changes are listed below. The password is samsungnvme



1. In NVMeRunningWaitOnLearnMapping(), if there are zero namespaces it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can go directly to NVMeStartComplete.
2. In NVMeRunningWaitOnNamespaceReady(), for crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.
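A minimal sketch of the two transitions (the state names come from the patch description; the other identifiers are approximations, not verbatim OFA code):

    /* 1. NVMeRunningWaitOnLearnMapping(): with no namespaces there is
     * nothing to wait for, so finish the state machine directly. */
    if (visibleNamespaceCount == 0) {
        pAE->DriverState.NextDriverState = NVMeStartComplete;
    } else {
        pAE->DriverState.NextDriverState = NVMeWaitOnNamespaceReady;
    }

    /* 2. NVMeRunningWaitOnNamespaceReady(): skip the wait entirely in
     * crash/hibernate mode. */
    if (crashOrHibernateMode) {
        pAE->DriverState.NextDriverState = NVMeStartComplete;
    }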



Please let me know if you have any questions.



Thanks,
Suman



------- Original Message -------

Sender : SUMAN PRAKASH B <suman.p at samsung.com>, Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics

Date : Apr 11, 2016 21:22 (GMT+05:30)

Title : Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance



Hi Carolyn,



Thanks for testing our patch.



We tried the following:

a. We measured sequential read performance with a single worker and single queue depth on a system with 32 cores, and did not observe a performance drop compared to R133.

b. On the assumption that the 10% drop was caused by IOs being scattered across multiple NUMA nodes and the resulting remote memory access, we affinitized the performance tool to NUMA node 1, assuming the IO queue is created on NUMA node 0. After affinitization the application submits IO on node 1 and the driver processes the IOs on node 0. Even then, we did not observe any performance drop.



As there are no major issues here, we will incorporate the following review comments and share the revised patch tomorrow:

a. nvmeStat.c @ line 784: if there are zero namespaces, it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can go directly to NVMeStartComplete.

b. nvmeStat.c @ line 899: In crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.



Thanks,

Suman



------- Original Message -------

Sender : Foster, Carolyn D <carolyn.d.foster at intel.com>

Date : Apr 08, 2016 02:42 (GMT+05:30)

Title : RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi Suman,
I have an update on our performance data. It appears that on systems with more than 32 and with more than 64 cores there was no performance delta in the following workloads:
4 workers, 32 Queue Depth
8 workers, 16 Queue Depth

We did observe what looks like a more consistent (and much smaller) 10% drop on a workload with a single worker and single queue depth, on systems with 32 cores.

It seems that our initial results might have been flawed; based on your comments and performance analysis, there may be no major issue here.

Carolyn

From: Foster, Carolyn D
Sent: Wednesday, April 06, 2016 2:54 PM
To: 'suman.p at samsung.com' <suman.p at samsung.com>; Thomas Freeman <thomas.freeman at hgst.com>; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; ANSHUL SHARMA <anshul at samsung.com>; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com
Subject: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi Suman, thank you for the clarification. I will confirm the rest of the workload details and have that information for you tomorrow. In the meantime I will also rerun our performance tests to confirm that the results are reproducible, and will run the tests without line 594 as you suggested.
Carolyn
From: SUMAN PRAKASH B [mailto:suman.p at samsung.com]
Sent: Wednesday, April 06, 2016 9:01 AM
To: Foster, Carolyn D <carolyn.d.foster at intel.com>; Thomas Freeman <thomas.freeman at hgst.com>; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; ANSHUL SHARMA <anshul at samsung.com>; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com
Subject: Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi Carolyn,



Thanks for the comments and suggestions. Please find my comments below:



1. Observed performance degradation, potentially related to line 594 in nvmeInit.c
a. We tested on servers with 32 and 64 logical processors, Windows 8.1 and 2012 R2 OS, and multiple vendor devices with both 1-to-1 and many-to-1 core-to-queue mapping, using both the R133 and the latest drivers, and we did not observe any performance drop for 100% sequential read (128K) with 32 and 64 worker threads and queue depth 32. We have tested both the StorPortInitializePerfOpts() pass case and the fail case (where learning cores is executed).

b. Regarding MsgID-- at nvmeInit.c line 594: an NVMe device supports a number of message IDs equal to 1 admin + N IO queues. But in the OFA driver, since message ID 0 is shared between the admin queue and an IO queue, only ((1 admin + N IO queues) - 1) message IDs are ever used. With MsgID--, we make sure all IO queues are created with unique message IDs and that message ID 0 is shared between the admin queue and one IO queue. If the device you are testing supports a total number of message IDs equal to 1 admin queue + N IO queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, could you please rerun the performance benchmark after removing MsgID--?

c. Can you please let us know the number of queues and the number of message IDs supported by the target device that you are testing with?

d. On servers, we usually test with Windows Server edition OSes. When we tested with Windows 8.1, we observed that it supports a maximum of 32 logical processors, even when the server has more than 32 logical processors.

We will try to reproduce the performance degradation behavior; meanwhile, could you please provide us with more debug data?



2. nvmeStat.c @ line 784 : Agreed. We will change as suggested.



3. nvmeStat.c @line 899 : Agreed. We will change as suggested.



Thanks,
Suman



------- Original Message -------

Sender : Foster, Carolyn D <carolyn.d.foster at intel.com>

Date : Apr 06, 2016 05:33 (GMT+05:30)

Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance



Hi Suman,



I have a few comments and suggestions:

1.       Observed performance degradation, potentially related to line 594 in nvmeInit.c – We noticed a degradation in performance on some systems and we suspect it’s related to this change. If we don’t share MSI-X vector 0 between the admin queue and an IO queue, we are creating one fewer queue to submit IO to. Did you execute any performance testing before and after these changes? I have included some details about the system configurations we tested and the observed results below.

2.       nvmeStat.c @ line 784: if there are zero namespaces, it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can go directly to NVMeStartComplete.

3.       nvmeStat.c @ line 899: In crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.





Performance configuration and data:



OS: Windows 8.1 x64

Workload: 100% sequential Read

Compared the OFA trunk to the Samsung patch



Summary of observed results:

- System with fewer than 32 logical CPU cores: no delta in performance observed
- System with between 32 and 64 cores: 20%-50% drop in performance observed
- System with more than 64 cores: 30%-40% drop in performance observed



Carolyn


From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B
Sent: Monday, April 04, 2016 6:35 AM
To: Thomas Freeman <thomas.freeman at hgst.com>; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; anshul at samsung.com; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com
Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi all,



I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme



1. Moved the StorPortFreePool() call from IoCompletionRoutine() to NvmeInitCallback() (NVMeWaitOnNamespaceReady case).

2. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if the LUN extension is zeroed out.

3. In NvmeInitCallBack(), in the NVMeWaitOnNamespaceReady case, the READ will be retried only if SC = 0x82; otherwise we move to the next NS. If the NS LBA format is unsupported, the miniport still sends the READ command, the device returns SC = 0xb, and the miniport moves to the next NS.



To the mandatory reviewers: can we get feedback or approval for this patch before 7th April?



Thanks,

Suman



------- Original Message -------

Sender : SUMAN PRAKASH B <suman.p at samsung.com>, Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics

Date : Mar 29, 2016 20:27 (GMT+05:30)

Title : Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance



Hi Tom,



Thanks for the review comments. Please find my replies below:



1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady: would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady)? The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization.
[Suman] Agreed. We will move the StorPortFreePool() call to NvmeInitCallback().



2.a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next LUN in the list and issues a READ to that namespace.
[Suman] The following changes were made:
a. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if the LUN extension is zeroed out.
b. In NvmeInitCallBack(), in the NVMeWaitOnNamespaceReady case, the READ will be retried only if SC = 0x82; otherwise we move to the next NS. If the NS LBA format is unsupported, the miniport still sends the READ command, the device returns SC = 0xb, and the miniport moves to the next NS.

Let me know if you have any questions.

We will share the modified code once others share their feedback. Can we get feedback from other companies by 5th April?

Thanks,
Suman



------- Original Message -------

Sender : Thomas Freeman <thomas.freeman at hgst.com>

Date : Mar 29, 2016 00:48 (GMT+05:30)

Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi Suman,
It looks good.
I have a few comments here:

1.       nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady: would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady)? The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization.

2.       I ran into a few problems; here are the details:

* My device configuration: I'm testing a device that supports NS management and it has multiple namespaces. Some of those namespaces are not attached. The format of some of those namespaces is not supported by the driver (e.g., the LBA format contains metadata).

a.       With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next LUN in the list and issues a READ to that namespace.

i.      If that LUN is a detached namespace, the READ fails with a status code of 0xb. The driver attempts to retry until the READ is successful, but the command will never succeed.

ii.      During initialization, if the driver detects a namespace that is in an unsupported format, it zeroes out that LUN entry but leaves that zeroed entry in the LUN extension list. When NVMeRunningWaitOnNamespaceReady is processing the list, it does not recognize this as a zeroed-out entry. Rather, it attempts a READ from this namespace (the NSID is 0 since the init code zeroed out that LUN list entry). The READ and all of its retries fail with a status code of 0xb.
Proposed fix:

1.       Before issuing a READ, ensure that the namespace is attached and has a valid format. If not, increment the counters and move to the next LUN.

2.       Also, in NVMeInitCallback, when handling the case NVMeWaitOnNamespaceReady, instead of looking for an SC of 0x00, only issue a retry if the command fails with SC = 0x82 (NS not ready).
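A rough sketch of both points (pLunExt->slotStatus and the status codes are quoted from this thread; the other helper and variable names are hypothetical):

    /* 1. NVMeRunningWaitOnNamespaceReady(): only probe attached, valid LUNs. */
    if (!NamespaceIsAttached(pLunExt) || (pLunExt->slotStatus != ONLINE)) {
        /* Detached or zeroed-out (unsupported format) entry: count it as
         * done and move to the next LUN instead of issuing the READ. */
        namespacesChecked++;
        lunIndex++;
    } else {
        IssueNamespaceReadyRead(pAE, pLunExt);
    }

    /* 2. NVMeInitCallback(), case NVMeWaitOnNamespaceReady: retry only when
     * the READ fails with SC = 0x82 (namespace not ready); anything else,
     * including the 0xb returned for an unsupported LBA format, moves on. */
    if (statusCode == 0x82) {
        retryRead = TRUE;
    } else {
        lunIndex++;
    }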
Let me know if you have any questions.
Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e.  Thomas.freeman at hgst.com
o.  +1-507-322-2311


From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B
Sent: Wednesday, March 23, 2016 7:27 AM
To: nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim <sukka.kim at samsung.com>; MANOJ THAPLIYAL <m.thapliyal at samsung.com>; tru.nguyen at ssi.samsung.com
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance


Hi all,



This patch includes changes for optimizing disk initialization performance and related changes.

I have put a detailed overview of the changes in the attached doc file (the contents are also copied below), and the attached zip file contains the source code.



Password is samsungnvme



Please let me know if you have any questions.



Thanks,

Suman



******************



Disk Initialization Performance Optimization:

We can use StorPortInitializePerfOpts() with PERF_CONFIGURATION_DATA.MessageTargets, which provides the array of MSI message numbers corresponding to each logical processor. This is an alternative to the learning-cores logic implemented in the OFA driver.



Also, this will directly reduce the time taken for the disk to be enumerated after a device hot insert.

The current OFA driver does the following in its initialization path, on, say, a server with 32 logical processors and a device that supports 32 queues:

1.       Identify controller

2.       Identify namespace - for N number of namespaces

3.       Set features - interrupt coalescing, number of queues, LBA range type.

4.       Create IO completion queue - 32 commands

5.       Create IO submission queue - 32 commands

6.       LearnMapping - 32 Read commands

7.       ReSetupQueues - 32 Delete Sub queues + 32 Delete completion queues

8.       Create IO completion queue - 32 commands

9.       Create IO submission queue - 32 commands

10.   Complete initialization state machine



As can be observed, during disk initialization around 224 commands are processed to set up the IO queues and associate an MSI-X message with each queue. If we use StorPortInitializePerfOpts(), only 64 commands are required instead of 224. On a server with 120 logical processors, 840 commands are required to set up the IO queues and associate an MSI-X message with each queue; if learning cores is avoided, only 240 commands are required instead of 840.
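For clarity, the arithmetic behind those figures (Q = number of queues, equal to the number of logical processors in these examples):

    With learning cores:    7 x Q commands
        (create CQ + create SQ + learning READ + delete SQ + delete CQ
         + re-create CQ + re-create SQ, one of each per queue)
        Q = 32  -> 224 commands;    Q = 120 -> 840 commands
    Without learning cores: 2 x Q commands (create CQ + create SQ only)
        Q = 32  -> 64 commands;     Q = 120 -> 240 commands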



Also, we can fall back to learning cores if the StorPortInitializePerfOpts() API fails.



We see improved device up time after this change. Note that if the number of queues supported by the device is less than the number of logical processors, the driver does not execute learning cores anyway, so there is no improvement in that case even with StorPortInitializePerfOpts().



Server with 32 logical processors (disk up time in seconds):

                  Disk capacity = 400 GB                    Disk capacity = 1.6 TB
  OFA version     Disk from vendor 1    Disk from vendor 2
  Rev 133         14                    6.5                  14.5
  Latest          5                     5                    13.5

PS: data may change for different vendor SSDs



Code changes:

1.     Changes w.r.t StorPortInitializePerfOpts().

a.     In NVMeInitialize(), moved the StorPortInitializePerfOpts() call after NVMeEnumMsiMessages() so that LastRedirectionMessageNumber can be set in the perf options.

b.     Set the flags STOR_PERF_INTERRUPT_MESSAGE_RANGES and STOR_PERF_ADV_CONFIG_LOCALITY, and the values FirstRedirectionMessageNumber, LastRedirectionMessageNumber and MessageTargets in StorPortInitializePerfOpts() to get the MSI-X-to-core mapping in MessageTargets. If this API returns success, learning cores can be skipped.

c.     If StorPortInitializePerfOpts() fails, the MSI-X-to-core mapping in NVMeMsiMapCores() is assigned sequentially and learning cores is executed; during learning cores, the MSI-X-to-core mapping is corrected in IoCompletionRoutine(). If StorPortInitializePerfOpts() succeeds, the MSI-X-to-core mapping in NVMeMsiMapCores() is taken from MessageTargets and learning cores is skipped.
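As a rough illustration of (a)-(c), here is a minimal sketch of the call sequence. The Storport structure, flags and status codes are standard, but the device-extension fields (pAE->NumMsiMessagesGranted, pAE->pMsgGroupAffinity, pAE->LearningCoresRequired) are hypothetical placeholders rather than the actual OFA names:

    /* Sketch only: query Storport, then enable message-range redirection and
     * advanced locality so the MSI-X-to-core mapping is reported back. */
    PERF_CONFIGURATION_DATA perfData = {0};
    BOOLEAN perfOptsApplied = FALSE;
    ULONG status;

    perfData.Version = STOR_PERF_VERSION;
    perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

    status = StorPortInitializePerfOpts(pAE, TRUE, &perfData);     /* query */
    if ((status == STOR_STATUS_SUCCESS) &&
        (perfData.Flags & STOR_PERF_INTERRUPT_MESSAGE_RANGES) &&
        (perfData.Flags & STOR_PERF_ADV_CONFIG_LOCALITY)) {

        perfData.Flags = STOR_PERF_INTERRUPT_MESSAGE_RANGES |
                         STOR_PERF_ADV_CONFIG_LOCALITY;
        /* Message range already known from NVMeEnumMsiMessages(), per (a). */
        perfData.FirstRedirectionMessageNumber = 0;
        perfData.LastRedirectionMessageNumber  = pAE->NumMsiMessagesGranted - 1;
        /* Array that receives the per-message group affinity
         * (the MSI-X-to-core mapping), as described in (b). */
        perfData.MessageTargets = pAE->pMsgGroupAffinity;

        if (StorPortInitializePerfOpts(pAE, FALSE, &perfData) ==
            STOR_STATUS_SUCCESS) {
            perfOptsApplied = TRUE;
        }
    }
    /* Per (c): execute learning cores only when the perf options were refused. */
    pAE->LearningCoresRequired = !perfOptsApplied;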



2.     When learning cores is skipped, controller initialization completes faster. But we have observed that on some devices the namespace is not yet ready to process I/O at that point. When the kernel then sends I/O, the device returns SC = NAMESPACE_NOT_READY and the miniport returns SCSI_SENSEQ_BECOMING_READY, for which Storport retries after some time. If the device takes too long to initialize the namespace, Storport gives up and the disk shows as uninitialized in Disk Management.
Hence controller initialization has to be completed only after the namespace is ready. For this, a new state is introduced in NVMeRunning() which waits until the NS is ready.

a.       Introduced a new state NVMeWaitOnNamespaceReady in NVMeRunning().

b.     In IoCompletionRoutine(), determine which CQ to look in based on WaitOnNamespaceReady state.

c.     In NVMeInitCallback(), implemented the callback handling for NVMeWaitOnNamespaceReady.

d.     In IoCompletionRoutine(), free the read buffer used for the namespace-ready check.

Note:

a.     We have observed that higher-capacity namespaces can take a long time to initialize, so the passiveTimeout value in NVMePassiveInitialize() is not sufficient; the timeout value needs to be increased based on vendor requirements.

b.     Checking for namespace ready is skipped during dump/hibernation mode and resume from hibernation.
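A rough sketch of the flow in 2(a)-(d) (the state handling is simplified and the variable names are illustrative, not the actual OFA identifiers):

    /* NVMeRunning(), new state from 2(a): probe the current namespace with a
     * READ. Per note (b), this state is bypassed in dump/hibernate mode. */
    case NVMeWaitOnNamespaceReady:
        IssueNamespaceReadyRead(pAE, currentNamespaceIndex);
        break;

    /* NVMeInitCallback(), per 2(c): completion of the probe READ. The read
     * buffer allocated for the probe is freed once the completion is seen. */
    case NVMeWaitOnNamespaceReady:
        if (namespaceNotReadyYet) {
            /* Stay in this state; the READ will be re-issued. */
            pAE->DriverState.NextDriverState = NVMeWaitOnNamespaceReady;
        } else if (++currentNamespaceIndex < visibleNamespaceCount) {
            pAE->DriverState.NextDriverState = NVMeWaitOnNamespaceReady;
        } else {
            pAE->DriverState.NextDriverState = NVMeStartComplete;
        }
        break;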



3.     Usually, the number of MSI-X messages supported by the device and granted (StorPortGetMSIInfo) is the number of IO queues + 1 admin queue. But we place the admin queue and the first I/O queue on core 0, and hence MSI-X message 0 is shared between the admin queue and the first I/O queue. When there are more active cores than queues supported, one of the message IDs should not be counted; changes were made in NVMeEnumMsiMessages() accordingly.
For example, with cores = 32 and admin + IO queues = 1 + 8, MsgID (in NVMeEnumMsiMessages()) should be 8.
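For illustration of the example above (hypothetical variable names; the actual change is inside NVMeEnumMsiMessages()):

    ULONG adminQueues = 1;
    ULONG ioQueues    = 8;                        /* granted by the device  */
    ULONG activeCores = 32;
    ULONG MsgID       = adminQueues + ioQueues;   /* 9 messages granted     */

    if (activeCores > ioQueues) {
        /* MSI-X message 0 is shared by the admin queue and the first IO
         * queue, so one message ID is not counted separately. */
        MsgID--;                                  /* MsgID ends up as 8     */
    }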



4.     In IoCompletionRoutine(), during learning cores, QueueNo is remapped sequentially only if MSIGranted is less than the number of active cores; otherwise QueueNo remains the same as before.

