From suman.p at samsung.com  Mon Apr  4 06:34:43 2016
From: suman.p at samsung.com (SUMAN PRAKASH B)
Date: Mon, 04 Apr 2016 13:34:43 +0000 (GMT)
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
Message-ID: <82.E4.04892.37D62075@epcpsbgx2.samsung.com>

An HTML attachment was scrubbed...
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Samsung_DiskInitPerfOpt_v2.7z
Type: application/octet-stream
Size: 145535 bytes

From carolyn.d.foster at intel.com  Tue Apr  5 17:03:37 2016
From: carolyn.d.foster at intel.com (Foster, Carolyn D)
Date: Wed, 6 Apr 2016 00:03:37 +0000
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
In-Reply-To: <82.E4.04892.37D62075@epcpsbgx2.samsung.com>
References: <82.E4.04892.37D62075@epcpsbgx2.samsung.com>
Message-ID:

Hi Suman,

I have a few comments and suggestions:

1. Observed performance degradation, potentially related to line 594 in nvmeInit.c - on some systems we observed a degradation in performance, and we suspect it is related to this change. If we do not share MSI-X vector 0 between the admin queue and an IO queue, we create one fewer queue to submit IO to. Did you run any performance testing before and after these changes? Details of the system configurations we tested and the observed results are included below.

2. nvmeStat.c @ line 784: if there are zero namespaces it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can go directly to NVMeStartComplete.

3. nvmeStat.c @ line 899: in crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.

Performance configuration and data:
OS: Windows 8.1 x64
Workload: 100% sequential read
Compared the OFA trunk to the Samsung patch.

Summary of observed results:
- System with fewer than 32 logical CPU cores: no delta in performance observed
- System with between 32 and 64 cores: 20%-50% drop in performance observed
- System with more than 64 cores: 30%-40% drop in performance observed

Carolyn

From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B
Sent: Monday, April 04, 2016 6:35 AM
To: Thomas Freeman; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim; anshul at samsung.com; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com
Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi all,

I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme.

1. Moved the StorPortFreePool() call from IoCompletionRoutine() to NVMeInitCallback() - NVMeWaitOnNamespaceReady.
2. In NVMeRunningWaitOnNamespaceReady(), the READ command is sent only when the NS is ATTACHED and pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if the LUN extension has been zeroed out.
3. In NVMeInitCallback(), in the NVMeWaitOnNamespaceReady case, the READ is retried only if SC = 0x82; otherwise the driver moves to the next NS. If the NS LBA format is unsupported, the miniport sends the READ command, the device returns SC = 0xb, and the miniport moves to the next NS.
To Mandatory reviewers: can we get feedback or approval for this patch before 7th April?

Thanks,
Suman

------- Original Message -------
Sender : SUMAN PRAKASH B, Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics
Date : Mar 29, 2016 20:27 (GMT+05:30)
Title : Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi Tom,

Thanks for the review comments. Please find my replies below:

1. nvmeStd.c::IoCompletionRoutine - when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NVMeInitCallback (when processing NVMeWaitOnNamespaceReady)? The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization.
[Suman] Agreed. We will move the StorPortFreePool call to NVMeInitCallback.

2.a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next LUN in the list and issues a READ to that namespace.
[Suman] The following changes are made:
a. In NVMeRunningWaitOnNamespaceReady(), the READ command is sent only when the NS is ATTACHED and pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if the LUN extension has been zeroed out.
b. In NVMeInitCallback(), in the NVMeWaitOnNamespaceReady case, the READ is retried only if SC = 0x82; otherwise the driver moves to the next NS. If the NS LBA format is unsupported, the miniport sends the READ command, the device returns SC = 0xb, and the miniport moves to the next NS.

Let me know if you have any questions. We will share the modified code once others share their feedback. Can we get feedback from other companies by 5th April?

Thanks,
Suman

------- Original Message -------
Sender : Thomas Freeman
Date : Mar 29, 2016 00:48 (GMT+05:30)
Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi Suman,

It looks good. I have a few comments here:

1. nvmeStd.c::IoCompletionRoutine - when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NVMeInitCallback (when processing NVMeWaitOnNamespaceReady)? The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization.

2. I ran into a few problems; here are the details.
My device configuration: I am testing a device that supports NS management and has multiple namespaces. Some of those namespaces are not attached, and the format of some of them is not supported by the driver (e.g. the LBA format contains metadata).
a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next LUN in the list and issues a READ to that namespace.
i. If that LUN is a detached namespace, the READ fails with a status code of 0xb. The driver retries until the READ succeeds, but the command will never succeed.
ii. During initialization, if the driver detects a namespace that is in an unsupported format, it zeroes out that LUN entry but leaves the zeroed entry in the LUN extension list. When NVMeRunningWaitOnNamespaceReady processes the list, it does not recognize this as a zeroed-out entry; rather, it attempts a READ from this namespace (the NSID is 0 since the init code zeroed out that LUN list entry). The READ and all of its retries fail with a status code of 0xb.

Proposed fix:
1. Before issuing a READ, ensure that the namespace is attached and has a valid format. If not, increment the counters and move to the next LUN.
2. Also, in NVMeInitCallback, when handling the NVMeWaitOnNamespaceReady case, instead of looking for an SC of 0x00, only issue a retry if the command fails with SC = 0x82 (NS not ready).

Let me know if you have any questions.

Tom Freeman
Software Engineer, Device Manager and Driver Development
Western Digital Corporation
e. Thomas.freeman at hgst.com
o. +1-507-322-2311
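[Editorial reference] As a reference for the proposed fix, here is a minimal sketch of the gating and retry logic being discussed. Only pLunExt, slotStatus, ONLINE and the 0x82/0xb status codes come from the thread; the nsStatus/ATTACHED names and the helper functions are illustrative assumptions, so treat this as a sketch rather than the actual patch code.

    /* Sketch only; not the actual OFA driver code. */
    #define NVME_SC_NS_NOT_READY  0x82            /* NVMe "Namespace Not Ready" status   */

    /* NVMeRunningWaitOnNamespaceReady(): poll only namespaces that can ever succeed. */
    if ((pLunExt->nsStatus == ATTACHED) && (pLunExt->slotStatus == ONLINE)) {
        IssueNamespaceReadyRead(pAE, lunId);      /* hypothetical helper: send the READ  */
    } else {
        AdvanceToNextLun(pAE);                    /* zeroed-out or detached entry: skip  */
    }

    /* NVMeInitCallback(), NVMeWaitOnNamespaceReady case: retry only while "not ready". */
    if (statusCode == NVME_SC_NS_NOT_READY) {
        IssueNamespaceReadyRead(pAE, lunId);      /* namespace still coming up: retry    */
    } else {
        AdvanceToNextLun(pAE);                    /* success, or e.g. SC = 0xb: move on  */
    }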
From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B
Sent: Wednesday, March 23, 2016 7:27 AM
To: nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi all,

This patch includes changes for optimizing the disk initialization performance and related changes. A detailed overview of the changes is in the attached doc file (the contents are also copied below), and the attached zip file contains the source code. The password is samsungnvme.

Please let me know if you have any questions.

Thanks,
Suman

******************
Disk Initialization Performance Optimization:

We can use StorPortInitializePerfOpts() and PERF_CONFIGURATION_DATA.MessageTargets, which provides the array of MSI message numbers corresponding to the logical processors. This is an alternative to the learning-cores logic implemented in the OFA driver, and it directly reduces the time taken for the disk to be enumerated after a device hot insert.

The current OFA driver does the following in its initialization path, for example on a server with 32 logical processors and a device that supports 32 queues:
1. Identify Controller
2. Identify Namespace - for each of the N namespaces
3. Set Features - interrupt coalescing, number of queues, LBA range type
4. Create IO completion queues - 32 commands
5. Create IO submission queues - 32 commands
6. LearnMapping - 32 Read commands
7. ReSetupQueues - 32 Delete submission queue commands + 32 Delete completion queue commands
8. Create IO completion queues - 32 commands
9. Create IO submission queues - 32 commands
10. Complete the initialization state machine

As can be seen, during disk initialization around 224 commands are processed to set up the IO queues and associate an MSI-X message with each queue. If we use StorPortInitializePerfOpts(), only 64 commands are required instead of 224. On a server with 120 logical processors, 840 commands are required to set up the IO queues and associate an MSI-X message with each queue; if learning cores is avoided, only 240 commands are required instead of 840. We can also fall back to learning cores if StorPortInitializePerfOpts() fails.

We see improved device up time after this change. Note that if the number of queues supported by the device is less than the number of logical processors, the driver does not execute learning cores, so there is no improvement in that case even if we use StorPortInitializePerfOpts().

Server with 32 logical processors, disk up time in seconds:

OFA version   Disk from vendor 1             Disk from vendor 2
              400 GB          1.6 TB
Rev 133       14              6.5            14.5
Latest        5               5              13.5

PS: the data may vary for different vendor SSDs.
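[Editorial reference] For reference, a minimal sketch of the perf-opts query described above, assuming the storport.h definitions. The flags and fields named in the proposal (STOR_PERF_INTERRUPT_MESSAGE_RANGES, STOR_PERF_ADV_CONFIG_LOCALITY, FirstRedirectionMessageNumber, LastRedirectionMessageNumber, MessageTargets) are real Storport names; pAE, MsgCount, MAX_MSIX_VECTORS and learningNeeded are illustrative, so this is a sketch of the approach rather than the patch itself.

    /* Sketch: obtain the MSI-message-to-processor affinity from Storport instead of
     * running the learning-cores pass. */
    PERF_CONFIGURATION_DATA perfData = {0};
    GROUP_AFFINITY          msgTargets[MAX_MSIX_VECTORS] = {0};  /* assumed upper bound */
    ULONG                   status;
    BOOLEAN                 learningNeeded = TRUE;

    perfData.Version = STOR_PERF_VERSION;
    perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

    /* First ask Storport which optimizations are supported on this system. */
    status = StorPortInitializePerfOpts(pAE, TRUE, &perfData);

    if ((status == STOR_STATUS_SUCCESS) &&
        (perfData.Flags & STOR_PERF_INTERRUPT_MESSAGE_RANGES) &&
        (perfData.Flags & STOR_PERF_ADV_CONFIG_LOCALITY)) {

        perfData.Flags = STOR_PERF_INTERRUPT_MESSAGE_RANGES | STOR_PERF_ADV_CONFIG_LOCALITY;
        perfData.FirstRedirectionMessageNumber = 0;
        perfData.LastRedirectionMessageNumber  = MsgCount - 1;  /* from NVMeEnumMsiMessages() */
        perfData.MessageTargets = msgTargets;                   /* per-message group affinity */

        status = StorPortInitializePerfOpts(pAE, FALSE, &perfData);
        if (status == STOR_STATUS_SUCCESS) {
            /* msgTargets[] now describes which processors each MSI-X message targets,
             * so the MSI-X-to-core mapping can be built without the learning pass. */
            learningNeeded = FALSE;
        }
    }
    /* learningNeeded == TRUE falls back to the existing learning-cores path. */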
Code changes:

1. Changes w.r.t. StorPortInitializePerfOpts():
a. In NVMeInitialize(), moved the StorPortInitializePerfOpts() call after NVMeEnumMsiMessages() so that LastRedirectionMessageNumber can be set in StorPortInitializePerfOpts().
b. Set the flags STOR_PERF_INTERRUPT_MESSAGE_RANGES and STOR_PERF_ADV_CONFIG_LOCALITY, and the values FirstRedirectionMessageNumber, LastRedirectionMessageNumber and MessageTargets, in StorPortInitializePerfOpts() to obtain the MSI-X-to-core mapping in MessageTargets. If this API returns success, learning cores can be skipped.
c. If StorPortInitializePerfOpts() fails, in NVMeMsiMapCores() the MSI-X-to-core mapping is assigned sequentially and learning cores is executed; during learning cores, in IoCompletionRoutine(), the MSI-X-to-core mapping is re-mapped. If StorPortInitializePerfOpts() succeeds, in NVMeMsiMapCores() the MSI-X-to-core mapping is taken from MessageTargets and learning cores is skipped.

2. When learning cores is skipped, controller initialization completes faster, but we have observed that on some devices the namespace is not yet ready to process I/O at that point. When the kernel then sends I/O, the device returns SC = NAMESPACE_NOT_READY and the miniport returns SCSI_SENSEQ_BECOMING_READY, for which Storport retries after some time. If the device takes too long to initialize the namespace, Storport gives up and the disk shows as uninitialized in Disk Management. Hence controller initialization has to be completed only after the namespace is ready. For this, a new state that waits until the NS is ready is introduced in NVMeRunning():
a. Introduced a new state NVMeWaitOnNamespaceReady in NVMeRunning().
b. In IoCompletionRoutine(), determine which CQ to look in based on the WaitOnNamespaceReady state.
c. In NVMeInitCallback(), implemented the callback for NVMeWaitOnNamespaceReady.
d. In IoCompletionRoutine(), free the read buffer used for the namespace-ready check.
Note:
a. We have observed that higher-capacity namespaces can take a long time to initialize, so the passiveTimeout value in NVMePassiveInitialize() is not sufficient; the timeout value needs to be increased based on vendor requirements.
b. The namespace-ready check is skipped during dump/hibernation mode and on resume from hibernation.

3. Usually, the number of MSI-X messages supported by the device and granted (StorPortGetMSIInfo) equals the number of IO queues + 1 admin queue. But we share the admin queue and the first IO queue on core 0, so MSI-X message 0 is shared between the admin queue and the first IO queue. When there are more active cores than supported queues, one of the message IDs should not be counted; NVMeEnumMsiMessages() is changed accordingly. For example, with cores = 32 and admin + IO queues = 1 + 8, MsgID (in NVMeEnumMsiMessages()) should be 8 (a short worked example follows this list).

4. In IoCompletionRoutine(), for learning cores, the queue number is remapped sequentially only if the number of granted MSI messages is less than the number of active cores; otherwise the queue number remains unchanged.
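[Editorial reference] To make the message-ID accounting in item 3 concrete, here is a small worked example using the numbers quoted above; the values and variable names are illustrative only, not driver code.

    /* Worked example for item 3 (illustrative values). */
    ULONG ActiveCores = 32;                       /* more active cores than IO queues           */
    ULONG AdminQueues = 1;
    ULONG IoQueues    = 8;
    ULONG MsgGranted  = AdminQueues + IoQueues;   /* 9 MSI-X messages granted to the adapter    */
    ULONG MsgID       = MsgGranted - 1;           /* 8: message 0 is shared by the admin queue  */
                                                  /* and the first IO queue, so only 8 distinct */
                                                  /* message IDs are consumed by the IO queues  */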
From judy.brock at ssi.samsung.com  Tue Apr  5 17:18:41 2016
From: judy.brock at ssi.samsung.com (Judy Brock-SSI)
Date: Wed, 06 Apr 2016 00:18:41 +0000
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
In-Reply-To:
References: <82.E4.04892.37D62075@epcpsbgx2.samsung.com>
Message-ID: <36E8D38D6B771A4BBDB1C0D800158A51865F42FB@SSIEXCH-MB3.ssi.samsung.com>

Hi Carolyn,

Can you provide performance test details?

Thanks,
Judy
From suman.p at samsung.com  Wed Apr  6 09:01:14 2016
From: suman.p at samsung.com (SUMAN PRAKASH B)
Date: Wed, 06 Apr 2016 16:01:14 +0000 (GMT)
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
Message-ID:

An HTML attachment was scrubbed...

From carolyn.d.foster at intel.com  Wed Apr  6 14:53:44 2016
From: carolyn.d.foster at intel.com (Foster, Carolyn D)
Date: Wed, 6 Apr 2016 21:53:44 +0000
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
In-Reply-To: <72.EC.04892.9C235075@epcpsbgx2.samsung.com>
References: <72.EC.04892.9C235075@epcpsbgx2.samsung.com>
Message-ID:

Hi Suman, thank you for the clarification. I will confirm the rest of the workload details and have that information for you tomorrow. In the meantime I will also rerun our performance tests to confirm that the results are reproducible, and will run the tests without line 594 as you suggested.

Carolyn

From: SUMAN PRAKASH B [mailto:suman.p at samsung.com]
Sent: Wednesday, April 06, 2016 9:01 AM
To: Foster, Carolyn D; Thomas Freeman; nvmewin at lists.openfabrics.org
Cc: Seokhwan Kim; ANSHUL SHARMA; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com
Subject: Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance

Hi Carolyn,

Thanks for the comments and suggestions. Please find my comments below:

1. Observed performance degradation, potentially related to line 594 in nvmeInit.c:
a. We tested on servers with 32 and 64 logical processors, on Windows 8.1 and Windows Server 2012 R2, with multiple vendor devices using both 1-to-1 and many-to-1 core-to-queue mapping, and with both the R133 and the latest driver, and we did not observe any performance drop for 100% sequential read (128K) with 32 and 64 worker threads at queue depth 32. We have tested both the case where StorPortInitializePerfOpts() succeeds and the case where it fails (learning cores is executed).
b. Regarding MsgID-- at nvmeInit.c line 594: the NVMe device supports a number of message IDs equal to 1 admin + N IO queues, but in the OFA driver, since message ID 0 is shared between the admin queue and an IO queue, only ((1 admin + N IO queues) - 1) message IDs are ever used. With MsgID--, we make sure all IO queues are created with unique message IDs and message ID 0 is shared between the admin queue and one IO queue.
If the device you are testing has a total number of message IDs equal to 1 admin queue + N IO queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, could you please take the performance benchmark again after removing MsgID--.
c. Can you please let us know the number of queues and the number of message IDs supported by the target device you are testing with.
d. On servers we usually test with Windows Server edition OSes. When we tested with Windows 8.1, we observed that the number of logical processors supported by Windows 8.1 is at most 32, even when the server has more than 32 logical processors. We will try to reproduce the performance degradation behavior; meanwhile, can you please provide us with more debug data.

2. nvmeStat.c @ line 784: Agreed. We will change as suggested.

3. nvmeStat.c @ line 899: Agreed. We will change as suggested.

Thanks,
Suman
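[Editorial reference] For items 2 and 3 above, a rough sketch of the intended guards is shown here. The state names come from the thread; the field names visibleLuns and ntldrDump and the exact transition site are assumed from the OFA driver, so treat this as an illustration rather than the agreed patch code.

    /* Sketch only; visibleLuns and ntldrDump are assumed field names. */
    if ((pAE->visibleLuns == 0) ||        /* no namespace exposed: nothing to wait for  */
        (pAE->ntldrDump == TRUE)) {       /* crash-dump / hibernate path: skip the wait */
        pAE->DriverState.NextDriverState = NVMeStartComplete;
    } else {
        pAE->DriverState.NextDriverState = NVMeWaitOnNamespaceReady;
    }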
From carolyn.d.foster at intel.com  Thu Apr  7 14:12:41 2016
From: carolyn.d.foster at intel.com (Foster, Carolyn D)
Date: Thu, 7 Apr 2016 21:12:41 +0000
Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance
References: <72.EC.04892.9C235075@epcpsbgx2.samsung.com>
Message-ID:

Hi Suman,

I have an update on our performance data. On the systems with more than 32 cores and more than 64 cores, there was no performance delta in the following workloads:
- 4 workers, 32 queue depth
- 8 workers, 16 queue depth

We did observe what looks like a more consistent (and much smaller) 10% drop in a single-worker, queue depth 1 workload on the systems with 32 cores. It seems our initial results may have been flawed; based on your comments and performance analysis, there may be no major issue here.

Carolyn
With MsgID--, we make sure all IO queues are created with unique msg id and msg id 0 is shared with admin queue and 1 io queue. If the device that you are testing has total number of msg ids equal to 1 admin queue + N io queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, can you please take the perf benchmark after removing MsgID--. c. Can you please let us know the number of queues and number of Messaged IDs supported by the target device that you are testing with. d. On servers, we usually test on Windows Server edition OSes. When we tested with Windows 8.1, we observed that the number of logical processors supported in Windows 8.1 is maximum 32, even when the server have more than 32 logical processors. We will try to reproduce the performance degradation behavior, meanwhile can you please provide us more debug data. 2. nvmeStat.c @ line 784 : Agreed. We will change as suggested. 3. nvmeStat.c @line 899 : Agreed. We will change as suggested. Thanks, Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 06, 2016 05:33 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, I have a few comments and suggestions: 1. Observed performance degradation, potentially related to line 594 in nvmeInit.c – We noticed on some systems we saw a degradation in performance and we suspect it’s related to this change. If we don’t share MSIX vector 0 between the admin queue and an IO queue we are creating one fewer queue to submit IO to. Did you execute any performance testing before and after these changes? I have included some details about the system configurations we tested and the observed results below. 2. nvmeStat.c @ line 784 : if there is zero namespace it is not necessary to go to NVMeWaitOnNamespaceReady , instead we can directly start NVMeStartComplete. 3. nvmeStat.c @line 899 : In crash/Hibernate mode it is not necessary to go to the NVMeWaitOnNamespaceReady. Performance configuration and data: OS: Windows 8.1 x64 Workload: 100% sequential Read Compared the OFA trunk to the Samsung patch Summary of observed results: • System with fewer than 32 logical CPU cores: o No delta in performance observed • System with between 32 and 64 cores: o 20%-50% drop in performance observed • System with more than 64 cores: o 30%-40% drop in performance observed Carolyn From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Monday, April 04, 2016 6:35 AM To: Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; anshul at samsung.com; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme 1. Moved the StorPortFreePool()from IoCompletionRoutine() to NvmeInitCallback() - NvmeWaitOnNamespaceReady. 2. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus willnot be ONLINE if lun extension is zero'ed out. 3. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. 
To Mandatory reviewers: Can we get feedback or approval for this patch before 7th April. Thanks, Suman ------- Original Message ------- Sender : SUMAN PRAKASH B> Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Mar 29, 2016 20:27 (GMT+05:30) Title : Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Tom, Thanks for the review comments. Please find my replies below: 1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. [Suman] Agreed. We will move the StorPortFreePool to NvmeInitCallback. 2.a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. [Suman] Following changes are made: a. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if lun extension is zero'ed out. b. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. Let me know if you have any questions. We will share the modified code once others share their feedback. Can we get feedback from other companies by 5th April? Thanks, Suman ------- Original Message ------- Sender : Thomas Freeman> Date : Mar 29, 2016 00:48 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, It looks good. I have a few comments here: 1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. 2. I ran into a few problems, here are the details: *My device configuration: I'm testing a device that supports NS management and it has multiple namespaces. Some of those namespaces are not attached. The format of some of those namespaces is not supported by the driver (e.g. LBA contains metadata) a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. i. If that lun is a detached namespace, the READ fails with a status code of 0xb. The driver attempts to retry until the READ is successful, but the command will never succeed. ii. During initialization, if the driver detects a namespace that is in an unsupported format, it zero's out that LUN entry, but leaves that zero’ed entry in the LUN extension list. When NVMeRunningWaitOnNamespaceReady is processing the list, it does not recognize this as a zero'ed out entry. Rather is attempts a READ from this namespace (the NSID is 0 since the init code zero'ed out that Lun list entry). The READ and all of its retries fail with a status code of 0xb. Proposed fix: 1. Before issuing a READ, ensure that namespace is attached and a valid format. If not, increment the counters and move to the next Lun. 2. 
Also, in NVMeInitCallback, when handling the case NVMeWaitOnNamespaceReady, instead of looking for an SC of 0x00, only issue a retry if the command fails with SC = 0x82 (NS not ready). Let me know if you have any questions. Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 [http://www.samsung.net/service/ml/AttachController/image001.jpg?cmd=downdirectly&filepath=/LOCAL/ML/CACHE/s/20160406/image001.jpg at 01D18E55.4C7BA130309suman.p&contentType=IMAGE/JPEG;charset=UTF8&msgno=309&partno=2&foldername=INBOX&msguid=48143] From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Wednesday, March 23, 2016 7:27 AM To: nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, This patch includes changes for optimizing the disk initialization performance and relevant changes. I have made a detailed overview of the changes in the attached doc file(the contents are also copied here below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman ****************** Disk Initialization Performance Optimization: We can use the StorPortInitializePerfOpts(), PERF_CONFIGURATION_DATA.MessageTargets which provides the array of MSI numbers corresponding to each logical processor. This is an alternative of using the Learning cores logic implemented in the OFA driver. Also this will directly reduce the time taken for the disk to be enumerated after a device hot insert. The current OFA driver does the following in its initialization path, let’s say on a server which has 32 logical processors and device which supports 32 queues – 1. Identify controller 2. Identify namespace - for N number of namespaces 3. Set features - Interrupt coalescing, number of queues, lba range type. 4. Create IO completion queue - 32 commands 5. Create IO submission queue - 32 commands 6. LearnMapping - 32 Read commands 7. ReSetupQueues - 32 Delete Sub queues + 32 Delete completion queues 8. Create IO completion queue - 32 commands 9. Create IO submission queue - 32 commands 10. Complete initialization state machine As can be observed, during disk initialization, around 224 commands are processed for setting up the IO queues and associate the MSI-x number to each queues. If we use StorPortInitializePerfOpts(), we required only 64 commands instead of 224 commands. On a server which as 120 logical processors, 840 commands are required for setting up the IO queues and associate the MSI-x number to each queues. If learning cores is avoided, only 240 commands are required instead of 840 commands. Also we can fall back to learning cores if the API StorPortInitiailzePerfOpts() fails. We see improved device up time after this change. Also, if the number of queues supported by the device is less than the number of logical processors, the driver does not execute the learning cores, hence there won’t be any improvement even if we use StorPortInitializePerfOpts(). Server with 32 logical processors: OFA version disk up time in seconds Disk capacity = 400 GB Disk capacity = 1.6 TB Disk from vendor 1 Disk from vendor 2 Rev 133 14 6.5 14.5 Latest 5 5 13.5 PS: data may change for different vendor SSDs Code changes: 1. Changes w.r.t StorPortInitializePerfOpts(). a. 
Code changes:
1. Changes w.r.t. StorPortInitializePerfOpts():
a. In NVMeInitialize(), moved the StorPortInitializePerfOpts() call after NVMeEnumMsiMessages() so that LastRedirectionMessageNumber can be set in StorPortInitializePerfOpts().
b. Set the flags STOR_PERF_INTERRUPT_MESSAGE_RANGES and STOR_PERF_ADV_CONFIG_LOCALITY, and the values FirstRedirectionMessageNumber, LastRedirectionMessageNumber and MessageTargets, in StorPortInitializePerfOpts() to get the MSI-X-to-core mapping in MessageTargets. If this API returns success, learning cores can be skipped.
c. If StorPortInitializePerfOpts() fails, in NVMeMsiMapCores() the mapping of MSI-X to cores is assigned sequentially, and learning cores is executed; during learning cores, in IoCompletionRoutine(), the MSI-X-to-core mapping is re-mapped. If StorPortInitializePerfOpts() succeeds, in NVMeMsiMapCores() the mapping of MSI-X to cores is taken from MessageTargets and learning cores is skipped. (A sketch of this call sequence follows at the end of this message.)
2. When learning cores is skipped, controller initialization completes faster. But we have observed that on some devices the namespace is not yet ready to process I/O at that point. When the kernel sends I/O, the device returns SC = NAMESPACE_NOT_READY and the miniport returns SCSI_SENSEQ_BECOMING_READY, for which Storport retries after some time. If the device takes too long to initialize the namespace, Storport gives up and the disk shows as uninitialized in disk management. Hence controller initialization has to be completed only after the namespace is ready. For this, a new state is introduced in NVMeRunning(), which waits until the NS is ready.
a. Introduced a new state NVMeWaitOnNamespaceReady in NVMeRunning().
b. In IoCompletionRoutine(), determine which CQ to look in based on the WaitOnNamespaceReady state.
c. In NVMeInitCallback(), implemented the callback for NVMeWaitOnNamespaceReady.
d. In IoCompletionRoutine(), free the read buffer used for the namespace-ready check.
Note:
a. We have observed that higher-capacity namespaces take long to initialize, hence the passiveTimeout value in NVMePassiveInitialize() is not sufficient. We need to increase the timeout value based on vendor requirements.
b. Checking for namespace ready is skipped during dump/hibernation mode and resume from hibernation.
3. Usually, the number of MSI-X vectors supported by the device and granted (StorPortGetMSIInfo) will be number of IO queues + 1 admin queue. But we share the admin queue and the first I/O queue on core 0, and hence MSI-X vector 0 is shared between the admin queue and the first I/O queue. In case there are more active cores than queues supported, one of the message IDs should not be counted; changes were made in NVMeEnumMsiMessages() accordingly. For example, with cores = 32 and admin + IO queues = 1 + 8, MsgID (in NVMeEnumMsiMessages()) should be 8.
4. In IoCompletionRoutine(), for learning cores, QueueNo is remapped sequentially only if the number of MSI messages granted is less than the number of active cores. Otherwise QueueNo remains the same as before.
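[Editor's note] A minimal sketch of the StorPortInitializePerfOpts() call sequence described in items 1.a-1.c above, not the actual patch code. It assumes the storport.h declarations from the WDK; HYPOTHETICAL_EXT and TryPerfOptsMapping() are illustrative stand-ins for the driver's device extension and init path. If the call fails or the needed flags are not supported, the driver falls back to the existing learning-cores path, as described above.

    /* Sketch only: stand-in for the miniport's HwDeviceExtension and init path. */
    #include <storport.h>

    #define MAX_MSIX_MSGS 256

    typedef struct _HYPOTHETICAL_EXT {
        ULONG          NumMsiMsgGranted;          /* from NVMeEnumMsiMessages()          */
        BOOLEAN        LearningComplete;          /* TRUE when learning can be skipped   */
        GROUP_AFFINITY MsgTargets[MAX_MSIX_MSGS]; /* per-message processor affinity      */
    } HYPOTHETICAL_EXT, *PHYPOTHETICAL_EXT;

    static BOOLEAN TryPerfOptsMapping(PHYPOTHETICAL_EXT pAE)
    {
        PERF_CONFIGURATION_DATA perfData = {0};
        ULONG status;

        perfData.Version = STOR_PERF_VERSION;
        perfData.Size    = sizeof(PERF_CONFIGURATION_DATA);

        /* First ask Storport which perf optimizations it supports. */
        status = StorPortInitializePerfOpts(pAE, TRUE, &perfData);
        if (status != STOR_STATUS_SUCCESS ||
            (perfData.Flags & STOR_PERF_INTERRUPT_MESSAGE_RANGES) == 0 ||
            (perfData.Flags & STOR_PERF_ADV_CONFIG_LOCALITY) == 0) {
            return FALSE;                 /* fall back to learning cores */
        }

        /* Set the flags, the redirection message range and a MessageTargets
         * buffer, as described in item 1.b above, to obtain the MSI-X-to-core
         * mapping in MessageTargets. */
        perfData.Flags = STOR_PERF_INTERRUPT_MESSAGE_RANGES |
                         STOR_PERF_ADV_CONFIG_LOCALITY;
        perfData.FirstRedirectionMessageNumber = 0;
        perfData.LastRedirectionMessageNumber  = pAE->NumMsiMsgGranted - 1;
        perfData.MessageTargets = pAE->MsgTargets;

        status = StorPortInitializePerfOpts(pAE, FALSE, &perfData);
        if (status != STOR_STATUS_SUCCESS) {
            return FALSE;                 /* fall back to learning cores */
        }

        /* MsgTargets[n].Group/Mask now describe the processors tied to MSI-X
         * message n; NVMeMsiMapCores() can consume this instead of running
         * the learning-cores pass. */
        pAE->LearningComplete = TRUE;
        return TRUE;
    }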
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2938 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 823 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 13168 bytes Desc: image003.gif URL:
From suman.p at samsung.com Mon Apr 11 08:52:29 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Mon, 11 Apr 2016 15:52:29 +0000 (GMT) Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Message-ID: <4C.A9.04907.D38CB075@epcpsbgx1.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604112123244_Y5W7Z1SF.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604112123250_QZUWXYH6.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604112123256_9220TQUP.gif Type: image/gif Size: 13168 bytes Desc: not available URL:
From suman.p at samsung.com Tue Apr 12 06:24:06 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Tue, 12 Apr 2016 13:24:06 +0000 (GMT) Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604121855671_Z5JE7EUA.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604121855679_LK7CT9SZ.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604121855685_BSL8PYMC.gif Type: image/gif Size: 13168 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Samsung_DiskInitPerfOpt_v3.7z Type: application/octet-stream Size: 145681 bytes Desc: not available URL:
From raymond.c.robles at intel.com Tue Apr 12 12:44:03 2016 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Tue, 12 Apr 2016 19:44:03 +0000 Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance In-Reply-To: References: Message-ID: <49158E750348AA499168FD41D8898360726B2D7D@fmsmsx117.amr.corp.intel.com> Thanks Suman! Tom, have you had a chance to review the patch? I'd like to close this patch out by the end of next week. Thanks!
From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Tuesday, April 12, 2016 6:24 AM To: Foster, Carolyn D; Thomas Freeman; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim; prakash.v at samsung.com; ANSHUL SHARMA; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi All, I am sending the updated patch incorporating feedback from Carolyn. The changes are listed below. The password is samsungnvme.
1. In NVMeRunningWaitOnLearnMapping(), if there are zero namespaces it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can directly start NVMeStartComplete.
2. In NVMeRunningWaitOnNamespaceReady(), in crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady (see the sketch below).
Please let me know if you have any questions. Thanks, Suman
------- Original Message ------- Sender : SUMAN PRAKASH B Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Apr 11, 2016 21:22 (GMT+05:30) Title : Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi Carolyn, Thanks for testing our patch. We tried the following:
a. Sequential read performance with a single worker and single queue depth, on a system with 32 cores, and did not observe a performance drop compared to R133.
b. Assuming that the IOs were getting scattered across multiple NUMA nodes and that the 10% drop could have come from remote memory access, we affinitized the performance tool to NUMA node 1, assuming that the IO queue is created on NUMA node 0. After affinitization, the application submits IO on node 1 and the driver processes the IOs on node 0. But still, we did not observe any performance drop.
As there are no major issues here, we will incorporate the following review comments and share the revised patch tomorrow:
a. nvmeStat.c @ line 784: if there are zero namespaces it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can directly start NVMeStartComplete.
b. nvmeStat.c @ line 899: in crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.
Thanks, Suman
------- Original Message ------- Sender : Foster, Carolyn D Date : Apr 08, 2016 02:42 (GMT+05:30) Title : RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi Suman, I have an update on our performance data. It appears that on systems with more than 32 and more than 64 cores there was no performance delta in the following workloads: 4 workers, 32 queue depth; 8 workers, 16 queue depth. We did observe what looks like a more consistent (and much smaller) 10% drop on a workload with a single worker and single queue depth, on systems with 32 cores. It seems that our initial results might have been flawed, and based on your comments and performance analysis there may be no major issue here. Carolyn
From: Foster, Carolyn D Sent: Wednesday, April 06, 2016 2:54 PM To: 'suman.p at samsung.com'; Thomas Freeman; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim; ANSHUL SHARMA; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com Subject: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi Suman, thank you for the clarification. I will confirm the rest of the workload details and have that information for you tomorrow. In the meantime I will also rerun our performance tests to confirm that the results are reproducible, and will run the tests without line 594 as you suggested. Carolyn
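[Editor's note] A rough, stand-alone illustration of the namespace-ready policy discussed in this exchange: skip the wait entirely when there are zero namespaces or in crash/hibernate (dump) mode, poll only namespaces that are attached and whose LUN slot is ONLINE, and retry the polling READ only while the device reports SC = 0x82 (namespace not ready). The types and function names below are simplified stand-ins, not the OFA driver's actual structures.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define SC_NAMESPACE_NOT_READY 0x82  /* "namespace not ready" status from the thread   */
    #define SC_INVALID_NAMESPACE   0x0b  /* status seen for detached/unsupported namespaces */

    typedef enum { NS_DETACHED = 0, NS_ATTACHED } NS_STATE;
    typedef enum { SLOT_OFFLINE = 0, SLOT_ONLINE } SLOT_STATUS;

    typedef struct {
        NS_STATE    nsState;
        SLOT_STATUS slotStatus;   /* zeroed-out LUN entries stay OFFLINE */
    } LUN_EXT;

    typedef struct {
        bool     dumpMode;        /* crash/hibernate (dump) path     */
        uint32_t visibleLuns;     /* number of LUN extension entries */
        LUN_EXT  lunExt[8];
    } DEV_EXT;

    /* Should the init state machine poll this LUN with a READ at all? */
    static bool ShouldPollLun(const DEV_EXT *pAE, uint32_t lun)
    {
        if (pAE->dumpMode || pAE->visibleLuns == 0)
            return false;                 /* go straight to NVMeStartComplete */
        return pAE->lunExt[lun].nsState == NS_ATTACHED &&
               pAE->lunExt[lun].slotStatus == SLOT_ONLINE;
    }

    /* Completion policy for the polling READ: keep retrying only while the
     * device still reports "namespace not ready"; any other status (e.g.
     * 0x0b for an unsupported LBA format) moves on to the next LUN. */
    static bool ShouldRetryRead(uint8_t statusCode)
    {
        return statusCode == SC_NAMESPACE_NOT_READY;
    }

    int main(void)
    {
        DEV_EXT ext = { .dumpMode = false, .visibleLuns = 2 };
        ext.lunExt[0] = (LUN_EXT){ NS_ATTACHED, SLOT_ONLINE  };  /* polled          */
        ext.lunExt[1] = (LUN_EXT){ NS_DETACHED, SLOT_OFFLINE };  /* skipped, no READ */

        printf("poll LUN 0: %d, poll LUN 1: %d\n",
               ShouldPollLun(&ext, 0), ShouldPollLun(&ext, 1));
        printf("retry on 0x82: %d, retry on 0x0b: %d\n",
               ShouldRetryRead(SC_NAMESPACE_NOT_READY),
               ShouldRetryRead(SC_INVALID_NAMESPACE));
        return 0;
    }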
From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Wednesday, April 06, 2016 9:01 AM To: Foster, Carolyn D; Thomas Freeman; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim; ANSHUL SHARMA; MANOJ THAPLIYAL; tru.nguyen at ssi.samsung.com Subject: Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi Carolyn, Thanks for the comments and suggestions. Please find my comments below:
1. Observed performance degradation, potentially related to line 594 in nvmeInit.c
a. We tested on servers with 32 and 64 logical processors, Windows 8.1 and 2012 R2 OS, and multiple vendor devices with both 1-to-1 and many-to-1 core-queue mappings, with both R133 and the latest drivers, and we did not observe any performance drop for 100% sequential read (128K) with 32 and 64 worker threads and queue depth 32. We have tested both the StorPortInitializePerfOpts() pass case and the fail case (learning cores will be executed).
b. Regarding MsgID-- in nvmeInit.c line number 594: an NVMe device supports a number of message IDs equal to 1 admin + N IO queues. But in the OFA driver, since message ID 0 is shared between the admin queue and an IO queue, only ((1 admin + N IO queues) - 1) message IDs are ever used. With MsgID--, we make sure all IO queues are created with a unique message ID and message ID 0 is shared between the admin queue and one IO queue (see the sketch below). If the device that you are testing has a total number of message IDs equal to 1 admin queue + N IO queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, could you please take the performance benchmark after removing MsgID--.
c. Can you please let us know the number of queues and the number of message IDs supported by the target device that you are testing with.
d. On servers, we usually test on Windows Server edition OSes. When we tested with Windows 8.1, we observed that the number of logical processors supported in Windows 8.1 is at most 32, even when the server has more than 32 logical processors. We will try to reproduce the performance degradation behavior; meanwhile, can you please provide us more debug data.
2. nvmeStat.c @ line 784: Agreed. We will change as suggested.
3. nvmeStat.c @ line 899: Agreed. We will change as suggested.
Thanks, Suman
------- Original Message ------- Sender : Foster, Carolyn D Date : Apr 06, 2016 05:33 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance
Hi Suman, I have a few comments and suggestions:
1. Observed performance degradation, potentially related to line 594 in nvmeInit.c - We noticed a degradation in performance on some systems and we suspect it is related to this change. If we don't share MSI-X vector 0 between the admin queue and an IO queue, we are creating one fewer queue to submit IO to. Did you execute any performance testing before and after these changes? I have included some details about the system configurations we tested and the observed results below.
2. nvmeStat.c @ line 784: if there are zero namespaces it is not necessary to go to NVMeWaitOnNamespaceReady; instead we can directly start NVMeStartComplete.
3. nvmeStat.c @ line 899: in crash/hibernate mode it is not necessary to go to NVMeWaitOnNamespaceReady.
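[Editor's note] A small stand-alone sketch of the message-ID accounting Suman describes in item 1.b above: because MSI-X message 0 is shared by the admin queue and the first I/O queue, the number of distinct message IDs the driver needs equals the number of I/O queues. The function names are illustrative only and do not come from the driver sources.

    #include <stdio.h>

    /* Message IDs needed when message 0 is shared by the admin queue and
     * the first I/O queue: one per I/O queue. */
    static unsigned MsgIdsRequired(unsigned ioQueues)
    {
        return ioQueues;   /* the admin queue rides on message 0 */
    }

    int main(void)
    {
        /* Example from the thread: cores = 32, admin + I/O queues = 1 + 8. */
        unsigned ioQueues = 8;

        printf("MsgID in NVMeEnumMsiMessages() should be %u\n",
               MsgIdsRequired(ioQueues));

        /* Message 0 -> admin queue + I/O queue 1; messages 1..7 -> I/O
         * queues 2..8, so every I/O queue still gets a unique vector. */
        for (unsigned msg = 0; msg < MsgIdsRequired(ioQueues); msg++) {
            printf("message %u -> I/O queue %u%s\n", msg, msg + 1,
                   msg == 0 ? " (shared with the admin queue)" : "");
        }
        return 0;
    }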
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From thomas.freeman at hgst.com Tue Apr 12 15:17:55 2016 From: thomas.freeman at hgst.com (Thomas Freeman) Date: Tue, 12 Apr 2016 22:17:55 +0000 Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance In-Reply-To: References: Message-ID: Hi Suman, HGST approves these changes. Thank you, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311
-------------- next part -------------- An HTML attachment was scrubbed... URL:
From carolyn.d.foster at intel.com Tue Apr 12 16:09:26 2016 From: carolyn.d.foster at intel.com (Foster, Carolyn D) Date: Tue, 12 Apr 2016 23:09:26 +0000 Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance In-Reply-To: References: Message-ID: Hi Suman, Intel approves the changes as well.
Thank you, Carolyn
Carolyn From: Foster, Carolyn D Sent: Wednesday, April 06, 2016 2:54 PM To: 'suman.p at samsung.com' >; Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, thank you for the clarification. I will confirm the rest of the workload details and have that information for you tomorrow. In the mean time I will also rerun our performance tests to confirm that the results are reproducible, and will run the tests without line 594 as you suggested. Carolyn From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Wednesday, April 06, 2016 9:01 AM To: Foster, Carolyn D >; Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Carolyn, Thanks for the comments and suggestions. Please find my comments below: 1. Observed performance degradation, potentially related to line 594 in nvmeInit.c a. We tested on Servers with 32 and 64 logical processors, Windows 8.1 and 2012 R2 OS, multiple vendor devices with both 1-to-1 core-queue mapping and many-to-1 core-queue mapping, with both R133 and latest drivers, and we did not observe any performance drop for 100% Sequential Read(128K) with 32 and 64 worker threads, and queue depth 32. We have tested both the StorPortInitializePerfOpts() pass case and fail case(learning cores will be executed). b. Regarding MsgID-- in nvmeInit.c line number 594, NVMe device supports number of msg ids equal to 1 admin + N IO queues. But in OFA driver, since the msg id 0 is shared between admin queue and io queue, always ((1 admin + N io queues) - 1) number of msgids is used. With MsgID--, we make sure all IO queues are created with unique msg id and msg id 0 is shared with admin queue and 1 io queue. If the device that you are testing has total number of msg ids equal to 1 admin queue + N io queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, can you please take the perf benchmark after removing MsgID--. c. Can you please let us know the number of queues and number of Messaged IDs supported by the target device that you are testing with. d. On servers, we usually test on Windows Server edition OSes. When we tested with Windows 8.1, we observed that the number of logical processors supported in Windows 8.1 is maximum 32, even when the server have more than 32 logical processors. We will try to reproduce the performance degradation behavior, meanwhile can you please provide us more debug data. 2. nvmeStat.c @ line 784 : Agreed. We will change as suggested. 3. nvmeStat.c @line 899 : Agreed. We will change as suggested. Thanks, Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 06, 2016 05:33 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, I have a few comments and suggestions: 1. Observed performance degradation, potentially related to line 594 in nvmeInit.c – We noticed on some systems we saw a degradation in performance and we suspect it’s related to this change. If we don’t share MSIX vector 0 between the admin queue and an IO queue we are creating one fewer queue to submit IO to. Did you execute any performance testing before and after these changes? 
I have included some details about the system configurations we tested and the observed results below. 2. nvmeStat.c @ line 784 : if there is zero namespace it is not necessary to go to NVMeWaitOnNamespaceReady , instead we can directly start NVMeStartComplete. 3. nvmeStat.c @line 899 : In crash/Hibernate mode it is not necessary to go to the NVMeWaitOnNamespaceReady. Performance configuration and data: OS: Windows 8.1 x64 Workload: 100% sequential Read Compared the OFA trunk to the Samsung patch Summary of observed results: • System with fewer than 32 logical CPU cores: o No delta in performance observed • System with between 32 and 64 cores: o 20%-50% drop in performance observed • System with more than 64 cores: o 30%-40% drop in performance observed Carolyn From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Monday, April 04, 2016 6:35 AM To: Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; anshul at samsung.com; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme 1. Moved the StorPortFreePool()from IoCompletionRoutine() to NvmeInitCallback() - NvmeWaitOnNamespaceReady. 2. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus willnot be ONLINE if lun extension is zero'ed out. 3. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. To Mandatory reviewers: Can we get feedback or approval for this patch before 7th April. Thanks, Suman ------- Original Message ------- Sender : SUMAN PRAKASH B> Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Mar 29, 2016 20:27 (GMT+05:30) Title : Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Tom, Thanks for the review comments. Please find my replies below: 1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. [Suman] Agreed. We will move the StorPortFreePool to NvmeInitCallback. 2.a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. [Suman] Following changes are made: a. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if lun extension is zero'ed out. b. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. Let me know if you have any questions. We will share the modified code once others share their feedback. Can we get feedback from other companies by 5th April? 
Thanks, Suman ------- Original Message ------- Sender : Thomas Freeman> Date : Mar 29, 2016 00:48 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, It looks good. I have a few comments here: 1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. 2. I ran into a few problems, here are the details: *My device configuration: I'm testing a device that supports NS management and it has multiple namespaces. Some of those namespaces are not attached. The format of some of those namespaces is not supported by the driver (e.g. LBA contains metadata) a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. i. If that lun is a detached namespace, the READ fails with a status code of 0xb. The driver attempts to retry until the READ is successful, but the command will never succeed. ii. During initialization, if the driver detects a namespace that is in an unsupported format, it zero's out that LUN entry, but leaves that zero’ed entry in the LUN extension list. When NVMeRunningWaitOnNamespaceReady is processing the list, it does not recognize this as a zero'ed out entry. Rather is attempts a READ from this namespace (the NSID is 0 since the init code zero'ed out that Lun list entry). The READ and all of its retries fail with a status code of 0xb. Proposed fix: 1. Before issuing a READ, ensure that namespace is attached and a valid format. If not, increment the counters and move to the next Lun. 2. Also, in NVMeInitCallback, when handling the case NVMeWaitOnNamespaceReady, instead of looking for an SC of 0x00, only issue a retry if the command fails with SC = 0x82 (NS not ready). Let me know if you have any questions. Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 [http://www.samsung.net/service/ml/AttachController/image001.jpg?cmd=downdirectly&filepath=/LOCAL/ML/CACHE/s/20160406/image001.jpg at 01D18E55.4C7BA130309suman.p&contentType=IMAGE/JPEG;charset=UTF8&msgno=309&partno=2&foldername=INBOX&msguid=48143] From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Wednesday, March 23, 2016 7:27 AM To: nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, This patch includes changes for optimizing the disk initialization performance and relevant changes. I have made a detailed overview of the changes in the attached doc file(the contents are also copied here below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman ****************** Disk Initialization Performance Optimization: We can use the StorPortInitializePerfOpts(), PERF_CONFIGURATION_DATA.MessageTargets which provides the array of MSI numbers corresponding to each logical processor. This is an alternative of using the Learning cores logic implemented in the OFA driver. 
Also this will directly reduce the time taken for the disk to be enumerated after a device hot insert. The current OFA driver does the following in its initialization path, say on a server with 32 logical processors and a device that supports 32 queues:
1. Identify controller
2. Identify namespace - for N namespaces
3. Set features - interrupt coalescing, number of queues, LBA range type
4. Create IO completion queue - 32 commands
5. Create IO submission queue - 32 commands
6. LearnMapping - 32 Read commands
7. ReSetupQueues - 32 Delete submission queues + 32 Delete completion queues
8. Create IO completion queue - 32 commands
9. Create IO submission queue - 32 commands
10. Complete initialization state machine
As can be observed, during disk initialization around 224 commands are processed to set up the IO queues and associate an MSI-X number with each queue. If we use StorPortInitializePerfOpts(), only 64 commands are required instead of 224. On a server which has 120 logical processors, 840 commands are required to set up the IO queues and associate an MSI-X number with each queue. If learning cores is avoided, only 240 commands are required instead of 840. Also, we can fall back to learning cores if the StorPortInitializePerfOpts() API fails. We see improved device up time after this change. Also, if the number of queues supported by the device is less than the number of logical processors, the driver does not execute learning cores, hence there won't be any improvement even if we use StorPortInitializePerfOpts().
Server with 32 logical processors (disk up time in seconds):
Columns: Disk capacity = 400 GB, Disk capacity = 1.6 TB, Disk from vendor 1, Disk from vendor 2
Rev 133: 14, 6.5, 14.5
Latest: 5, 5, 13.5
PS: data may change for different vendor SSDs
Code changes:
1. Changes w.r.t. StorPortInitializePerfOpts().
a. In NVMeInitialize(), moved the StorPortInitializePerfOpts() call after NVMeEnumMsiMessages() so that LastRedirectionMessageNumber can be set in StorPortInitializePerfOpts().
b. Set the flags STOR_PERF_INTERRUPT_MESSAGE_RANGES and STOR_PERF_ADV_CONFIG_LOCALITY, and the values FirstRedirectionMessageNumber, LastRedirectionMessageNumber and MessageTargets in StorPortInitializePerfOpts() to get the MSI-X-to-core mapping in MessageTargets. If this API returns success, learning cores can be skipped.
c. If StorPortInitializePerfOpts() fails, in NVMeMsiMapCores() the mapping of MSI-X to cores is assigned sequentially, and learning cores is executed. During learning cores, in IoCompletionRoutine(), the MSI-X-to-core mapping is re-mapped. If StorPortInitializePerfOpts() succeeds, in NVMeMsiMapCores() the mapping of MSI-X to cores is taken from MessageTargets and learning cores is skipped.
2. When learning cores is skipped, the controller initialization completes faster. But we have observed that on some devices the namespace is not ready to process I/O at this point. When the kernel sends I/O, the device returns SC = NAMESPACE_NOT_READY and the miniport returns SCSI_SENSEQ_BECOMING_READY, for which storport retries after some time. If the device takes too long to initialize the namespace, storport gives up and the disk shows as Uninitialized in disk management. Hence the controller initialization has to be completed only after the namespace is ready. For this, a new state is introduced in NVMeRunning(), which waits till the NS is ready.
a. Introduced a new state NVMeWaitOnNamespaceReady in NVMeRunning().
b. In IoCompletionRoutine(), determine which CQ to look in based on the WaitOnNamespaceReady state.
c. In NVMeInitCallback(), implemented the callback for NVMeWaitOnNamespaceReady.
d. In IoCompletionRoutine(), free the read buffer for namespace ready.
Note:
a. We have observed that higher capacity namespaces take too long to initialize, hence the passiveTimeout value in NVMePassiveInitialize() is not sufficient. We need to increase the timeout value based on vendor requirements.
b. Checking for namespace ready is skipped during dump/hibernation mode and resume from hibernation.
3. Usually, the number of MSI-X messages supported by the device and the MSI-X messages granted (StorPortGetMSIInfo) will be the number of IO queues + 1 admin queue. But we share the admin queue and the first I/O queue on core 0, and hence MSI-X 0 is shared between the admin queue and the first I/O queue. In case there are more active cores than queues supported, one of the message IDs should not be considered. Made changes in NVMeEnumMsiMessages() accordingly. For example, cores = 32, Admin + IO queues = 1 + 8, then MsgID (in NVMeEnumMsiMessages()) should be 8.
4. In IoCompletionRoutine(), for learning cores, only if MSIGranted is less than the number of active cores will the QueueNo be remapped sequentially. Otherwise QueueNo remains the same as before.
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2934 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.jpg Type: image/jpeg Size: 823 bytes Desc: image004.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: image005.gif Type: image/gif Size: 13168 bytes Desc: image005.gif URL: From suman.p at samsung.com Wed Apr 13 08:38:28 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Wed, 13 Apr 2016 15:38:28 +0000 (GMT) Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Message-ID: <84.07.04870.4F76E075@epcpsbgx3.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604132109846_Z5JE7EUA.jpg Type: image/jpeg Size: 2934 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604132109866_LK7CT9SZ.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604132109921_BSL8PYMC.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604132109931_4XEV4D4T.gif Type: image/gif Size: 13168 bytes Desc: not available URL: From raymond.c.robles at intel.com Wed Apr 13 13:30:58 2016 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Wed, 13 Apr 2016 20:30:58 +0000 Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance In-Reply-To: <84.07.04870.4F76E075@epcpsbgx3.samsung.com> References: <84.07.04870.4F76E075@epcpsbgx3.samsung.com> Message-ID: <49158E750348AA499168FD41D8898360726B5769@fmsmsx117.amr.corp.intel.com> Hi Suman, I’ll push the patch today. Please go ahead and prepare your next patch. Thanks! From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Wednesday, April 13, 2016 8:38 AM To: Foster, Carolyn D ; Thomas Freeman ; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim ; PRAKASH BABU VEMULA ; ANSHUL SHARMA ; MANOJ THAPLIYAL ; tru.nguyen at ssi.samsung.com Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance Thanks Tom and Carolyn. @Ray, since Intel and HGST has approved this patch, can we close this out. We will be ready with our next patch by early next week. - Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 13, 2016 04:39 (GMT+05:30) Title : RE: Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, Intel approves the changes as well. Thank you, Carolyn From: Thomas Freeman [mailto:thomas.freeman at hgst.com] Sent: Tuesday, April 12, 2016 3:18 PM To: suman.p at samsung.com; Foster, Carolyn D >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com; prakash.v at samsung.com Subject: RE: Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, HGST approves these changes. Thank you, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 [cid:image001.jpg at 01D19588.B5AD7630] From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Tuesday, April 12, 2016 8:24 AM To: Foster, Carolyn D >; Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com; prakash.v at samsung.com Subject: Re: Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi All, I am sending the updated patch incorporating feedback from Carolyn. 
The changes are listed below. The password is samsungnvme 1. In NVMeRunningWaitOnLearnMapping(), if there is zero namespace it is not necessary to go to NVMeWaitOnNamespaceReady , instead we can directly start NVMeStartComplete. 2. In NVMeRunningWaitOnNamespaceReady(), for crash/Hibernate mode it is not necessary to go to the NVMeWaitOnNamespaceReady. Please let me know if you have any questions. Thanks, Suman ------- Original Message ------- Sender : SUMAN PRAKASH B> Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Apr 11, 2016 21:22 (GMT+05:30) Title : Re: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Carolyn, Thanks for testing our patch. We tried the following: a. Sequential read performance on single worker, single queue depth, on system with 32 cores, and did not observe performance drop compared to R133. b. Assuming that the IOs are getting scattered across multiple NUMA nodes and due to remote memory access, the 10% drop could have been observed, we affinitized the performance tool on NUMA node 1, assuming that the IO queue is created on NUMA node 0. After affinitization, application submits IO on node 1 and driver processes the IOs in node 0. But still, we did not observe any performance drop. As there are no major issues here, we will incorporate the following review comments and share the revised patch tomorrow: a. nvmeStat.c @ line 784 : if there is zero namespace it is not necessary to go to NVMeWaitOnNamespaceReady , instead we can directly start NVMeStartComplete. b. nvmeStat.c @line 899 : In crash/Hibernate mode it is not necessary to go to the NVMeWaitOnNamespaceReady. Thanks, Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 08, 2016 02:42 (GMT+05:30) Title : RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, I have an update on our performance data. It appears on systems with more than 32 and more than 64 cores that there was no performance delta in the following workloads: 4 workers, 32 Queue Depth 8 workers, 16 Queue Depth We did observe what looks like a more consistent (and much smaller) 10% drop on workload with a single worker, single queue depth, on systems with 32 cores. It seems that our initial results might have been flawed and based on your comments and performance analysis, there may be no major issue here. Carolyn From: Foster, Carolyn D Sent: Wednesday, April 06, 2016 2:54 PM To: 'suman.p at samsung.com' >; Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: RE: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, thank you for the clarification. I will confirm the rest of the workload details and have that information for you tomorrow. In the mean time I will also rerun our performance tests to confirm that the results are reproducible, and will run the tests without line 594 as you suggested. Carolyn From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Wednesday, April 06, 2016 9:01 AM To: Foster, Carolyn D >; Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; ANSHUL SHARMA >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Carolyn, Thanks for the comments and suggestions. Please find my comments below: 1. 
Observed performance degradation, potentially related to line 594 in nvmeInit.c a. We tested on Servers with 32 and 64 logical processors, Windows 8.1 and 2012 R2 OS, multiple vendor devices with both 1-to-1 core-queue mapping and many-to-1 core-queue mapping, with both R133 and latest drivers, and we did not observe any performance drop for 100% Sequential Read(128K) with 32 and 64 worker threads, and queue depth 32. We have tested both the StorPortInitializePerfOpts() pass case and fail case(learning cores will be executed). b. Regarding MsgID-- in nvmeInit.c line number 594, NVMe device supports number of msg ids equal to 1 admin + N IO queues. But in OFA driver, since the msg id 0 is shared between admin queue and io queue, always ((1 admin + N io queues) - 1) number of msgids is used. With MsgID--, we make sure all IO queues are created with unique msg id and msg id 0 is shared with admin queue and 1 io queue. If the device that you are testing has total number of msg ids equal to 1 admin queue + N io queues, then MsgID-- should not be a problem. But if you strongly feel that MsgID-- could be an issue, can you please take the perf benchmark after removing MsgID--. c. Can you please let us know the number of queues and number of Messaged IDs supported by the target device that you are testing with. d. On servers, we usually test on Windows Server edition OSes. When we tested with Windows 8.1, we observed that the number of logical processors supported in Windows 8.1 is maximum 32, even when the server have more than 32 logical processors. We will try to reproduce the performance degradation behavior, meanwhile can you please provide us more debug data. 2. nvmeStat.c @ line 784 : Agreed. We will change as suggested. 3. nvmeStat.c @line 899 : Agreed. We will change as suggested. Thanks, Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 06, 2016 05:33 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, I have a few comments and suggestions: 1. Observed performance degradation, potentially related to line 594 in nvmeInit.c – We noticed on some systems we saw a degradation in performance and we suspect it’s related to this change. If we don’t share MSIX vector 0 between the admin queue and an IO queue we are creating one fewer queue to submit IO to. Did you execute any performance testing before and after these changes? I have included some details about the system configurations we tested and the observed results below. 2. nvmeStat.c @ line 784 : if there is zero namespace it is not necessary to go to NVMeWaitOnNamespaceReady , instead we can directly start NVMeStartComplete. 3. nvmeStat.c @line 899 : In crash/Hibernate mode it is not necessary to go to the NVMeWaitOnNamespaceReady. 
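As a side note, the short-circuit being agreed on in comments 2 and 3 amounts to something like the sketch below. The enum values mirror the state names used in this thread, but the struct and field names (lunCount, dumpMode) are placeholders invented for illustration; the actual OFA state machine in nvmeStat.c uses its own types.

    /* Hypothetical illustration only -- not the OFA driver code. */
    typedef enum _INIT_STATE {
        NVMeWaitOnNamespaceReady,
        NVMeStartComplete
    } INIT_STATE;

    typedef struct _INIT_CONTEXT {
        unsigned long lunCount;   /* namespaces exposed as LUNs           */
        int           dumpMode;   /* crash-dump / hibernation environment */
    } INIT_CONTEXT;

    /* Decide whether the init state machine still needs to poll for
     * namespace readiness after learning/queue setup has finished. */
    static INIT_STATE NextStateAfterLearnMapping(const INIT_CONTEXT *ctx)
    {
        if (ctx->lunCount == 0) {
            /* Zero namespaces: nothing to wait for, finish initialization. */
            return NVMeStartComplete;
        }
        if (ctx->dumpMode) {
            /* Crash/hibernate path: skip the namespace-ready wait entirely. */
            return NVMeStartComplete;
        }
        return NVMeWaitOnNamespaceReady;
    }

The real change lives in NVMeRunningWaitOnLearnMapping()/NVMeRunningWaitOnNamespaceReady(); this only illustrates the decision being made.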
Performance configuration and data: OS: Windows 8.1 x64 Workload: 100% sequential Read Compared the OFA trunk to the Samsung patch Summary of observed results: • System with fewer than 32 logical CPU cores: o No delta in performance observed • System with between 32 and 64 cores: o 20%-50% drop in performance observed • System with more than 64 cores: o 30%-40% drop in performance observed Carolyn From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Monday, April 04, 2016 6:35 AM To: Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; anshul at samsung.com; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: Re: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme 1. Moved the StorPortFreePool()from IoCompletionRoutine() to NvmeInitCallback() - NvmeWaitOnNamespaceReady. 2. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus willnot be ONLINE if lun extension is zero'ed out. 3. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. To Mandatory reviewers: Can we get feedback or approval for this patch before 7th April. Thanks, Suman ------- Original Message ------- Sender : SUMAN PRAKASH B> Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Mar 29, 2016 20:27 (GMT+05:30) Title : Re: RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Tom, Thanks for the review comments. Please find my replies below: 1. nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. [Suman] Agreed. We will move the StorPortFreePool to NvmeInitCallback. 2.a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. [Suman] Following changes are made: a. In NVMeRunningWaitOnNamespaceReady(), the READ command will be sent only when the NS is ATTACHED and the pLunExt->slotStatus is ONLINE. pLunExt->slotStatus will not be ONLINE if lun extension is zero'ed out. b. In NvmeInitCallBack(), in case NVMeWaitOnNamespaceReady, the READ will be retried only if SC = 0x82, else move to the next NS. If the NS LBA format is unsupported, miniport sends the READ command, for which device will return SC=0xb, and miniport will move to next NS. Let me know if you have any questions. We will share the modified code once others share their feedback. Can we get feedback from other companies by 5th April? Thanks, Suman ------- Original Message ------- Sender : Thomas Freeman> Date : Mar 29, 2016 00:48 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi Suman, It looks good. I have a few comments here: 1. 
nvmeStd.c::IoCompletionRoutine, when checking for NVMeWaitOnNamespaceReady, would it be better to make this check and free the buffer in NvmeInitCallback (when processing NVMeWaitOnNamespaceReady). The check in IoCompletionRoutine is executed during the processing of every IO command, but it will only ever be TRUE during initialization. 2. I ran into a few problems, here are the details: *My device configuration: I'm testing a device that supports NS management and it has multiple namespaces. Some of those namespaces are not attached. The format of some of those namespaces is not supported by the driver (e.g. LBA contains metadata) a. With your change, in the method NVMeRunningWaitOnNamespaceReady the driver picks the next lun in the list and issues a READ to that namespace. i. If that lun is a detached namespace, the READ fails with a status code of 0xb. The driver attempts to retry until the READ is successful, but the command will never succeed. ii. During initialization, if the driver detects a namespace that is in an unsupported format, it zero's out that LUN entry, but leaves that zero’ed entry in the LUN extension list. When NVMeRunningWaitOnNamespaceReady is processing the list, it does not recognize this as a zero'ed out entry. Rather is attempts a READ from this namespace (the NSID is 0 since the init code zero'ed out that Lun list entry). The READ and all of its retries fail with a status code of 0xb. Proposed fix: 1. Before issuing a READ, ensure that namespace is attached and a valid format. If not, increment the counters and move to the next Lun. 2. Also, in NVMeInitCallback, when handling the case NVMeWaitOnNamespaceReady, instead of looking for an SC of 0x00, only issue a retry if the command fails with SC = 0x82 (NS not ready). Let me know if you have any questions. Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 [http://www.samsung.net/service/ml/AttachController/image001.jpg?cmd=downdirectly&filepath=/LOCAL/ML/CACHE/s/20160406/image001.jpg at 01D18E55.4C7BA130309suman.p&contentType=IMAGE/JPEG;charset=UTF8&msgno=309&partno=2&foldername=INBOX&msguid=48143] From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Wednesday, March 23, 2016 7:27 AM To: nvmewin at lists.openfabrics.org Cc: Seokhwan Kim >; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for Optimizing disk initialization performance Hi all, This patch includes changes for optimizing the disk initialization performance and relevant changes. I have made a detailed overview of the changes in the attached doc file(the contents are also copied here below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman ****************** Disk Initialization Performance Optimization: We can use the StorPortInitializePerfOpts(), PERF_CONFIGURATION_DATA.MessageTargets which provides the array of MSI numbers corresponding to each logical processor. This is an alternative of using the Learning cores logic implemented in the OFA driver. Also this will directly reduce the time taken for the disk to be enumerated after a device hot insert. The current OFA driver does the following in its initialization path, let’s say on a server which has 32 logical processors and device which supports 32 queues – 1. Identify controller 2. Identify namespace - for N number of namespaces 3. 
Set features - Interrupt coalescing, number of queues, lba range type. 4. Create IO completion queue - 32 commands 5. Create IO submission queue - 32 commands 6. LearnMapping - 32 Read commands 7. ReSetupQueues - 32 Delete Sub queues + 32 Delete completion queues 8. Create IO completion queue - 32 commands 9. Create IO submission queue - 32 commands 10. Complete initialization state machine As can be observed, during disk initialization, around 224 commands are processed for setting up the IO queues and associate the MSI-x number to each queues. If we use StorPortInitializePerfOpts(), we required only 64 commands instead of 224 commands. On a server which as 120 logical processors, 840 commands are required for setting up the IO queues and associate the MSI-x number to each queues. If learning cores is avoided, only 240 commands are required instead of 840 commands. Also we can fall back to learning cores if the API StorPortInitiailzePerfOpts() fails. We see improved device up time after this change. Also, if the number of queues supported by the device is less than the number of logical processors, the driver does not execute the learning cores, hence there won’t be any improvement even if we use StorPortInitializePerfOpts(). Server with 32 logical processors: OFA version disk up time in seconds Disk capacity = 400 GB Disk capacity = 1.6 TB Disk from vendor 1 Disk from vendor 2 Rev 133 14 6.5 14.5 Latest 5 5 13.5 PS: data may change for different vendor SSDs Code changes: 1. Changes w.r.t StorPortInitializePerfOpts(). a. In NVMeInitialize(), moved the StorPortInitializePerfOpts() after NVMeEnumMsiMessages() to set the LastRedirectionMessageNumber in StorPortInitializePerfOpts(). b. Set the flags STOR_PERF_INTERRUPT_MESSAGE_RANGES and STOR_PERF_ADV_CONFIG_LOCALITY, and values FirstRedirectionMessageNumber, LastRedirectionMessageNumber and MessageTargets in StorPortInitializePerfOpts() to get the MSIx-Core mapping in MessageTargets. If this API returns success, the learning cores can be skipped. c. If the StorPortInitializePerfOpts() fails, in NVMeMsiMapCores(), the mapping of msix to cores in assigned sequentially, and learning cores is executed. During learning cores, in IoCompletionRoutine(), the msix to core is re-mapped. If the StorPortInitializePerfOpts() succeeds, in NVMeMsiMapCores(), the mapping of msix to cores is taken from MessageTargets and learnig cores is skipped. 2. When the learning cores is skipped, the controller initialization completes faster. But we have observed that in some devices, the Namespace is not ready to process I/O at this point. And when kernel send I/O, the device returns SC = NAMESPACE_NOT_READY and miniport returns SCSI_SENSEQ_BECOMING_READY, for which storport retries after some time. If the device takes too long to initialize the namespace, the storport gives up and shows as Uninitialized in the disk mgmt. Hence the controller initialization has to be completed after Namespace is ready. For this, a new state is introduce in the NVMeRunning(), which waits till the NS is ready. a. Introduced a new state NVMeWaitOnNamespaceReady in NVMeRunning(). b. In IoCompletionRoutine(), determine which CQ to look in based on WaitOnNamespaceReady state. c. In NVMeInitCallback(), implemented call back for NVMeWaitOnNamespaceReady. d. In IoCompletionRoutine(), free the read buffer for namespaceready. Note: a. We have observed that higher capacity Namespaces take too long to initialize, hence the passiveTimeout value in NVMePassiveInitialize() is not sufficient. 
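As an aside for anyone reading the description above, the StorPortInitializePerfOpts() call pattern in code change 1 looks roughly like the sketch below. This is illustrative only, not the patch itself: the function name, msgCount and the ownership of the MessageTargets array are assumptions, and error handling is trimmed.

    #include <storport.h>

    /* Sketch: query Storport for the supported perf options, then ask for
     * message-range redirection and per-message affinity.  MessageTargets
     * must point at msgCount GROUP_AFFINITY entries owned by the caller
     * (e.g. in the device extension). */
    BOOLEAN ConfigurePerfOpts(PVOID HwDeviceExtension,
                              PGROUP_AFFINITY MessageTargets,
                              ULONG msgCount)
    {
        PERF_CONFIGURATION_DATA perf = {0};
        ULONG status;

        perf.Version = STOR_PERF_VERSION;
        perf.Size = sizeof(PERF_CONFIGURATION_DATA);

        /* First call with Query == TRUE returns the options this Storport supports. */
        status = StorPortInitializePerfOpts(HwDeviceExtension, TRUE, &perf);
        if (status != STOR_STATUS_SUCCESS ||
            (perf.Flags & STOR_PERF_ADV_CONFIG_LOCALITY) == 0) {
            return FALSE;   /* fall back to the learning-cores pass */
        }

        /* Message ranges and MessageTargets are honored together with DPC redirection. */
        perf.Flags = STOR_PERF_DPC_REDIRECTION |
                     STOR_PERF_INTERRUPT_MESSAGE_RANGES |
                     STOR_PERF_ADV_CONFIG_LOCALITY;
        perf.FirstRedirectionMessageNumber = 0;           /* exact range depends on how   */
        perf.LastRedirectionMessageNumber = msgCount - 1; /* the admin vector is shared   */
        perf.MessageTargets = MessageTargets;             /* one GROUP_AFFINITY per message */

        status = StorPortInitializePerfOpts(HwDeviceExtension, FALSE, &perf);

        /* On success, MessageTargets[] holds the MSI-X message to processor
         * affinity and the learning-cores step can be skipped. */
        return (status == STOR_STATUS_SUCCESS);
    }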
We need to increase the timeout value based on vendor requirements. b. b. Checking for Namespace ready is skipped during dump/hibernation mode and resume from hibernation. 3. Usually, the number of MSIx supported by device and MSIx granted(StorPortGetMSIInfo) will be number of IO Queue + 1 Admin Queue. But, we share the Admin queue and first I/O queue in core 0, and hence MSIx 0 is shared between admin queue and first I/O queue. Incase of active cores more than Queues supported, one of the MSGID should not be considered. Made changes in In NVMeEnumMsiMessages() accordingly. For example, cores = 32, Admin + IO queues = 1 + 8, then MsgID(in NVMeEnumMsiMessages()) should be 8. 4. In IoCompletionRoutine(), for learning cores, only if MSIGranted is less than active cores, the QueueNo will be remapped in sequential manner. Otherwise QueueNo remains same as before. [Image removed by sender.] Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. [cid:image004.gif at 01D19588.B5AD7630] [http://ext.samsung.net/mailcheck/SeenTimeChecker?do=9de5907ae3594b95091fc832a142fdf10677d2372ea55e697d9badbdf7e30042d1afaaba7860cdcd9564217c646641ad61e16949eaa607501b20909a04efd4d2748cfe1d4e847419cf878f9a26ce15a0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2934 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 823 bytes Desc: image003.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.gif Type: image/gif Size: 13168 bytes Desc: image004.gif URL: From suman.p at samsung.com Tue Apr 19 06:45:25 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Tue, 19 Apr 2016 13:45:25 +0000 (GMT) Subject: [nvmewin] Patch with changes for disk Read only support Message-ID: <2A.D8.04892.57636175@epcpsbgx2.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: NVMe Disk End of Life support.docx Type: application/octet-stream Size: 16216 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Samsung_DiskReadOnlySupport_v1.7z Type: application/octet-stream Size: 146940 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: DiskMgmt.jpg Type: application/octet-stream Size: 8246 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: FileCopy.jpg Type: application/octet-stream Size: 12760 bytes Desc: not available URL:

From thomas.freeman at hgst.com Wed Apr 20 09:29:55 2016 From: thomas.freeman at hgst.com (Thomas Freeman) Date: Wed, 20 Apr 2016 16:29:55 +0000 Subject: [nvmewin] Patch with changes for disk Read only support In-Reply-To: <2A.D8.04892.57636175@epcpsbgx2.samsung.com> References: <2A.D8.04892.57636175@epcpsbgx2.samsung.com> Message-ID:

Hi Suman, After reviewing the code, I have a few questions/comments:
1. For mode sense with 0x3f, you set the WP bit in the response header. Shouldn't WP also be set if any of those pages (0x8, 0xa, 0x1a, 0x1c) are individually requested?
2. snti.c:8242 Lun = pLunExt->namespaceId - 1; pDevExt->pLunExtensionTable[Lun]->IsNamespaceReadOnly = TRUE; These 2 lines can be replaced with pLunExt->IsNamespaceReadOnly = TRUE; Also, the original code is not a reliable way to determine the LUN id. Here is an example where this doesn't work. The device has attached namespaces 1, 3 & 4 and existing namespaces 1, 2, 3 & 4. LUNs 0-3 will correspond to namespaces 1, 3, 4, 2. For namespace 3, the calculation NSID-1=lun will incorrectly give you a LUN id of 2.
3. Along with the new member, IsNamespaceReadOnly, the nvme_lun_extension also has ReadOnly. It seems like the setting of WP should take into account the value of both members.
4. If the NVMe command Get Log Page fails (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is not freed. This corresponds to the allocation at snti.c:6530.
5. snti.c:6539: I think the following can be eliminated. The same copy occurs during SntiTranslateModeSenseResponse - snti.c:8272. if (GET_DATA_BUFFER(pSrb) != NULL) { StorPortCopyMemory((PVOID)GET_DATA_BUFFER(pSrb), (PVOID)(pSrbExt->modeSenseBuf), GET_DATA_LENGTH(pSrb)); }
Let me know if you have questions, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311

From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Tuesday, April 19, 2016 8:45 AM To: nvmewin at lists.openfabrics.org Cc: sukka.kim at samsung.com; prakash.v at samsung.com; anshul at samsung.com; MANOJ THAPLIYAL ; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for disk Read only support

Hi all, This patch includes changes for supporting NVMe disk read only mode. I have made a detailed overview of the changes in the attached doc file (the contents are also copied here below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman
******************
NVMe Disk End of Life support: Whenever an NVMe disk exhausts its P/E cycles, the disk becomes Read only (reaches End Of Life). In this case, the user should be able to read the data from the disk for backup or migration purposes. To achieve this, the driver should inform the kernel that the disk has become read only. If the driver does not inform the kernel, the disk will be unusable from Windows. The device has to be detected as Read only in the following 2 scenarios -
a. Detection during device hot plug
When a Read only device is hot inserted, the kernel should be able to enumerate the device as Read only and alert the user accordingly. When the SSD is hot inserted, as part of the disk initialization process, a SCSI mode sense command with page code 'Return all pages' (0x3f) is requested by the kernel.
The mode page has a mode parameter header, which has a WP bit in the 'Device specific Parameter' field that indicates whether the device is Write Protected for some reason. We can make use of this field to report to the kernel that the device has become Read only. When the miniport driver receives this request, the NVM Express command Get Log Page is built with log identifier 'SMART / Health Information' (0x2) and sent to the device. The SMART data has a 'Critical Warning' field in which a bit 'MediaInReadOnlyMode' is set whenever the media becomes Read only. So if the device returns SMART data with this bit set, the miniport driver sets the Device specific parameter WP bit in the mode parameter header and completes the command. When the WP bit is set in the mode parameter header, the kernel will understand that the device is Write protected and hence the kernel will not send any more write requests.
b. Detection during run time
When the device is in use and its write endurance is exhausted and the device becomes Read only, the kernel has to immediately report to the user that the device has become write protected. To achieve this, whenever the device receives an NVMe Write request after it has become Read only, the device sets SCT to Command Specific Status and SC to 'Attempted Write to Read Only Range' in response to the write command. For this, the following sense data is returned for the corresponding SCSI write command: Sense key - SCSI_SENSE_DATA_PROTECT, ASC - SCSI_ADSENSE_WRITE_PROTECT and ASCQ - SCSI_ADSENSE_NO_SENSE. With this sense data, the kernel will understand that the device is in a Write protected state, for which the Mode sense command with mode page 'Return all pages' will be sent to the device. Again, with the NVM Express Get Log Page - SMART command, the miniport driver will return the mode sense 'Device specific parameter' accordingly.
Code changes:
1. In SntiReturnAllModePages(), build the get log page for SMART/health information and send it to the device.
2. In SntiTranslateModeSenseResponse(), for log page MODE_SENSE_RETURN_ALL, set the Write protect bit in the device specific parameter in the mode header based on the media in read only mode bit (bit 3) of the critical warning field returned in the SMART/health log page.
3. The checking for volatile write cache is moved from SntiReturnAllModePages() to SntiTranslateModeSenseResponse(), after successful completion of the get log page command.
We have tested the following:
a. On a Read only NVMe SSD, install the OFA driver with these changes. In the disk management tool, the status of the disk is shown as Read Only. Please find attached "DiskMgmt.jpg" (sometimes requires a system restart after driver installation).
b. Hot insert a RO NVMe SSD and observe the status as Read Only in the disk management tool.
c. On an NVMe SSD which has a low % of available spare (for example 10%), execute the IOMeter tool with write commands. When the available spare reaches 0%, the error count in IOMeter starts increasing (i.e. write commands fail with the sense data explained in the sections above), and the status becomes Read Only in the disk management tool.
d. After the disk becomes RO, when we try to copy files to the RO drive, Windows shows the message "The disk is write protected". Please find attached "FileCopy.jpg"
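A minimal sketch of the WP-bit decision described in code changes 1 and 2 above, written against raw byte offsets rather than the driver's snti.c structures (which are not shown in this thread); the helper name and macros are made up for illustration.

    #include <stdint.h>

    #define NVME_CRIT_WARN_MEDIA_READ_ONLY  (1u << 3)  /* Critical Warning, bit 3 */
    #define SCSI_MODE_HDR_DEVICE_SPECIFIC   2          /* byte 2 of the header    */
    #define SCSI_DEVICE_SPECIFIC_WP         0x80       /* write-protect bit       */

    /* smartLog points at the SMART/Health log page payload; modeParamHeader
     * points at the 4-byte mode parameter header built for a MODE SENSE(6). */
    static void SetWriteProtectFromSmart(const uint8_t *smartLog,
                                         uint8_t *modeParamHeader)
    {
        /* Byte 0 of the SMART/Health log page is the Critical Warning field. */
        if (smartLog[0] & NVME_CRIT_WARN_MEDIA_READ_ONLY) {
            modeParamHeader[SCSI_MODE_HDR_DEVICE_SPECIFIC] |= SCSI_DEVICE_SPECIFIC_WP;
        }
    }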
Note:
a. As per NVMe spec 1.2, section 5.10.1.2, "There is not namespace specific information defined in the SMART / Health log page in this revision, thus the global log page and namespaces specific log page contain identical information". So when testing with multiple namespaces, when one namespace becomes RO, all the namespaces will become RO. The spec would have to define separate SMART/Health data per namespace.
b. For testing, if there is no NVMe SSD in the RO state, the following changes can be made in the driver to test this feature:
1. In SntiTranslateModeSenseResponse(), hardcode pNvmeLogPage->CriticalWarning.MediaInReadOnlyMode to 1, before checking for the value. This can also be done per namespace.
2. In SntiMapCompletionStatus(), for the NVMe write command, hardcode statusCodeType to COMMAND_SPECIFIC_ERRORS and statusCode to 0x82. This can also be done per namespace.
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ~WRD000.jpg Type: image/jpeg Size: 823 bytes Desc: ~WRD000.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL:

From suman.p at samsung.com Thu Apr 21 07:54:33 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Thu, 21 Apr 2016 14:54:33 +0000 (GMT) Subject: [nvmewin] Patch with changes for disk Read only support Message-ID: <8D.6D.04892.9A9E8175@epcpsbgx2.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604212025909_Z5JE7EUA.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604212025916_LK7CT9SZ.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604212025923_BSL8PYMC.gif Type: image/gif Size: 13168 bytes Desc: not available URL:

From thomas.freeman at hgst.com Thu Apr 21 09:18:08 2016 From: thomas.freeman at hgst.com (Thomas Freeman) Date: Thu, 21 Apr 2016 16:18:08 +0000 Subject: [nvmewin] Patch with changes for disk Read only support In-Reply-To: <6D.6D.04892.8A9E8175@epcpsbgx2.samsung.com> References: <6D.6D.04892.8A9E8175@epcpsbgx2.samsung.com> Message-ID:

Hi Suman, Thank you for the quick response. I agree with your comments. Thank you, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311

From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Thursday, April 21, 2016 9:55 AM To: Thomas Freeman ; nvmewin at lists.openfabrics.org Cc: anshul at samsung.com; prakash.v at samsung.com; MANOJ THAPLIYAL Subject: Re: RE: [nvmewin] Patch with changes for disk Read only support

Hi Tom, Thanks for the comments. Please find my replies below: 1. For mode sense with 0x3f, you set the WP in the response header.
Shouldn WP also be set if any of those pages(0x8, 0xa, 0x1a, 0x1c) are individually requested? [Suman] Below is our observations: a. During driver install/device enable from dev manager/hot insert, the mode pages 0x8 and 0x3f are invoked. b. During online/offline of disk, only 0x3f is invoked. So as per our understanding, during disk initialization, not all the mode pages will be called. But driver gets the mode page 0x3f every time during disk initialization. Also for Detection during run time, when driver returns SCSI_SENSE_DATA_PROTECT for sense data, driver gets the mode page 0x3f consistently. So we feel, setting the WP bit for mode page 0x3f will suffice. 2. snti.c:8242 [Suman] Agreed. We will use pLunExt->IsNamespaceReadOnly = TRUE; 3. Along with the new member, IsNamespaceReadOnly, the nvme_lun_extension also has ReadOnly. It seems like the setting of WP should take into account the value of both members. [Suman] The OFA driver supports only the first lba range type for a namespace, though spec supports 64 lba range types per NS. This has to be corrected first. Also we have to decide if the disk should be exposed as Read Only if any of the LBA range type is read only or only if the LBA 0 is read only. I feel this should be taken as a separate patch since this involves too many changes. 4. If the NVMe command Get Log Page fails, (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is not freed. This corresponds to the allocation at snti.c:6530. [Suman] Agreed. 5. snti.c:6539: I think the following can be eliminated. The same copy occurs during SntiTranslateModeSenseResponse - snti.c:8272. [Suman] Agreed. Please let us know your opinion. Regards, Suman ------- Original Message ------- Sender : Thomas Freeman Date : Apr 20, 2016 21:59 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for disk Read only support Hi Suman, After reviewing the code, I have a few questions/comments: 1. For mode sense with 0x3f, you set the WP in the response header. Shouldn WP also be set if any of those pages(0x8, 0xa, 0x1a, 0x1c) are individually requested? 2. snti.c:8242 Lun = pLunExt->namespaceId - 1; pDevExt->pLunExtensionTable[Lun]->IsNamespaceReadOnly = TRUE; These 2 lines can be replaced with pLunExt->IsNamespaceReadOnly = TRUE; Also, the original code is not a reliable way to determine the LUN id. Here is an example where there doesn't work. The device has attached namespaces 1,3 & 4 and Existing namespaces of 1, 2, 3 & 4. LUNs 0-3 will correspond to namespaces 1,3,4,2. For namespace 3, the calculation NSID-1=lun will incorrectly give you LUNid of 2. 3. Along with the new member, IsNamespaceReadOnly, the nvme_lun_extension also has ReadOnly. It seems like the setting of WP should take into account the value of both members. 4. If the NVMe command Get Log Page fails, (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is not freed. This corresponds to the allocation at snti.c:6530. 5. snti.c:6539: I think the following can be eliminated. The same copy occurs during SntiTranslateModeSenseResponse - snti.c:8272. if (GET_DATA_BUFFER(pSrb) != NULL) { StorPortCopyMemory((PVOID)GET_DATA_BUFFER(pSrb), (PVOID)(pSrbExt->modeSenseBuf), GET_DATA_LENGTH(pSrb)); } Let me know if you have questions, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. 
+1-507-322-2311 [cid:image002.jpg at 01D19BBF.789032F0] From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Tuesday, April 19, 2016 8:45 AM To: nvmewin at lists.openfabrics.org Cc: sukka.kim at samsung.com; prakash.v at samsung.com; anshul at samsung.com; MANOJ THAPLIYAL ; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for disk Read only support Hi all, This patch includes changes for supporting NVMe Disk read only mode. I have made a detailed overview of the changes in the attached doc file(the contents are also copied here below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman ****************** NVMe Disk End of Life support: Whenever NVMe disk exhausts the P/E cycles, the disk become Read only(reaches End Of Life). In this case, the user should be able to read the data from the disk for backup or migration purpose. To achieve this, the driver should inform the kernel that disk has become read only. If driver does not inform the kernel, the disk will be unusable from Windows. The device has to be detected as Read only in following 2 scenarios – a. Detection during device hot plug When a Read only device is hot inserted, the kernel should be able to enumerate the device as Read only and alert the user accordingly. When the SSD is hot inserted, as part of disk initialization process, a SCSI mode sense command with page code ‘Return all pages’ (0x3f) is requested by the kernel. The mode page has a mode parameter header, which has a WP bit in the 'Device specific Parameter' field which indicates if the device is Write Protected for some reason. We can make use of this field to report to the kernel that the device has become Read only. When the miniport driver receives this request, the NVM Express command Get log page is built with log identifier 'SMART / Health Information' (0x2) and send to the device. The SMART data has a 'Critical Warning' field in which a bit 'MediaInReadOnlyMode' is set whenever the media becomes Read only. So if the device returns SMART data with this bit set, the miniport driver sets the Device specific parameter – WP bit in mode parameter header and completes the command. When the WP bit is set in the mode parameter header, the kernel will understand that the device is Write protected and hence kernel will not send any more write requests. b. Detection during run time When the device is in use and the Write exhausts and device becomes Read only, the kernel has to immediately report to the user that device has become write protected. To achieve this, whenever the device receives a NVMe Write request after it has become Read only, the device sets SCT to Command Specific Status and SC to 'Attempted Write to Read Only Range' in response to the write command. For this the following sense data is returned for the corresponding SCSI write command. Sense data – SCSI_SENSE_DATA_PROTECT, ASC – SCSI_ADSENSE_WRITE_PROTECT and ASCQ – SCSI_ADSENSE_NO_SENSE. With this sense data, the kernel will understand that the device is in Write protected state for which the Mode sense command with mode page 'Return all pages' will be send to the device. Again with the NVM Express Get log page – SMART command, the miniport driver will return the mode sense 'Data Specific parameter' accordingly. Code changes: 1. In SntiReturnAllModePages(), build get log page for SMART/health information and send to device. 2. 
In SntiTranslateModeSenseResponse(), for log page MODE_SENSE_RETURN_ALL, set the Write protect bit in device specific parameter in the mode header based on the media in read only mode bit(03) in critical warning field returned in SMART/health log page. 3. The checking for volatile write cache is moved from SntiReturnAllModePages() to SntiTranslateModeSenseResponse() after successful completion of get log page command. We have tested the following: a. On a Read only NVMe SSD, install OFA driver with these changes. In the disk management tool, the status of disk is shown as Read Only. Please find attached “DiskMgmt.jpg” (sometimes requires a system restart after driver installation). b. Hot insert a RO NVMe SSD and observe status as Read Only in disk management tool. c. On NVMe SSD, which has less % of available spare(for example 10%), execute io meter tool with write commands. When available spare reaches 0%, the error count in io meter tools starts increasing(i.e. write commands fails with the sense data, as explained in above sections), and status becomes Read Only in disk management tool. d. After disk becomes RO, when we try to copy files to the RO drive, Windows show message "The disk is write protected". Please find attached “FileCopy.jpg” Note: a. As per NVMe spec 1.2, section 5.10.1.2, "There is not namespace specific information defined in the SMART / Health log page in this revision, thus the global log page and namespaces specific log page contain identical information". So when testing with multi namespace, when 1 namespace becomes RO, all the namespace will become RO. Spec has to be defined to have separate SMART /Health data per namespace. b. For testing, if there is no NVMe SSD which is in RO state, the following changes can to be made in the driver to test this feature: 1. In SntiTranslateModeSenseResponse(), hardcode pNvmeLogPage->CriticalWarning.MediaInReadOnlyMode to 1, before checking for the value. This can be done for per namespace also. 2. In SntiMapCompletionStatus(), for NVMe write command, hardcode statusCodeType to COMMAND_SPECIFIC_ERRORS and statusCode to 0x82. This can be done for per namespace also. [Image removed by sender.] Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. [cid:image003.gif at 01D19BBF.789032F0] [Image removed by sender.] Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer: This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system. 
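For reference, the run-time detection described above reduces to a status translation along the following lines. This is a sketch rather than the actual SntiMapCompletionStatus() change; the macro names simply mirror the constants mentioned in the thread, with their usual NVMe 1.2 / SPC values.

    #include <stdint.h>

    #define NVME_SCT_COMMAND_SPECIFIC        0x1   /* Status Code Type            */
    #define NVME_SC_WRITE_TO_READ_ONLY_RANGE 0x82  /* Attempted Write to RO Range */

    #define SCSI_SENSE_DATA_PROTECT          0x07
    #define SCSI_ADSENSE_WRITE_PROTECT       0x27
    #define SCSI_ADSENSE_NO_SENSE            0x00

    typedef struct {
        uint8_t senseKey;
        uint8_t asc;
        uint8_t ascq;
    } SCSI_SENSE;

    /* Returns 1 and fills the sense triplet when the failing NVMe write
     * indicates the media has gone read-only; returns 0 otherwise. */
    static int MapReadOnlyWriteFailure(uint8_t sct, uint8_t sc, SCSI_SENSE *sense)
    {
        if (sct == NVME_SCT_COMMAND_SPECIFIC &&
            sc == NVME_SC_WRITE_TO_READ_ONLY_RANGE) {
            sense->senseKey = SCSI_SENSE_DATA_PROTECT;    /* DATA PROTECT    */
            sense->asc      = SCSI_ADSENSE_WRITE_PROTECT; /* WRITE PROTECTED */
            sense->ascq     = SCSI_ADSENSE_NO_SENSE;
            return 1;
        }
        return 0;
    }

With this sense data returned for the failed SCSI WRITE, the port driver re-issues MODE SENSE 0x3f, which is where the WP-bit handling shown earlier takes over.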
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ~WRD000.jpg Type: image/jpeg Size: 823 bytes Desc: ~WRD000.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 13168 bytes Desc: image003.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.jpg Type: image/jpeg Size: 2934 bytes Desc: image004.jpg URL: From carolyn.d.foster at intel.com Fri Apr 22 08:19:43 2016 From: carolyn.d.foster at intel.com (Foster, Carolyn D) Date: Fri, 22 Apr 2016 15:19:43 +0000 Subject: [nvmewin] Patch with changes for disk Read only support In-Reply-To: References: <6D.6D.04892.8A9E8175@epcpsbgx2.samsung.com> Message-ID: Hi Suman, I don’t have any additional feedback for this patch. Carolyn From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Thomas Freeman Sent: Thursday, April 21, 2016 9:18 AM To: suman.p at samsung.com; nvmewin at lists.openfabrics.org Subject: Re: [nvmewin] Patch with changes for disk Read only support Hi Suman, Thank you for the quick response. I agree with your comments. Thank you, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 [cid:image001.jpg at 01D19C6F.AB73C910] From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Thursday, April 21, 2016 9:55 AM To: Thomas Freeman >; nvmewin at lists.openfabrics.org Cc: anshul at samsung.com; prakash.v at samsung.com; MANOJ THAPLIYAL > Subject: Re: RE: [nvmewin] Patch with changes for disk Read only support Hi Tom, Thanks for the comments. Please find my replies below: 1. For mode sense with 0x3f, you set the WP in the response header. Shouldn WP also be set if any of those pages(0x8, 0xa, 0x1a, 0x1c) are individually requested? [Suman] Below is our observations: a. During driver install/device enable from dev manager/hot insert, the mode pages 0x8 and 0x3f are invoked. b. During online/offline of disk, only 0x3f is invoked. So as per our understanding, during disk initialization, not all the mode pages will be called. But driver gets the mode page 0x3f every time during disk initialization. Also for Detection during run time, when driver returns SCSI_SENSE_DATA_PROTECT for sense data, driver gets the mode page 0x3f consistently. So we feel, setting the WP bit for mode page 0x3f will suffice. 2. snti.c:8242 [Suman] Agreed. We will use pLunExt->IsNamespaceReadOnly = TRUE; 3. Along with the new member, IsNamespaceReadOnly, the nvme_lun_extension also has ReadOnly. It seems like the setting of WP should take into account the value of both members. [Suman] The OFA driver supports only the first lba range type for a namespace, though spec supports 64 lba range types per NS. This has to be corrected first. Also we have to decide if the disk should be exposed as Read Only if any of the LBA range type is read only or only if the LBA 0 is read only. I feel this should be taken as a separate patch since this involves too many changes. 4. If the NVMe command Get Log Page fails, (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is not freed. This corresponds to the allocation at snti.c:6530. 
[Suman] Agreed. 5. snti.c:6539: I think the following can be eliminated. The same copy occurs during SntiTranslateModeSenseResponse - snti.c:8272. [Suman] Agreed. Please let us know your opinion. Regards, Suman ------- Original Message ------- Sender : Thomas Freeman> Date : Apr 20, 2016 21:59 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for disk Read only support Hi Suman, After reviewing the code, I have a few questions/comments: 1. For mode sense with 0x3f, you set the WP bit in the response header. Shouldn't WP also be set if any of those pages (0x8, 0xa, 0x1a, 0x1c) are individually requested? 2. snti.c:8242 Lun = pLunExt->namespaceId - 1; pDevExt->pLunExtensionTable[Lun]->IsNamespaceReadOnly = TRUE; These 2 lines can be replaced with pLunExt->IsNamespaceReadOnly = TRUE; Also, the original code is not a reliable way to determine the LUN id. Here is an example where this doesn't work: the device has attached namespaces 1, 3 & 4 and existing namespaces 1, 2, 3 & 4. LUNs 0-3 will correspond to namespaces 1, 3, 4, 2. For namespace 3, the calculation NSID-1=lun will incorrectly give you a LUN id of 2 (see the lookup sketch below). 3. Along with the new member, IsNamespaceReadOnly, the nvme_lun_extension also has ReadOnly. It seems like the setting of WP should take into account the value of both members. 4. If the NVMe command Get Log Page fails (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is not freed. This corresponds to the allocation at snti.c:6530. 5. snti.c:6539: I think the following can be eliminated. The same copy occurs during SntiTranslateModeSenseResponse - snti.c:8272. if (GET_DATA_BUFFER(pSrb) != NULL) { StorPortCopyMemory((PVOID)GET_DATA_BUFFER(pSrb), (PVOID)(pSrbExt->modeSenseBuf), GET_DATA_LENGTH(pSrb)); } Let me know if you have questions, Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311 From: nvmewin [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of SUMAN PRAKASH B Sent: Tuesday, April 19, 2016 8:45 AM To: nvmewin at lists.openfabrics.org Cc: sukka.kim at samsung.com; prakash.v at samsung.com; anshul at samsung.com; MANOJ THAPLIYAL >; tru.nguyen at ssi.samsung.com Subject: [nvmewin] Patch with changes for disk Read only support Hi all, This patch includes changes for supporting NVMe disk read only mode. I have made a detailed overview of the changes in the attached doc file (the contents are also copied below) and the attached zip file contains the source code. Password is samsungnvme Please let me know if you have any questions. Thanks, Suman ****************** NVMe Disk End of Life support: Whenever an NVMe disk exhausts its P/E cycles, the disk becomes Read Only (reaches End Of Life). In this case, the user should be able to read the data from the disk for backup or migration purposes. To achieve this, the driver should inform the kernel that the disk has become read only. If the driver does not inform the kernel, the disk will be unusable from Windows. The device has to be detected as Read Only in the following 2 scenarios – a. Detection during device hot plug When a Read Only device is hot inserted, the kernel should be able to enumerate the device as Read Only and alert the user accordingly. When the SSD is hot inserted, as part of the disk initialization process, a SCSI mode sense command with page code ‘Return all pages’ (0x3f) is requested by the kernel.
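To make Tom's example above concrete, here is a small standalone C illustration (not driver code; the structure names are hypothetical) of why indexing the LUN extension table with NSID-1 is unreliable, and of a lookup that scans the table for the matching namespace ID instead. The change that was actually adopted simply sets the flag on the pLunExt already resolved for the request, so no lookup is needed at all; the scan is shown only to illustrate the mapping.

/*
 * Standalone illustration (hypothetical names, not driver code) of the
 * LUN <-> NSID mapping problem from the review comment above.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_LUNS    4
#define INVALID_LUN 0xFFFFFFFFu

typedef struct {
    uint32_t namespaceId;          /* NSID backing this LUN */
    int      isNamespaceReadOnly;
} LunExtension;

/* Find the LUN whose extension refers to the given NSID. */
static uint32_t FindLunByNamespaceId(const LunExtension *table, uint32_t count, uint32_t nsid)
{
    for (uint32_t lun = 0; lun < count; lun++) {
        if (table[lun].namespaceId == nsid) {
            return lun;
        }
    }
    return INVALID_LUN;
}

int main(void)
{
    /* Per the example: LUNs 0..3 end up backed by NSIDs 1, 3, 4, 2. */
    LunExtension table[MAX_LUNS] = {
        { .namespaceId = 1 }, { .namespaceId = 3 },
        { .namespaceId = 4 }, { .namespaceId = 2 },
    };
    uint32_t nsid = 3;

    /* Prints "NSID-1 guess: LUN 2, table scan: LUN 1" */
    printf("NSID-1 guess: LUN %u, table scan: LUN %u\n",
           (unsigned)(nsid - 1), (unsigned)FindLunByNamespaceId(table, MAX_LUNS, nsid));
    return 0;
}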
The mode page has a mode parameter header, which has a WP bit in the 'Device Specific Parameter' field that indicates whether the device is Write Protected for some reason. We can make use of this field to report to the kernel that the device has become Read Only. When the miniport driver receives this request, the NVM Express command Get Log Page is built with log identifier 'SMART / Health Information' (0x2) and sent to the device. The SMART data has a 'Critical Warning' field in which a bit, 'MediaInReadOnlyMode', is set whenever the media becomes Read Only. So if the device returns SMART data with this bit set, the miniport driver sets the Device Specific Parameter – WP bit in the mode parameter header and completes the command. When the WP bit is set in the mode parameter header, the kernel will understand that the device is Write Protected and hence will not send any more write requests. b. Detection during run time When the device is in use and the writes are exhausted and the device becomes Read Only, the kernel has to immediately report to the user that the device has become write protected. To achieve this, whenever the device receives an NVMe Write request after it has become Read Only, the device sets SCT to Command Specific Status and SC to 'Attempted Write to Read Only Range' in response to the write command. For this, the following sense data is returned for the corresponding SCSI write command: Sense key – SCSI_SENSE_DATA_PROTECT, ASC – SCSI_ADSENSE_WRITE_PROTECT and ASCQ – SCSI_ADSENSE_NO_SENSE. With this sense data, the kernel will understand that the device is in a Write Protected state, for which the Mode Sense command with mode page 'Return all pages' will be sent to the device. Again, using the NVM Express Get Log Page – SMART command, the miniport driver will set the mode sense 'Device Specific Parameter' accordingly. Code changes: 1. In SntiReturnAllModePages(), build the Get Log Page command for SMART/Health information and send it to the device. 2. In SntiTranslateModeSenseResponse(), for mode page MODE_SENSE_RETURN_ALL, set the Write Protect (WP) bit in the device-specific parameter of the mode parameter header, based on the 'media in read only mode' bit (bit 03) of the Critical Warning field returned in the SMART/Health log page. 3. The check for the volatile write cache is moved from SntiReturnAllModePages() to SntiTranslateModeSenseResponse(), after successful completion of the Get Log Page command. We have tested the following: a. On a Read Only NVMe SSD, install the OFA driver with these changes. In the Disk Management tool, the status of the disk is shown as Read Only. Please find attached “DiskMgmt.jpg” (sometimes a system restart is required after driver installation). b. Hot insert a RO NVMe SSD and observe that the status is shown as Read Only in the Disk Management tool. c. On an NVMe SSD with a low percentage of available spare (for example 10%), run the Iometer tool with write commands. When the available spare reaches 0%, the error count in Iometer starts increasing (i.e. write commands fail with the sense data explained in the sections above), and the status becomes Read Only in the Disk Management tool. d. After the disk becomes RO, trying to copy files to the RO drive makes Windows show the message "The disk is write protected". Please find attached “FileCopy.jpg” Note: a. As per NVMe spec 1.2, section 5.10.1.2, "There is not namespace specific information defined in the SMART / Health log page in this revision, thus the global log page and namespaces specific log page contain identical information".
So when testing with multiple namespaces, once one namespace becomes RO, all namespaces become RO. The spec would have to define separate SMART/Health data per namespace. b. For testing, if no NVMe SSD in the RO state is available, the following changes can be made in the driver to exercise this feature: 1. In SntiTranslateModeSenseResponse(), hardcode pNvmeLogPage->CriticalWarning.MediaInReadOnlyMode to 1 before checking the value. This can also be done per namespace. 2. In SntiMapCompletionStatus(), for the NVMe write command, hardcode statusCodeType to COMMAND_SPECIFIC_ERRORS and statusCode to 0x82. This can also be done per namespace. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2934 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.jpg Type: image/jpeg Size: 823 bytes Desc: image004.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.gif Type: image/gif Size: 13168 bytes Desc: image005.gif URL: From suman.p at samsung.com Wed Apr 27 07:39:22 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Wed, 27 Apr 2016 14:39:22 +0000 (GMT) Subject: [nvmewin] Patch with changes for disk Read only support Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604272010422_Z5JE7EUA.jpg Type: image/jpeg Size: 2934 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604272010473_LK7CT9SZ.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604272010481_BSL8PYMC.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 201604272010487_4XEV4D4T.gif Type: image/gif Size: 13168 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Samsung_DiskReadOnlySupport_v2.7z Type: application/octet-stream Size: 145972 bytes Desc: not available URL: From suman.p at samsung.com Fri Apr 29 03:49:47 2016 From: suman.p at samsung.com (SUMAN PRAKASH B) Date: Fri, 29 Apr 2016 10:49:47 +0000 (GMT) Subject: [nvmewin] Patch with changes for disk Read only support Message-ID: <0F.71.05093.B4C33275@epcpsbgx4.samsung.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604291620536_Z5JE7EUA.jpg Type: image/jpeg Size: 2934 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604291620591_LK7CT9SZ.jpg Type: image/jpeg Size: 2938 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604291620596_BSL8PYMC.jpg Type: image/jpeg Size: 823 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 201604291620601_4XEV4D4T.gif Type: image/gif Size: 13168 bytes Desc: not available URL: From raymond.c.robles at intel.com Fri Apr 29 10:47:07 2016 From: raymond.c.robles at intel.com (Robles, Raymond C) Date: Fri, 29 Apr 2016 17:47:07 +0000 Subject: [nvmewin] Patch with changes for disk Read only support In-Reply-To: References: Message-ID: <49158E750348AA499168FD41D88983607C515EF3@fmsmsx117.amr.corp.intel.com> Hi Suman, I was waiting to hear back from Tom (HGST) on your latest changes. Once I get confirmation, we can close out this patch. Thanks… Ray From: SUMAN PRAKASH B [mailto:suman.p at samsung.com] Sent: Friday, April 29, 2016 3:50 AM To: Robles, Raymond C ; nvmewin at lists.openfabrics.org Cc: PRAKASH BABU VEMULA ; MANOJ THAPLIYAL ; ANSHUL SHARMA Subject: Re: Re: RE: [nvmewin] Patch with changes for disk Read only support Hi Ray, Since Intel and HGST have reviewed the code, and I have incorporated the review comments provided by HGST, can we close out this patch? We will be ready with our next patch by next week. Thanks, Suman ------- Original Message ------- Sender : SUMAN PRAKASH B> Senior Chief Engineer/SSIR-SSD Solutions/Samsung Electronics Date : Apr 27, 2016 20:09 (GMT+05:30) Title : Re: RE: [nvmewin] Patch with changes for disk Read only support Hi All, I am sending the updated patch incorporating feedback from Tom. The changes are listed below. The password is samsungnvme 1. In SntiCompletionCallbackRoutine(), if the NVMe command Get Log Page fails (SCT != Generic_command_status || SC != Successful completion), the buffer pSrbExt->pDatBuffer is freed. 2. In SntiTranslateModeSenseResponse(), replaced Lun = pLunExt->namespaceId - 1; pDevExt->pLunExtensionTable[Lun]->IsNamespaceReadOnly = TRUE; with pLunExt->IsNamespaceReadOnly = TRUE; 3. In SntiReturnAllModePages(), the following code is removed, as the same copy occurs in SntiTranslateModeSenseResponse(): if (GET_DATA_BUFFER(pSrb) != NULL) { StorPortCopyMemory((PVOID)GET_DATA_BUFFER(pSrb), (PVOID)(pSrbExt->modeSenseBuf), GET_DATA_LENGTH(pSrb)); } Please let me know if you have any questions. Thanks, Suman ------- Original Message ------- Sender : Foster, Carolyn D> Date : Apr 22, 2016 20:49 (GMT+05:30) Title : RE: [nvmewin] Patch with changes for disk Read only support Hi Suman, I don’t have any additional feedback for this patch.
Carolyn
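For reference, a standalone C model of the run-time detection path described earlier in this thread (not taken from the driver; the macro and structure names are illustrative): an NVMe Write completion with SCT = Command Specific Status (1h) and SC = 82h (Attempted Write to Read Only Range) is mapped to SCSI sense data DATA PROTECT / WRITE PROTECTED, which is what prompts the kernel to re-issue Mode Sense with page 0x3f.

/*
 * Standalone model (illustrative names, not driver code) of mapping the
 * NVMe "Attempted Write to Read Only Range" completion to SCSI sense data.
 */
#include <stdint.h>
#include <stdio.h>

#define NVME_SCT_COMMAND_SPECIFIC          0x1
#define NVME_SC_WRITE_TO_READ_ONLY_RANGE   0x82

#define SCSI_SENSE_DATA_PROTECT            0x07
#define SCSI_ADSENSE_WRITE_PROTECT         0x27
#define SCSI_ADSENSE_NO_SENSE              0x00

typedef struct {
    uint8_t senseKey;
    uint8_t asc;
    uint8_t ascq;
} ScsiSense;

/* Returns 1 and fills the sense data if this completion means "write protected". */
static int MapWriteProtectStatus(uint8_t sct, uint8_t sc, ScsiSense *sense)
{
    if (sct == NVME_SCT_COMMAND_SPECIFIC && sc == NVME_SC_WRITE_TO_READ_ONLY_RANGE) {
        sense->senseKey = SCSI_SENSE_DATA_PROTECT;
        sense->asc      = SCSI_ADSENSE_WRITE_PROTECT;
        sense->ascq     = SCSI_ADSENSE_NO_SENSE;
        return 1;
    }
    return 0;
}

int main(void)
{
    ScsiSense sense = {0};
    if (MapWriteProtectStatus(0x1, 0x82, &sense)) {
        printf("sense key %#x, ASC %#x, ASCQ %#x\n",
               (unsigned)sense.senseKey, (unsigned)sense.asc, (unsigned)sense.ascq);
    }
    return 0;
}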
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2934 bytes Desc: image001.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.jpg Type: image/jpeg Size: 2938 bytes Desc: image002.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.jpg Type: image/jpeg Size: 823 bytes Desc: image003.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image004.gif Type: image/gif Size: 13168 bytes Desc: image004.gif URL: From thomas.freeman at hgst.com Fri Apr 29 10:52:26 2016 From: thomas.freeman at hgst.com (Thomas Freeman) Date: Fri, 29 Apr 2016 17:52:26 +0000 Subject: [nvmewin] Patch with changes for disk Read only support In-Reply-To: <49158E750348AA499168FD41D88983607C515EF3@fmsmsx117.amr.corp.intel.com> References: <49158E750348AA499168FD41D88983607C515EF3@fmsmsx117.amr.corp.intel.com> Message-ID: I accept the change. Tom Freeman Software Engineer, Device Manager and Driver Development Western Digital Corporation e. Thomas.freeman at hgst.com o. +1-507-322-2311
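Change 1 in the accepted patch frees the log-page buffer when Get Log Page completes with an error. A standalone sketch of that error path follows (illustrative names; plain free() stands in for the StorPort pool routine the driver itself would use):

/*
 * Standalone model (illustrative names, not driver code) of accepted change 1:
 * if Get Log Page completes with SCT != Generic Command Status or
 * SC != Successful Completion, release the buffer allocated for the
 * Mode Sense translation instead of leaking it.
 */
#include <stdint.h>
#include <stdlib.h>

#define NVME_SCT_GENERIC_COMMAND_STATUS  0x0
#define NVME_SC_SUCCESSFUL_COMPLETION    0x00

typedef struct {
    void *pDataBuffer;   /* stands in for the SRB extension's log-page buffer */
} SrbExtensionModel;

/* Completion handling: on failure, free the buffer and clear the pointer. */
static void OnGetLogPageCompletion(SrbExtensionModel *srbExt, uint8_t sct, uint8_t sc)
{
    if (sct != NVME_SCT_GENERIC_COMMAND_STATUS || sc != NVME_SC_SUCCESSFUL_COMPLETION) {
        free(srbExt->pDataBuffer);
        srbExt->pDataBuffer = NULL;
        return;
    }
    /* success path: translate the SMART data into the Mode Sense response here */
}

int main(void)
{
    SrbExtensionModel srbExt = { .pDataBuffer = malloc(512) };

    OnGetLogPageCompletion(&srbExt, 0x1, 0x82);   /* failed completion */
    return srbExt.pDataBuffer != NULL;            /* 0 = buffer released as expected */
}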
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ~WRD000.jpg Type: image/jpeg Size: 823 bytes Desc: ~WRD000.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 2934 bytes Desc: image005.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image006.jpg Type: image/jpeg Size: 2938 bytes Desc: image006.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image007.gif Type: image/gif Size: 13168 bytes Desc: image007.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 2934 bytes Desc: image001.jpg URL: