[nvmewin] Handling IO when Format NVM op is in progress
Judy Brock-SSI
judy.brock at ssi.samsung.com
Wed Jan 8 16:24:45 PST 2014
Hello,
Do we know for sure at this point whether returning NOT READY/FORMAT IN PROGRESS while the format op is underway, and then notifying Storport that a bus change has occurred (and thus that re-enumeration is required), would be enough to signal to the upper storage stack layers that they need to refresh their view of the relevant block devices and their properties (capacity, initialized/uninitialized, etc.)? What if no IOs come down during the format op?
If not, can someone elaborate on how the miniport would implement the following?
[Barrett wrote]: It is possible to get partmgr, disk and upper storage layers (ex. VDS) to recognize capacity (and I assume physical geometry but have never actually tested it) changes to a LUN using IOCTL_DISK_UPDATE_PROPERTIES and IOCTL_DISK_UPDATE_DRIVE_SIZE. From a storage driver point of view, they need to be sent to the top of the driver stack and getting the necessary PDOs in a storport miniport requires some extra work.
Thanks,
Judy
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Mayes, Barrett N
Sent: Tuesday, January 07, 2014 8:42 PM
To: Jeff Glass; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress
It is possible to get partmgr, disk and upper storage layers (ex. VDS) to recognize capacity (and I assume physical geometry but have never actually tested it) changes to a LUN using IOCTL_DISK_UPDATE_PROPERTIES and IOCTL_DISK_UPDATE_DRIVE_SIZE. From a storage driver point of view, they need to be sent to the top of the driver stack and getting the necessary PDOs in a storport miniport requires some extra work. They can more easily be sent from a user-mode app such as the one that might initiate the format in the first place. But I agree, it is desirable to provide a consistent experience across devices and that would be more difficult if you had to coordinate with various 3rd party tools.
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Jeff Glass
Sent: Tuesday, January 07, 2014 8:13 PM
To: nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress
Unfortunately, the Windows storage stack does not recognize the unit attention code (capacity data has changed) that could be used to report a change in device capacity (at least it didn't in Server 2012), so to get the system to rescan the device and discover that the capacity has changed, a BusChangeDetected notification needs to be reported.
In my experience, SCSI RAID controller or HBA drivers don't do anything to get the O/S to recognize the change, which in turn requires the user to manually disable and re-enable the device to get Windows to recognize that the device's capacity has changed. The NVMe driver is in a position to provide a better experience that is consistent across hardware for all manufacturers by eliminating the need for manual user intervention.
Jeff
On 1/7/2014 8:44 AM, Luse, Paul E wrote:
Wrt Judy's (a) below, I believe the original concern that drove us to the implementation was that w/NVMe the block size can be changed with a format, whereas that can't happen with a SCSI format... I could be mistaken, but on a quick scan of the email threads I didn't see that point mentioned. We felt like the only way to get the upper layers to discover the potentially changed block size was to tear it down/bring it back up.
Thx
Paul
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Neal Galbo (ngalbo)
Sent: Tuesday, January 7, 2014 7:27 AM
To: Mayes, Barrett N; Judy Brock-SSI; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress
The conditions you describe would never happen with a SCSI device. In general, LUs and LUNs never disappear in SCSI; they are not transient. They are not dynamic. They are static. They either exist or they don't. They don't hide once enumerated/attached/located. Unlike namespaces.
The media, backing storage or provisioning can change, but the LU would always be available for communication (commands). Other LUs in the same device would not be affected either - they are independent entities relative to each other.
-Neal
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Mayes, Barrett N
Sent: Tuesday, January 07, 2014 12:15 AM
To: Judy Brock-SSI; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress
What problem do you want to solve by keeping the block device around during a format and allowing IO through so it can be failed with check condition?
Namespace not ready can't generically be translated to SCSI check condition/not ready/Format In Progress. The driver would need to know that a format command is outstanding so it could translate that correctly (for 1.0-based device support), since the namespace could be not ready for reasons other than a format in progress. If the driver already has to know a format is in progress, it could just fail commands without sending them to the device (so no need for the new failure code). But if that's the case, why _not_ hide the LUN until the format is complete? By hiding the LUN and bringing it back when the format is complete, you don't have to worry about handling IO and you also take care of the re-enumeration that has to happen when the format is complete (in the case of changing the LBA Format).
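A minimal sketch of the translation problem described above, not taken from the nvmewin sources. The names `ns_state`, `scsi_sense`, `translate_not_ready` and `fail_without_dispatch` are hypothetical; 0x82 is assumed to be the NVMe generic status code for Namespace Not Ready, and the sense values are the standard SCSI ones (sense key 0x02 NOT READY, ASC 0x04, ASCQ 0x04 for format in progress, ASCQ 0x00 for cause not reportable).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed NVMe generic command status value for Namespace Not Ready. */
#define NVME_SC_NAMESPACE_NOT_READY 0x82

/* Hypothetical container for the three SCSI sense fields of interest. */
struct scsi_sense { uint8_t key, asc, ascq; };

/* Hypothetical per-namespace state: the driver itself must remember
 * that it issued a Format NVM command, because the device status alone
 * does not say why the namespace is not ready. */
struct ns_state { bool format_outstanding; };

/* Translate an NVMe "namespace not ready" completion into SCSI sense.
 * Returns false if the status is not one this helper handles. */
static bool translate_not_ready(const struct ns_state *ns, uint8_t nvme_sc,
                                struct scsi_sense *out)
{
    assert(ns != NULL && out != NULL);
    if (nvme_sc != NVME_SC_NAMESPACE_NOT_READY)
        return false;
    out->key = 0x02;                               /* NOT READY */
    out->asc = 0x04;                               /* LOGICAL UNIT NOT READY */
    out->ascq = ns->format_outstanding ? 0x04      /* FORMAT IN PROGRESS */
                                       : 0x00;     /* CAUSE NOT REPORTABLE */
    return true;
}

/* If the driver already knows a format is outstanding, it can fail the
 * IO itself without ever dispatching it to the device. */
static bool fail_without_dispatch(const struct ns_state *ns,
                                  struct scsi_sense *out)
{
    assert(ns != NULL && out != NULL);
    if (!ns->format_outstanding)
        return false;
    out->key = 0x02;
    out->asc = 0x04;
    out->ascq = 0x04;
    return true;
}
```

Both helpers depend on the same `format_outstanding` flag, which is the point being made: once the driver tracks that state, sending the command to the device adds nothing.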
-Barrett
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Judy Brock-SSI
Sent: Monday, January 06, 2014 8:09 PM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] Handling IO when Format NVM op is in progress
All,
Many months ago, I initiated a thread (see attached) in which I argued that there were some holes in the current implementation of Format NVM ioctl passthru and in which I advocated for, among other things, the addition of logic to make sure pseudo-SCSI bus re-enumeration had fully taken place in the driver (such that Storport was notified that no "luns" were present) before the actual Format NVM op was launched.
I intuitively understand - and up to this point have unquestioningly agreed with - the basic assumption that the reason the namespaces must be removed/ "luns" disappeared prior to formatting is because "we cannot format a namespace while the OS is aware of its presence and could be potentially sending I/O to a stale namespace config (i.e. changing LBA/sector size)." (excerpt from attached thread).
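The ordering argued for above (hide the "lun", wait until re-enumeration has actually observed its absence, and only then launch the format) can be sketched as a small state machine. This is an illustration under stated assumptions, not the nvmewin implementation: every name here is hypothetical, and `notify_bus_change` is only a stand-in counter for whatever mechanism the miniport uses to tell Storport that the bus has changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases a namespace passes through around a format. */
enum ns_phase { NS_VISIBLE, NS_HIDE_PENDING, NS_HIDDEN, NS_FORMATTING };

struct ns_ctx {
    enum ns_phase phase;
    int bus_change_notifications;  /* counts stand-in notifications */
};

/* Stand-in for the miniport reporting a bus change to Storport. */
static void notify_bus_change(struct ns_ctx *ns)
{
    ns->bus_change_notifications++;
}

/* Whether the "lun" is reported during (re-)enumeration. */
static bool lun_reported(const struct ns_ctx *ns)
{
    return ns->phase == NS_VISIBLE;
}

/* Step 1: a format request arrives; stop reporting the lun and ask
 * for a rescan. */
static bool format_requested(struct ns_ctx *ns)
{
    assert(ns != NULL);
    if (ns->phase != NS_VISIBLE)
        return false;
    ns->phase = NS_HIDE_PENDING;
    notify_bus_change(ns);
    return true;
}

/* Step 2: the rescan has run and seen that no lun is present. */
static void rescan_saw_no_lun(struct ns_ctx *ns)
{
    if (ns->phase == NS_HIDE_PENDING)
        ns->phase = NS_HIDDEN;
}

/* Step 3: the format may only be launched once the lun is hidden. */
static bool launch_format(struct ns_ctx *ns)
{
    if (ns->phase != NS_HIDDEN)
        return false;
    ns->phase = NS_FORMATTING;
    return true;
}

/* Step 4: format completed; expose the lun again and ask for a rescan
 * so the new LBA format is discovered. */
static void format_completed(struct ns_ctx *ns)
{
    if (ns->phase != NS_FORMATTING)
        return;
    ns->phase = NS_VISIBLE;
    notify_bus_change(ns);
}
```

The refusal in `launch_format` is the gap being pointed at: without the intermediate NS_HIDE_PENDING state, the format could start while the OS still believes the old namespace configuration is present.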
The question has recently arisen in internal discussion however as to whether or not we really have to do this. It was pointed out that SCSI devices (real ones) are capable of receiving IO commands while SCSI format commands are in process. They will return the following error:
SCSI status = Check condition, sense key = NOT READY (0x2), additional sense code = LUN NOT READY (0x4), additional sense code qualifier = FORMAT IN PROGRESS (0x4)
Why then, instead of removing namespaces/luns, can our Windows driver not return the same error status a real SCSI drive would return in such a situation? One would assume that the upper layers of the storage stack have plenty of years of experience in knowing what to do when they see that error.
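For reference, a sketch of what that error looks like as fixed-format SCSI sense data. The helper name `build_fixed_sense` and the macro names are hypothetical; the byte layout (response code 0x70, sense key in byte 2, additional sense length in byte 7, ASC in byte 12, ASCQ in byte 13) is the standard fixed-format layout, and an 18-byte buffer is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values for the "format in progress" condition quoted above. */
#define SENSE_KEY_NOT_READY      0x02
#define ASC_LUN_NOT_READY        0x04
#define ASCQ_FORMAT_IN_PROGRESS  0x04

/* Assumed length of the fixed-format sense buffer. */
#define FIXED_SENSE_LEN 18

/* Fill buf with fixed-format sense data for the given key/ASC/ASCQ.
 * Returns the number of bytes written, or 0 if buf is too small. */
static size_t build_fixed_sense(uint8_t *buf, size_t len,
                                uint8_t key, uint8_t asc, uint8_t ascq)
{
    assert(buf != NULL);
    if (len < FIXED_SENSE_LEN)
        return 0;
    memset(buf, 0, FIXED_SENSE_LEN);
    buf[0]  = 0x70;                 /* current error, fixed format */
    buf[2]  = key & 0x0F;           /* sense key lives in the low nibble */
    buf[7]  = FIXED_SENSE_LEN - 8;  /* additional sense length */
    buf[12] = asc;                  /* additional sense code */
    buf[13] = ascq;                 /* additional sense code qualifier */
    return FIXED_SENSE_LEN;
}
```

A driver taking this approach would complete the request with CHECK CONDITION status and sense data built from SENSE_KEY_NOT_READY, ASC_LUN_NOT_READY and ASCQ_FORMAT_IN_PROGRESS.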
As a point of comparison, there is no standard I am aware of which specifies that Storport miniports which support real SCSI devices, if they happen to provide a proprietary pass-thru to allow a SCSI format command to go through to a device, must cause all LUNs to appear offline prior to formatting.
One could even argue (and they have!) that these IO commands could be allowed to go through to the NVMe device itself (as in the real SCSI case). NVMe 1.1 Technical Proposal 005 has defined a new format-in-progress status code that NVMe firmware will be able to return at some point in the future; in the meantime, current firmware could easily return NAMESPACE_NOT_READY and the driver could translate that to the above SCSI sense data, etc.
So ... here I stand, devil's advocate hat in hand, hoping to find out:
a) what the "back story" is on how this decision was ultimately made (the attached thread said a lot of discussion took place on the subject)
b) whether or not the diametrically-opposed alternative I am discussing above was thoroughly considered and if so, why it was rejected
c) whether the topic bears reconsidering at this point.
Thanks in advance for your collective consideration,
Judy