[nvmewin] Handling IO when Format NVM op is in progress

Judy Brock-SSI judy.brock at ssi.samsung.com
Tue Jan 7 21:33:54 PST 2014


  >>For 1, why not just fail IO with CHECK_CONDITION NOT_READY FORMAT_IN_PROGRESS

That is my precise proposal :)

  >> For 2, can you just use IoInvalidateDeviceRelations() after the format and allow PNP to re-enumerate the bus and all devices?

The Storport miniport should confine itself to using Storport APIs. Also, that API is designed for use by bus drivers such as PCI.SYS, the Windows PCI bus driver. The Windows NVMe Storport miniport is not a bus driver.

Thanks,
Judy

From: Speer, Kenny [mailto:Kenny.Speer at netapp.com]
Sent: Tuesday, January 07, 2014 9:21 PM
To: Judy Brock-SSI; Mayes, Barrett N; Neal Galbo (ngalbo); nvmewin at lists.openfabrics.org
Subject: RE: Handling IO when Format NVM op is in progress

I'm not involved in this project yet, but am tracking it so forgive my ignorance.

It seems to me you have two goals:

1.       Fail IO that is sent during a format (it doesn't seem that you've agreed on the method yet), since you can't control what an application may attempt

2.       Notify Windows that the device geometry has changed

For 1, why not just fail IO with CHECK_CONDITION NOT_READY FORMAT_IN_PROGRESS
For 2, can you just use IoInvalidateDeviceRelations() after the format and allow PNP to re-enumerate the bus and all devices?

Alternatively, removing the device while it is inaccessible is not a bad idea, and SCSI devices are transient in some scenarios (VSS use cases, for instance).

Somebody mentioned READ_CAP_DATA_CHANGED not working in 2012. While this is off topic, I have not seen that issue in other environments.

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Judy Brock-SSI
Sent: Tuesday, January 7, 2014 8:48 PM
To: Mayes, Barrett N; Neal Galbo (ngalbo); nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress

Hi Barrett,

You make good points below.

  >> If current code isn't working as intended, it is a bug and should be fixed.

It is - and it is partially in revisiting the work involved in fixing the current code that the discussion took the turn it did internally :).

   >> If there is a compelling reason to change the intended behavior, let's discuss.

I think we're all in agreement with the goal - to alert the upper layers to potential geometric changes, etc. It's just a matter of the simplest way to get there. I think it would be good to rediscuss, as we've started to do here. We need to fix the code in any case; in doing so, we may or may not conclude that the best way is to keep the current design (as it was intended to work).

Thanks,
Judy

From: Mayes, Barrett N [mailto:barrett.n.mayes at intel.com]
Sent: Tuesday, January 07, 2014 8:29 PM
To: Judy Brock-SSI; Neal Galbo (ngalbo); nvmewin at lists.openfabrics.org
Subject: RE: Handling IO when Format NVM op is in progress

There is no definition in NVMe spec or SCSI to NVMe translation reference doc that defines namespaces as either static or transient.  I would argue they are transient because they can be created, destroyed, resized, and have their physical properties changed.  But I think it is fair to say it is undefined.

There is no requirement in current specs for Format NVM command to not change the number of namespaces.  Given namespace management in 1.0 and 1.1 is vendor specific, it's conceivable a device might leverage secure erase to reset namespaces to a default/factory config.

In NVMe, the adapter/controller is the static object and management commands are directed towards the admin queue that is associated with that controller.

If current code isn't working as intended, it is a bug and should be fixed.  If there is a compelling reason to change the intended behavior, let's discuss.

From: Judy Brock-SSI [mailto:judy.brock at ssi.samsung.com]
Sent: Tuesday, January 07, 2014 6:42 PM
To: Neal Galbo (ngalbo); Mayes, Barrett N; nvmewin at lists.openfabrics.org
Subject: RE: Handling IO when Format NVM op is in progress

    >>In general, LU's and LUN's never disappear in SCSI; they are not transient. They are not dynamic. They are static. They either exist or they don't. They don't hide, once enumerated/attached/located. Unlike namespaces.

Like SCSI LUNs, NVMe namespaces are also not transient, and (I now submit, having reversed positions :)) they should not hide either.

The Format NVM command does not result in a different number of namespaces than existed before the operation.  While namespaces can be formatted with a different LBA Format than before, they are still static in terms of their existence/non-existence.

Additionally, the current code is not achieving what it intended to do, since it does not actually hide any LUNs before launching the Format NVM op. That is, it does not wait until the existing LUNs are "gone" (i.e. until the miniport fails to report them in a subsequent inquiry) before it starts the operation.
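
The ordering problem described above can be sketched in C. This is only an illustration under stated assumptions: the type and function names (fmt_ctx_t, fmt_request, fmt_lun_reported_gone, fmt_try_launch) are hypothetical and are not taken from the driver source. The point is simply that the Format NVM op is not launched until the count of LUNs still reported to the upper layers has reached zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases of a Format NVM pass-through request. */
typedef enum {
    FMT_IDLE,            /* no format requested */
    FMT_WAIT_LUNS_GONE,  /* format requested; waiting for LUNs to disappear */
    FMT_IN_PROGRESS      /* Format NVM command may now be issued */
} fmt_phase_t;

/* Hypothetical bookkeeping; not the driver's real structure. */
typedef struct {
    fmt_phase_t phase;
    unsigned    visible_luns;  /* LUNs the miniport still reports on inquiry */
} fmt_ctx_t;

/* A format request arrives: start hiding LUNs, but do not format yet. */
void fmt_request(fmt_ctx_t *c)
{
    assert(c != NULL && c->phase == FMT_IDLE);
    c->phase = FMT_WAIT_LUNS_GONE;
}

/* Called each time a subsequent inquiry for a LUN has been failed. */
void fmt_lun_reported_gone(fmt_ctx_t *c)
{
    assert(c != NULL);
    if (c->visible_luns > 0)
        c->visible_luns--;
}

/* Returns true only once every LUN is gone; only then may the op start. */
bool fmt_try_launch(fmt_ctx_t *c)
{
    assert(c != NULL);
    if (c->phase != FMT_WAIT_LUNS_GONE || c->visible_luns != 0)
        return false;
    c->phase = FMT_IN_PROGRESS;
    return true;
}
```

With two LUNs visible, fmt_try_launch() keeps returning false until fmt_lun_reported_gone() has been called twice, which is the wait that the current code is said to be missing.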

Thanks,
Judy

From: Neal Galbo (ngalbo) [mailto:ngalbo at micron.com]
Sent: Tuesday, January 07, 2014 6:27 AM
To: Mayes, Barrett N; Judy Brock-SSI; nvmewin at lists.openfabrics.org
Subject: RE: Handling IO when Format NVM op is in progress

The conditions you describe would never happen with a SCSI device. In general, LU's and LUN's never disappear in SCSI; they are not transient. They are not dynamic. They are static. They either exist or they don't. They don't hide, once enumerated/attached/located. Unlike namespaces.

The media, backing storage or provisioning can change, but the LU would always be available for communication (commands). Other LU's in the same device would not be affected either - they are independent entities relative to each other.

-Neal

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Mayes, Barrett N
Sent: Tuesday, January 07, 2014 12:15 AM
To: Judy Brock-SSI; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling IO when Format NVM op is in progress

What problem do you want to solve by keeping the block device around during a format and allowing IO through so it can be failed with check condition?

Namespace not ready can't generically be translated to SCSI check condition/not ready/Format In Progress.  The driver would need to know that a format command is outstanding so it could translate that correctly (for 1.0-based device support), since the namespace could be not ready for reasons other than a format in progress.  If the driver already has to know a format is in progress, it could just fail commands without sending them to the device (so no need for the new failure code).  But if that's the case, why _not_ hide the LUN until the format is complete?  By hiding the LUN and bringing it back when the format is complete, you don't have to worry about handling IO and you also take care of the re-enumeration that has to happen when the format is complete (in the case of changing the LBA Format).
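
The "driver already knows" case above can be sketched as a simple gate in front of the submission path. This is a minimal C illustration, not the driver's code: ns_state_t, io_decision_t and classify_io are hypothetical names, and the flag is assumed to be set before Format NVM is issued and cleared when it completes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-namespace state; not the driver's real structure. */
typedef struct {
    bool format_in_progress;  /* assumed: set before Format NVM, cleared on completion */
} ns_state_t;

typedef enum {
    IO_DISPATCH_TO_DEVICE,       /* forward the command to an NVMe queue */
    IO_FAIL_FORMAT_IN_PROGRESS   /* complete locally, never reaching the device */
} io_decision_t;

/* Decide what to do with an incoming IO for this namespace. */
io_decision_t classify_io(const ns_state_t *ns)
{
    assert(ns != NULL);
    return ns->format_in_progress ? IO_FAIL_FORMAT_IN_PROGRESS
                                  : IO_DISPATCH_TO_DEVICE;
}
```

Because the decision is made from driver state alone, no new NVMe failure code is needed for this path, which is the point made above.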

-Barrett

From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Judy Brock-SSI
Sent: Monday, January 06, 2014 8:09 PM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] Handling IO when Format NVM op is in progress

All,

Many months ago, I initiated a thread (see attached) in which I argued that there were some holes in the current implementation of the Format NVM ioctl passthru. In it I advocated for, among other things, the addition of logic to make sure pseudo-SCSI bus re-enumeration had fully taken place in the driver (such that Storport was notified that no "luns" were present) before the actual Format NVM op was launched.

I intuitively understand - and up to this point have unquestioningly agreed with - the basic assumption that the namespaces must be removed ("luns" made to disappear) prior to formatting because "we cannot format a namespace while the OS is aware of its presence and could be potentially sending I/O to a stale namespace config (i.e. changing LBA/sector size)." (excerpt from attached thread).

The question has recently arisen in internal discussion, however, as to whether or not we really have to do this. It was pointed out that SCSI devices (real ones) are capable of receiving IO commands while SCSI format commands are in progress. They will return the following error:

      SCSI status =   Check condition, sense code = NOT READY (0x2), additional sense code = LUN NOT READY (0x4), additional sense code qualifier = FORMAT IN PROGRESS (0x4)

Why then, instead of removing namespaces/luns, can our Windows driver not return the same error status a real SCSI drive would return in such a situation? One would assume that the upper layers of the storage stack have many years of experience in knowing what to do when they see that error.
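
For illustration, the error above could be built roughly as follows. This is a minimal C sketch that assumes 18-byte fixed-format sense data; the helper name build_format_in_progress_sense is hypothetical, and the actual driver would fill the sense buffer of the SRB rather than a bare byte array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values as given in the error above. */
#define SCSI_STATUS_CHECK_CONDITION 0x02
#define SENSE_KEY_NOT_READY         0x02
#define ASC_LUN_NOT_READY           0x04
#define ASCQ_FORMAT_IN_PROGRESS     0x04
#define FIXED_SENSE_LEN             18

/* Hypothetical helper: fill a fixed-format sense buffer and return the
 * SCSI status that should be reported alongside it. */
uint8_t build_format_in_progress_sense(uint8_t *sense, size_t len)
{
    assert(sense != NULL && len >= FIXED_SENSE_LEN);
    memset(sense, 0, len);
    sense[0]  = 0x70;                  /* response code: current error, fixed format */
    sense[2]  = SENSE_KEY_NOT_READY;   /* sense key lives in the low nibble of byte 2 */
    sense[7]  = FIXED_SENSE_LEN - 8;   /* additional sense length (bytes after byte 7) */
    sense[12] = ASC_LUN_NOT_READY;     /* additional sense code */
    sense[13] = ASCQ_FORMAT_IN_PROGRESS; /* additional sense code qualifier */
    return SCSI_STATUS_CHECK_CONDITION;
}
```

The caller would complete the failed IO with the returned status and hand the filled buffer back to the upper layers as the sense data.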

As a point of comparison, there is no standard I am aware of which specifies that Storport miniports which support real SCSI devices, if they happen to provide a proprietary pass-thru to allow a SCSI format command to go through to a device, must cause all LUNs to appear offline prior to formatting.

One could even argue (and they have!) that these IO commands could be allowed to go through to the NVMe device itself (as in the real SCSI case). NVMe 1.1 Technical Proposal 005 has defined a new format-in-progress status code that NVMe firmware will be able to return at some point in the future. In the meantime, current firmware could easily return NAMESPACE_NOT_READY and the driver could translate that to the above SCSI sense data, etc.

So ...  here I stand, devil's advocate hat in hand, hoping to find out:


a)       what the "back story" is on how this decision was ultimately made (the attached thread said a lot of discussion took place on the subject)

b)       whether or not the diametrically-opposed alternative I am discussing above was thoroughly considered and if so, why it was rejected

c)       whether the topic bears reconsidering at this point.

Thanks in advance for your collective consideration,

Judy
