[nvmewin] Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Mayes, Barrett N barrett.n.mayes at intel.com
Thu May 9 09:31:35 PDT 2013


the miniport cannot assume that, after signaling a Bus Change Notification to initiate the removal process, once the call to signal the Bus Change Notification returns, the upper layers will already have done all the work involved in the device removal process.

Correct.  The LUN isn’t “gone” until the miniport fails to report it in a subsequent inquiry.

From: Judy Brock-SSI [mailto:judy.brock at ssi.samsung.com]
Sent: Wednesday, May 01, 2013 4:29 PM
To: Mayes, Barrett N; Robles, Raymond C; nvmewin at lists.openfabrics.org
Subject: RE: [nvmewin] Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hi Barrett,

To sum up the info you provide below, I think it confirms the following point I was trying to make:

the miniport cannot assume that, after signaling a Bus Change Notification to initiate the removal process, once the call to signal the Bus Change Notification returns, the upper layers will already have done all the work involved in the device removal process.

In fact, there is a lot of work involved in device removal (for example, we both cited the new set of inquiries that come in to the miniport, which allow the miniport to signal that a previously reported LUN is no longer present).

Thanks,
Judy


From: Mayes, Barrett N [mailto:barrett.n.mayes at intel.com]
Sent: Wednesday, May 01, 2013 10:00 AM
To: Judy Brock-SSI; Robles, Raymond C; nvmewin at lists.openfabrics.org
Subject: RE: [nvmewin] Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

I’ll take a quick stab at the first part.  Will need Ray or someone with more experience on Format implementation to address the 2nd part as I’m not sure on the synchronization of completing the inquiries following the Bus Change notification and starting of the format command.


a)     I don’t think it’s true that Storport can handle commands completing for a device it no longer has a record of – i.e., after the target was removed due to re-enumeration. I think it will have torn down its own structures for any old LUN(s) we had previously exposed and that would include any record it had of commands outstanding for those old LUNs. I think it isn’t going to hold on to ghost requests on behalf of devices that it no longer has a record of because it has no place to store such requests anyway at that point / no object to associate them with.

In general, the port driver will stall its queues, complete any IO already in flight or queued to the device, then delete the device object.  The process is fairly involved and is documented here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff561046(v=vs.85).aspx  This describes how a WDM driver handles the remove process.  The other piece to this is the Storport<->miniport interaction.  To initiate the removal process, the miniport can signal a Bus Change Notification to cause Storport to send a new set of inquiries.  If a previously reported LUN is not reported in a subsequent inquiry, Storport will call IoInvalidateDeviceRelations and, in the bus relations query that follows, will not report back the previously reported Device Object.  See the “BusRelations Request” section here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff551670(v=vs.85).aspx and docs for IRP_MN_REMOVE_DEVICE here: http://msdn.microsoft.com/en-us/library/windows/hardware/ff551738(v=vs.85).aspx.
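
The timing point can be illustrated with a small model. This is only a sketch under stated assumptions: `PortDriverModel`, `bus_change_detected` and `run_deferred_work` are hypothetical names, not Storport APIs, and the model assumes the rescan is queued rather than run inside the notification call. Whether Storport actually defers it is not established in this thread; the sketch shows why the miniport should not rely on the opposite.

```python
from collections import deque


class PortDriverModel:
    """Toy stand-in for the port driver's view of the bus (not Storport)."""

    def __init__(self, report_luns):
        # report_luns is a callable standing in for the miniport's answer
        # to a new set of inquiries.
        self.report_luns = report_luns
        self.known_luns = set(report_luns())
        self.deferred = deque()

    def bus_change_detected(self):
        # Assumption: the notification only queues a rescan and returns.
        self.deferred.append(self._rescan)

    def run_deferred_work(self):
        # Runs later, outside the miniport's call to the notification.
        while self.deferred:
            self.deferred.popleft()()

    def _rescan(self):
        # A LUN is "gone" only once the miniport fails to report it here.
        self.known_luns = set(self.report_luns())


def demo():
    online = {0, 1}
    port = PortDriverModel(lambda: sorted(online))
    online.discard(1)                 # miniport stops exposing LUN 1
    port.bus_change_detected()        # returns immediately
    known_right_after_notify = 1 in port.known_luns
    port.run_deferred_work()          # the rescan happens some time later
    known_after_rescan = 1 in port.known_luns
    return known_right_after_notify, known_after_rescan
```

In this model the port driver still knows about LUN 1 when the notification call returns, and forgets it only after the rescan has run.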





From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Judy Brock-SSI
Sent: Wednesday, May 01, 2013 3:33 AM
To: Robles, Raymond C; nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hi Ray,

I was wondering if you or others had any thoughts about the below.

Thanks,
Judy

From: Judy Brock-SSI
Sent: Monday, April 22, 2013 11:27 PM
To: 'Robles, Raymond C'; WONMOON CHEON; ???; nvmewin at lists.openfabrics.org
Subject: RE: RE: Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hi Ray,

Thanks for all the details.

The stuff about hot adding the namespaces back in, reissuing Identify Controller/Identify Device, etc was very easy to follow in the driver; it’s clear how & why that’s done.

However I’m still having some difficulty with the finishing-up-IOs-outstanding before starting the format issue.

I may be just covering ground the group covered a long time ago but I’m wondering if all the assumptions below are valid. I’m also interested in knowing how the group validated them.  Here are my concerns:

   >>  if a format comes down for a namespace, any I/O outstanding to the controller (there won’t be anything that needs to be sent… I/O will either be at the device or on the CQ)

Why do we think there won’t be anything that needs to be sent? The app that sends a pass through IOCTL to do the format is presumably completely independent from say any file IO that might be going on on behalf of other apps, or even raw IO from apps like Iometer. Seems like lots of IO could still be coming in when the format IOCTL is received.

   >>  any I/O outstanding to the controller …will simply complete via normal operation or be aborted at the controller… but Storport won’t care because the SCSI target was already removed upon initially receiving the format command. Any I/O in the CQ will be completed and handled by Storport correctly.

I am wondering about two of the assumptions about timing in the paragraph above.


a)      I don’t think it’s true that Storport can handle commands completing for a device it no longer has a record of – i.e., after the target was removed due to re-enumeration. I think it will have torn down its own structures for any old LUN(s) we had previously exposed and that would include any record it had of commands outstanding for those old LUNs. I think it isn’t going to hold on to ghost requests on behalf of devices that it no longer has a record of because it has no place to store such requests anyway at that point / no object to associate them with.


b)     We are assuming that when we call StorPortNotification() with BusChangeDetected, Storport will come back in to the driver to rescan the bus – either via a bunch of Inquiry cmds or via a Report Luns cmd – and will finish all the work associated with the bus scan, updating its record of device topology (i.e., removing any SCSI target/LUNs that were associated with the NVMe device we are about to format) – all before it returns from our call to StorPortNotification() and before the driver code continues on to call ProcessIo to start the real NVMe Format NVM operation.


I don’t know that that is a safe assumption to make. The bus scan could be deferred until the driver returns. Or, even if it is launched right away, could the bus scan take place on a different processor while the processor that is running through the driver just continues on its way?

In my experience, it is the driver’s and the controller’s joint responsibility to make sure that all outstanding IOs are completed back to the caller one way or the other before starting the format op. That means aborting whatever can be aborted and also making absolutely sure that no live request left over from a Namespace that has been removed ever gets completed back to the host after the Namespace has been removed.
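
One way to picture the drain-before-format approach argued for above is the following sketch. It is not how the OFA driver behaves (per Ray's description quoted below, the driver does not wait for I/O); `NamespaceGate` and all of its members are hypothetical names used only for illustration.

```python
class NamespaceGate:
    """Holds a format request until every outstanding I/O has been returned."""

    def __init__(self):
        self.offline = False
        self.format_pending = False
        self.outstanding = set()
        self.log = []

    def submit(self, tag):
        # New I/O is refused once the namespace has been taken offline.
        if self.offline:
            self.log.append(("rejected", tag))
            return False
        self.outstanding.add(tag)
        return True

    def complete(self, tag):
        # Called for normal completions and for aborted commands alike:
        # either way the request goes back to the caller exactly once.
        self.outstanding.discard(tag)
        self.log.append(("completed", tag))
        self._maybe_start_format()

    def request_format(self):
        self.offline = True
        self.format_pending = True
        self._maybe_start_format()

    def _maybe_start_format(self):
        # The format starts only when nothing is left in flight.
        if self.format_pending and not self.outstanding:
            self.format_pending = False
            self.log.append(("format_started", None))
```

With two commands in flight, `request_format()` records nothing until both have been passed to `complete()`, and any `submit()` in between is rejected.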

Thanks,

Judy






From: Robles, Raymond C [mailto:raymond.c.robles at intel.com]
Sent: Thursday, April 18, 2013 6:08 PM
To: Judy Brock-SSI; WONMOON CHEON; ???; nvmewin at lists.openfabrics.org
Subject: RE: RE: Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hi Judy,

Ahhh… I see now.  I didn’t answer that question below.  The format command is essentially built into the driver by a state machine.

When we receive the format command we immediately issue the *hot remove* command… but that is done inline. So, once we call Storport to kick off the enumeration, we simply return back to handling the format command in NVMeStartIoProcessIoctl().  The appropriate states are set along the way to indicate progress. Once the namespace is removed from the “OS view”, then the format is processed like any other command (via ProcessIo). The callback is set up to call NVMeIoctlFormatNVMCallback() and the variable “FormatNvmInfo->AddNamespaceNeeded” is set to TRUE so that on the completion side we remember to have the OS re-enumerate after we are done.  Once the NVM format completes, the callback is invoked in the completion DPC. Then on the completion side we issue Identify Controller and Identify Namespace so that our cached driver data for the namespace(s) formatted are up to date. In the last state, after getting the Identify Namespace struct, we’ll call *hot add* which is described below.

Note that at no point do we “wait” for any I/O to finish. Format is a dangerous command… especially via pass through IOCTL. We talked about this quite a bit in the beginning of developing this driver.  But essentially, if a format comes down for a namespace, any I/O outstanding to the controller (there won’t be anything that needs to be sent… I/O will either be at the device or on the CQ) will simply complete via normal operation or be aborted at the controller… but Storport won’t care because the SCSI target was already removed upon initially receiving the format command. Any I/O in the CQ will be completed and handled by Storport correctly.
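
The sequence described above can be sketched as a small state machine. The state names, the `FormatSequence` class and the action strings are hypothetical and chosen for illustration; only the order of steps and the add-namespace flag come from the description. Note that `start()` performs the hot remove and submits the format back to back, without waiting for any outstanding I/O.

```python
from enum import Enum, auto


class FormatState(Enum):
    IDLE = auto()
    HOT_REMOVE = auto()
    FORMAT_SENT = auto()
    IDENTIFY_CONTROLLER = auto()
    IDENTIFY_NAMESPACE = auto()
    HOT_ADD = auto()
    DONE = auto()


class FormatSequence:
    def __init__(self):
        self.state = FormatState.IDLE
        self.add_namespace_needed = False
        self.actions = []

    def start(self):
        # Submission side: hot remove is done inline, then the format is
        # submitted like any other command.
        self.state = FormatState.HOT_REMOVE
        self.actions.append("bus change: hide namespace")
        self.add_namespace_needed = True
        self.state = FormatState.FORMAT_SENT
        self.actions.append("submit Format NVM")

    def on_completion(self):
        # Completion side: called once for each command that finishes.
        if self.state is FormatState.FORMAT_SENT:
            self.state = FormatState.IDENTIFY_CONTROLLER
            self.actions.append("submit Identify Controller")
        elif self.state is FormatState.IDENTIFY_CONTROLLER:
            self.state = FormatState.IDENTIFY_NAMESPACE
            self.actions.append("submit Identify Namespace")
        elif self.state is FormatState.IDENTIFY_NAMESPACE:
            if self.add_namespace_needed:
                self.state = FormatState.HOT_ADD
                self.actions.append("bus change: expose namespace")
                self.add_namespace_needed = False
            self.state = FormatState.DONE
        else:
            raise RuntimeError(f"unexpected completion in {self.state}")
```

Calling `start()` followed by three `on_completion()` calls walks the sequence to `DONE`.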

Thanks,
Ray

From: Judy Brock-SSI [mailto:judy.brock at ssi.samsung.com]
Sent: Thursday, April 18, 2013 4:47 PM
To: Robles, Raymond C; WONMOON CHEON; ???; nvmewin at lists.openfabrics.org
Subject: RE: RE: Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hi Ray,
    [Ray wrote] Let me know if this answers your question.
I don’t think it does. What I wrote below I think was pretty much the same as what you wrote - or at least that was my intention ☺.
However, the piece I couldn’t explain (because I haven’t looked into it) is how the driver holds off the beginning of the actual format NVM operation until whatever old IOs that were already in progress for the namespace(s) before the format op request was received are completed back to the caller, aborted, or whatever - so there are no old live requests hanging around, still in the driver, before the format op begins.
In other words, does the driver hold off starting the format cmd till the outstanding IOs are completed? Or do we perhaps just drop them on the floor and let the OS figure out that those requests are permanently lost/gone due to the LUNs having disappeared (my guess is, the latter is what we do)? Or do we try to abort them all? And so on.
So again, we do understand how to get the OS to avoid sending new I/O requests to stale namespaces but how exactly are the old I/O reqs (those existing at the time the format request comes in ) handled?
At least that is my current question ☺
Thanks,
Judy


From: Robles, Raymond C [mailto:raymond.c.robles at intel.com]
Sent: Thursday, April 18, 2013 2:46 PM
To: Judy Brock-SSI; WONMOON CHEON; ???; nvmewin at lists.openfabrics.org
Subject: RE: RE: Handling pending commands when processing Format [changing from NVMe WG dist. list to OFA NVMe Windows Driver dist. list]

Hello Judy/Wonmoon,
Sorry for the late response. The “hot remove” state is just a state that we enter when the driver receives a Format command.  Basically, this state will remove the namespace(s) from the topology by calling StorPortNotification() with BusChangeDetected.  This will remove the “SCSI target/disk” associated with each namespace from the OS (because Storport will re-enumerate the controller and the driver will not expose the namespaces about to be formatted) so that the format can occur on the relevant namespaces.
By signaling Windows that the namespaces have been “removed”, all I/O will be stopped by the OS.  Then the format can complete.  Once the format is complete, we perform the opposite action to “hot add” the namespace back into the topology by calling StorPortNotification() with BusChangeDetected… only this time, we will surface the namespace(s) again when Storport re-enumerates.
No queues are deleted in this state, no memory is de-allocated, and nothing else changes about the namespace.  This is simply the first step (in a 3-step sequence) when formatting a namespace, as we cannot format a namespace while the OS is aware of its presence and could potentially be sending I/O to a stale namespace config (i.e. changing LBA/sector size).
Let me know if this answers your question.
Thanks,
Ray

From: Judy Brock-SSI [mailto:judy.brock at ssi.samsung.com]
Sent: Wednesday, April 17, 2013 6:21 AM
To: WONMOON CHEON; 강미경; technical at nvmexpress.org
Subject: RE: RE: Handling pending commands when processing Format

  >>Would you elaborate more about  the "hot-remove" state? In this state, do you mean that all the IO SQ/CQs are deleted? Or, waiting for completions of all the outstanding IOs?
The IO SQ/CQs are definitely NOT deleted. I would need to look more closely through the driver code to see how IOs previously sent to the namespaces which are marked as OFFLINE are handled/finished/quiesced.
There are other folks on this thread who no doubt have more history/intimate knowledge of this driver than I do who may answer that question more quickly than I can… also, perhaps this discussion should be moved to the OFA driver forum since it has turned into a driver-specific thread at this point.
What do folks think?
Judy

From: 천원문 [mailto:wm.cheon at samsung.com]
Sent: Wednesday, April 17, 2013 1:20 AM
To: Judy Brock-SSI; 강미경; technical at nvmexpress.org
Subject: Re: RE: Handling pending commands when processing Format

Hi Judy,

Would you elaborate more about  the "hot-remove" state? In this state, do you mean that all the IO SQ/CQs are deleted? Or, waiting for completions of all the outstanding IOs?

Thanks,
Wonmoon

------- Original Message -------
Sender : Judy Brock-SSI <judy.brock at ssi.samsung.com>
Date : 2013-04-17 16:39 (GMT+09:00)
Title : RE: Handling pending commands when processing Format

Hi,
I should clarify that it is not the Windows operating system – but rather the Windows OFA NVMe driver – that, from what I can see, does a “hot-remove” of all namespace(s) associated with a device before allowing a format operation to begin; “hot remove” is just the name for an internal state in the driver's Format NVM state machine.
Before beginning the actual format op, the driver internally marks all namespaces associated with the format operation “offline”. It then notifies the OS that there has been a “bus change” event (via an OS-specific API).
This in turn will cause the OS to rescan (re-enumerate) the “bus” (the pseudo SCSI bus, that is – we expose NVM namespaces as SCSI LUNs).
Since all the pertinent namespaces have been marked offline internally, the bus rescan won’t detect any valid SCSI LUNs for them (because the driver will not report any).
Hence from the OS point of view, any SCSI LUN(s) previously mapped to the namespace(s) to be formatted will have disappeared/will be unaddressable while the format operation is in progress.
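
The reporting side of this can be sketched as follows. The `Namespace` record, the `online` flag and the NSID-to-LUN mapping (`nsid - 1`) are assumptions made for illustration, not taken from the driver source.

```python
from dataclasses import dataclass


@dataclass
class Namespace:
    nsid: int
    online: bool = True


def lun_for(nsid):
    # Assumption: a simple NSID -> LUN mapping, used only in this sketch.
    return nsid - 1


def mark_offline_for_format(namespaces, nsids):
    # Done before the bus-change notification is raised.
    for ns in namespaces:
        if ns.nsid in nsids:
            ns.online = False


def report_luns(namespaces):
    # What the driver answers during the rescan: offline namespaces are
    # simply not reported, so their LUNs disappear from the OS view.
    return [lun_for(ns.nsid) for ns in namespaces if ns.online]
```

For example, with namespaces 1 and 2 present, marking namespace 2 offline leaves only LUN 0 in the rescan result.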
Judy

From: Judy Brock-SSI
Sent: Tuesday, April 16, 2013 8:22 PM
To: 'mkkang.kang at samsung.com'; technical at nvmexpress.org
Subject: RE: Handling pending commands when processing Format

Mikyeong,
I haven’t looked at the Linux driver but I know that Windows hot-removes all namespace(s) associated with a device before allowing a format operation to begin. And a namespace can’t be removed while there is IO outstanding to it so that answers your question regarding IOs being completed before format begins. It also answers the question about requests being sent to a namespace while format is in progress – can’t happen.
Thanks,
Judy



From: 강미경 [mailto:mkkang.kang at samsung.com]
Sent: Tuesday, April 16, 2013 7:07 PM
To: technical at nvmexpress.org
Subject: Handling pending commands when processing Format

Dear All,

A Format NVM command may change the Namespace repository, and it may be executed out of order like any other command. Therefore, a Format NVM command may affect other commands that are pending execution in the device, if any.

1) How does an OFA/Linux driver handle the 'Format NVM' command? Does the host make sure that all commands for a particular NSID are completed before sending the 'Format NVM' command?

2) If a host driver does not behave as in 1) above, how can a device handle other pending commands which were previously submitted in an SQ? It seems like we need an additional status code, e.g. Abort due to Namespace Format.

3) Let's suppose that a 'Format NVM' command is in progress. If the host driver sends subsequent commands to the namespace being formatted, should the device reply directly with 'Namespace Not Ready'?

[1.0e spec] If the device does not reply directly and the format operation takes a long time, then the I/O commands will time out and the host may issue a reset. But if commands are answered with 'Namespace Not Ready', the host may not issue the reset. Therefore, a direct reply seems to be needed.

[1.1 spec, ECN 001] There is a Format progress indicator. The host driver can check format progress at any time; therefore, there is no concern about a reset during the format command.
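
The reasoning in 3) can be made concrete with a toy host-side model. Everything here is an assumption for illustration: the `Status` values are symbolic rather than spec status codes, and the host policy (keep retrying on an explicit 'Namespace Not Ready', reset after a run of polls with no reply) is one plausible policy, not something the spec mandates.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    NAMESPACE_NOT_READY = "namespace_not_ready"
    NO_REPLY = "no_reply"  # device stays silent while the format runs


def host_outcome(replies, timeout_polls):
    """Return 'completed', 'reset', or 'pending' for a sequence of polls."""
    silent_polls = 0
    for status in replies:
        if status is Status.SUCCESS:
            return "completed"
        if status is Status.NAMESPACE_NOT_READY:
            # An explicit status shows the device is alive; retry later.
            silent_polls = 0
            continue
        silent_polls += 1
        if silent_polls >= timeout_polls:
            # The command timed out, so the host escalates to a reset.
            return "reset"
    return "pending"
```

With a timeout of three polls, three silent polls end in a reset, while three 'Namespace Not Ready' replies followed by a success end in a normal completion.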


Best Regards,
Mikyeong Kang


Kang MiKyeong

Flash Memory Planning/Enabling Group, Memory Div.
SAMSUNG ELECTRONICS, Co., Ltd..
Phone: 82-31-208-3857
Mobile: 82-10-9369-0177
E-mail: mkkang.kang at samsung.com



