[nvmewin] Learning Mode Patch for Review - UPDATED

Chang, Alex Alex.Chang at idt.com
Tue Mar 27 08:42:23 PDT 2012


Hi Paul,

Yes, that's what I meant. The sole purpose of the learning mode you added is to find out the mapping between vectors and cores. If we allocate the queue memory based on NUMA locality for each core, and then delete/re-create the queues after learning completes, NUMA locality is still preserved without re-allocating the queue memory.

Thanks,
Alex

________________________________
From: Luse, Paul E [mailto:paul.e.luse at intel.com]
Sent: Monday, March 26, 2012 7:18 PM
To: Chang, Alex; nvmewin at lists.openfabrics.org
Subject: RE: Learning Mode Patch for Review - UPDATED

I guess we could do that: keep the queue-core association and change just the IV for the CQ. Is that what you're suggesting? For example, below is what the mapping might look like before learning, then after learning with how I have it coded vs. what you are describing. If I understand you correctly, I do like that way better; please take a look below and confirm for me.

BEFORE
Core    QP    MSIX    NUMA
0       1     1       0
1       2     2       1
2       3     3       2

WHAT I'M DOING
Core    QP    MSIX    NUMA
0       8     8       0
1       1     1       1
2       2     2       2

WHAT YOU ARE SUGGESTING
Core    QP    MSIX    NUMA
0       1     8       0
1       2     1       1
2       3     2       2
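
To make the difference concrete, here is a minimal sketch of what the two approaches do to core 0's mapping entry in the example above; the structure and field names are illustrative only, not the driver's actual core table definition.

/* Hypothetical per-core mapping entry; names are illustrative only. */
typedef struct _CORE_MAP_ENTRY {
    unsigned short SubQueue;   /* submission queue this core uses         */
    unsigned short CplQueue;   /* completion queue this core uses         */
    unsigned short MsgId;      /* MSI-X vector the completion queue fires */
    unsigned short NumaNode;   /* node the queue memory was allocated on  */
} CORE_MAP_ENTRY;

/* What the patch does today: point core 0 at the queue pair (8) whose
 * vector was observed to land on core 0.  QP 8's memory may live on the
 * wrong NUMA node, hence the delete/re-create pass. */
static void RemapByQueue(CORE_MAP_ENTRY *core0)
{
    core0->SubQueue = 8;
    core0->CplQueue = 8;
    core0->MsgId    = 8;
}

/* What the suggestion amounts to: core 0 keeps QP 1 (and its NUMA-local
 * allocation) and only the CQ's interrupt vector changes to the learned
 * one. */
static void RemapByVector(CORE_MAP_ENTRY *core0)
{
    core0->MsgId = 8;
}

Either way the table ends up steering completions back to the submitting core; the difference is whether the queue memory has to be re-allocated to stay NUMA-local.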




From: Chang, Alex [mailto:Alex.Chang at idt.com]
Sent: Monday, March 26, 2012 6:41 PM
To: Luse, Paul E; nvmewin at lists.openfabrics.org
Subject: RE: Learning Mode Patch for Review - UPDATED

Hi Paul,

In the changes to nvmeStat.c, you mentioned:
NVMeRunningWaitOnReSetupQueues().  Delete the queues (frees the mem also), set the flag to indicate that we need to allocate them again and then jump back to the state where we allocate them.
I am not sure it's necessary to free the allocated IO queue memory. Isn't it enough to adjust the mapping between cores and queues after re-creating the queues?

Thanks,
Alex



________________________________
From: nvmewin-bounces at lists.openfabrics.org [nvmewin-bounces at lists.openfabrics.org] on behalf of Luse, Paul E [paul.e.luse at intel.com]
Sent: Monday, March 26, 2012 5:52 PM
To: nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Learning Mode Patch for Review - UPDATED
Quick testing update: I ran into a QEMU bug that I've since fixed, which let me continue testing. There will be a few small tweaks to the package I sent out, but nothing major. I have to head to the OFA developer workshop for the next two days, so I will pick this up again Thursday. I won't schedule the review until I get everything 100% tested.

Thx
Paul

From: Luse, Paul E
Sent: Friday, March 23, 2012 4:25 PM
To: nvmewin at lists.openfabrics.org
Subject: Learning Mode Patch for Review - UPDATED

Here are the changes, in addition to the ones below, to accommodate the deletion/recreation of queues as part of learning mode. It looks like a lot only because I touched a lot of functions for cleanup in a few areas :)

I'm not totally done testing: I only have the final queue deletion to step through, I'm having a small challenge with QEMU, and my proto HW doesn't support deleting queues; I'll get it figured out soon though. I'll set up a review for late next week; if anyone would like more time, let me know. There's no blazing hurry on this, but I do think we want this to be our first release (so we're actually NUMA optimized).

Pw is ofanvme123

nvmeStd.h

-        Upped the poll mode timeout from 1 to 3 and renamed it, as it's now used for more than just crashdump; it's a generic polling retry value

-        New error states; I also repurposed this from an init-state-machine-specific set of errors to general errors that may throw the driver into a 'shutdown state' as well. The shutdown routine was already using this enum and the driver state to indicate shutdown, but the names/usages were all related to the init state machine, which I think could be confusing for new folks reading the code

-        Added 2 new init states, learning and re-setting up the queues

-        A few fairly obvious changes to support logic changes in the C code described later

-        Fixed the device state timeout counter; it was a UCHAR, which is not big enough to hold our timeout value



nvmeInit.h

-         New parm for the alloc queue function; we need to tell it how many entries to put on the queue, because we don't need large queues for learning and it's a waste to alloc/free larger-than-needed contiguous chunks of memory if we use one size

nvmeStd.c

-        By default I set the # of queues to create for learning to the previously defined minimum for the driver

-        Multiple changes where calling NVMeAllocQueues() to pass the new parm, # of queue entries

-        Multiple changes where I renamed StartState to DriveState to reflect how we're using it anyway

-        In NVMeInitialize(), removed the init of the number-of-queues-allocated elements in queue info; it's overwritten later regardless

-        Some refactoring in both the ISR and DPC around the logic where we decide whether we need to call StorPortNotification or not; with the new condition added (whether the caller wants synchronous IO or not), it was too confusing to try to work it into the previous set of conditions

-        This is where we clear the srbExt synchronous flag, which, for sync IO, tells the caller (more on this later) that the IO has now completed (see the sketch after this list)

-        Fixed the passiveInit timeout handler; it was relying on the timed drive state machine thread to switch states, and if that thread hung this would never happen, so we'd spin forever. So I made passiveInit do the time-based timing and removed it from the init machine (we don't need to time in both places)

-        Removed the StartIo lock from the learning mode condition in the ISR/DPC; it's not needed until we implement concurrent_channels
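
As a rough illustration of the completion-side decision mentioned above: a synchronous request is not completed back through StorPortNotification; instead the flag in the srbExt is cleared so the waiting issuer can proceed. The type and field names below are placeholders, not the patch's actual identifiers.

/* Sketch only: how the ISR/DPC finishes a request depending on whether
 * the caller asked for synchronous IO. */
typedef struct {
    volatile int SyncPending;   /* set before issue, cleared here on completion */
} SKETCH_SRB_EXT;

static void CompleteRequest(SKETCH_SRB_EXT *srbExt, int callerWantsSyncIo)
{
    if (callerWantsSyncIo) {
        srbExt->SyncPending = 0;          /* releases the spinning issuer */
    } else {
        /* normal async path: StorPortNotification(RequestComplete, ...) */
    }
}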

nvmeStat.c

-        Multiple changes where I renamed StartState to DriveState to reflect how we're using it anyway

-        Addition of 2 new states in NVMeRunning()

-        State handler added: NVMeRunningWaitOnLearnMapping(). If we're in sub-optimal conditions to begin with, or if we've already done this state, we skip the learning and re-setup-queues states. Also, if no namespace exists there's no way to learn, so we just skip init-time learning and go non-NUMA-optimized; the existing learning logic will still pick the right queue and adjust the mapping tables (so we still learn as the host sends IOs), we just won't re-setup the queues. The completion side will go back to this state if we have more IOs to learn. FLUSH is a mandatory command (and doesn't require a data transfer), so it was the logical choice here; however, I made it a read of 1 block in case IHVs have issues with flush (either not supporting it or not wanting a flush on every boot for whatever reason). A rough sketch of both new handlers follows this list.

-        State handler added: NVMeRunningWaitOnReSetupQueues(). Delete the queues (which frees the memory as well), set the flag to indicate that we need to allocate them again, and then jump back to the state where we allocate them. Learning won't happen again because the learning state handler will recognize that it's already been completed once

-        Moved timeout check to passive init thread
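
Here is a minimal, self-contained sketch of the two new handlers described above; all type, field, and helper names are hypothetical stand-ins, not the driver's real definitions.

/* Sketch of the two new init states. */
typedef enum {
    StateLearnMapping,
    StateReSetupQueues,
    StateSetupIoQueues,      /* stands in for the existing IO-queue states */
    StateStartComplete
} SKETCH_STATE;

typedef struct {
    unsigned LearnedCores;       /* cores whose vector we've observed */
    unsigned ActiveCores;        /* cores participating in learning   */
    int      SubOptimalConfig;   /* shared vector or single queue     */
    int      NamespacePresent;
    SKETCH_STATE NextState;
} SKETCH_CTX;

/* NVMeRunningWaitOnLearnMapping(): decide whether to learn at all; if so,
 * issue a small read (1 block rather than FLUSH) whose completion drives
 * the vector/core discovery. */
static void RunWaitOnLearnMapping(SKETCH_CTX *ctx)
{
    if (ctx->SubOptimalConfig || !ctx->NamespacePresent ||
        ctx->LearnedCores >= ctx->ActiveCores) {
        ctx->NextState = StateStartComplete;   /* skip learning and re-setup */
        return;
    }
    /* IssueOneBlockRead(ctx, ctx->LearnedCores);  the completion path
     * re-enters this state until every core has been observed. */
    ctx->NextState = StateLearnMapping;
}

/* NVMeRunningWaitOnReSetupQueues(): delete the learning-sized queues
 * (freeing their memory) and loop back so the allocation states run again
 * with the learned core/vector mapping. */
static void RunWaitOnReSetupQueues(SKETCH_CTX *ctx)
{
    /* DeleteIoQueuesSync(ctx);  frees the queue memory as well */
    ctx->NextState = StateSetupIoQueues;
}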

nvmeSnti.c

-        renamed StartState to DriveState to reflect how we're using it anyway

nvmeIO.c

-        Changes here support being able to do sync IO. The shutdown routines, which I reused for the learning mode re-setup, needed sync IO but it hadn't been added yet, so this is technically a change that was needed anyway

-        moved the NVMeIssueCmd() ret value check to after the call (was in the wrong place)

-        after we issue the command, we handle polling for crashdump same as before (where there are no INTs)

-        After we handle crashdump, we now also poll here for sync IO to complete; but because INTs are enabled, we don't poll by calling the ISR, we just wait for the ISR to fire and clear the flag in the srbExt that's set if the IO is sync
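
The wait just described is roughly the following; the flag name, delay routine, and timeout value below are placeholders, not the patch's actual identifiers.

/* Sketch only: the issuer polls the srbExt flag with a bounded stall
 * rather than calling the ISR (interrupts are live, unlike the crashdump
 * polling path). */
#define SYNC_IO_TIMEOUT_US   (3 * 1000 * 1000)   /* illustrative only */
#define SYNC_IO_POLL_STEP_US 100

typedef struct {
    volatile int SyncPending;   /* set before issue, cleared by the ISR/DPC */
} SKETCH_SRB_EXT;

extern void MicroDelay(unsigned usec);  /* stands in for a storport stall call */

static int WaitForSyncCompletion(SKETCH_SRB_EXT *srbExt)
{
    unsigned waited = 0;

    while (srbExt->SyncPending) {
        if (waited >= SYNC_IO_TIMEOUT_US)
            return 0;                   /* timed out; caller treats as fatal */
        MicroDelay(SYNC_IO_POLL_STEP_US);
        waited += SYNC_IO_POLL_STEP_US;
    }
    return 1;                           /* completion side cleared the flag */
}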

nvmeInit.h

-        function proto change for new parm

nvmeInit.c

-        NVMeAllocQueues() changed to accept a parm for # of entries to alloc

-        Multiple changes where I renamed StartState to DriveState to reflect how we're using it anyway

-        New callback: NVMeDeleteQueueCallback.  Simply decrements the # of deleted queues if no error

-        In the init callback for the NVMeWaitOnIoSQ state, we now check how many cores we've learned and skip learning if it's done; otherwise we move to the learning state

-        In the init callback, new case for NVMeWaitOnLearnMapping: if we have more cores to learn, we stay in this state; otherwise we move on to remapping/setting up the queues. There's no new completion state for re-setting up the queues; things will then end in NVMeWaitOnIoSQ (see the sketch after this list)

-        Added missing default handler

-        In both delete queues functions, changes to support making them synchronous

-        In NVMeAllocIoQueues() we set the number of queues we want based on learning mode or not
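
A rough sketch of the new callback case described above; only the state names NVMeWaitOnLearnMapping and NVMeWaitOnIoSQ are taken from the patch, everything else (including the re-setup state name) is assumed.

/* Hypothetical fragment of the init completion callback. */
typedef enum {
    NVMeWaitOnLearnMapping,
    NVMeWaitOnReSetupQueues,    /* assumed name for the re-setup state */
    NVMeWaitOnIoSQ
} SKETCH_INIT_STATE;

typedef struct {
    unsigned          LearnedCores;
    unsigned          ActiveCores;
    SKETCH_INIT_STATE NextState;
} SKETCH_INIT_CTX;

static void OnLearnMappingCallback(SKETCH_INIT_CTX *ctx)
{
    if (ctx->LearnedCores < ctx->ActiveCores) {
        ctx->NextState = NVMeWaitOnLearnMapping;   /* more cores to learn */
    } else {
        /* All cores learned: delete/re-create the queues.  There is no new
         * completion state for the re-setup; it ends in NVMeWaitOnIoSQ. */
        ctx->NextState = NVMeWaitOnReSetupQueues;
    }
}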



Sources

-        Removed QEMU define as we only used it for some asserts that are no longer needed w/learning mode



Also, throughout you'll find a new call, NVMeDriverFatalError(), that replaces a bunch of places where we were setting an error code, setting a state, and logging an error. There's one parm to indicate whether it's being called from the init state machine or not, so it knows whether to fire off the arbiter. That way this single call can be used anywhere to flag a fatal error, log it in the system log, and change the DriverState to avoid accepting new IO. This is clearly only for cases where we want to stop accepting IO or simply cannot.
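
Roughly, the helper consolidates the three steps into one call; the sketch below uses assumed names and state values, and the direction of the arbiter check is a detail of the real patch.

/* Sketch of consolidating "set error code + set state + log" into one
 * helper. */
typedef struct {
    int DriverState;        /* stands in for the driver state enum */
    int FatalErrorCode;
} SKETCH_FATAL_CTX;

#define SKETCH_STATE_FATAL  (-1)

extern void LogToSystemEventLog(int code);      /* placeholder for the storport log call */
extern void KickInitStateArbiter(SKETCH_FATAL_CTX *ctx);

static void DriverFatalError(SKETCH_FATAL_CTX *ctx, int errorCode, int fromInitStateMachine)
{
    ctx->FatalErrorCode = errorCode;
    ctx->DriverState    = SKETCH_STATE_FATAL;   /* new IO is refused from here on */
    LogToSystemEventLog(errorCode);

    /* The single parm tells the helper whether the init state machine is
     * the caller, which decides whether its arbiter needs to be fired. */
    if (fromInitStateMachine)
        KickInitStateArbiter(ctx);
}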




________________________________
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E
Sent: Friday, March 16, 2012 9:42 AM
To: nvmewin at lists.openfabrics.org
Subject: [nvmewin] ***UNCHECKED*** Learning Mode Patch for Review
Importance: High
Note that I merged Alex's patch in here as it's still pending with Ray, but it's only a half-dozen lines of code, so rather than have Ray create two tags and do two merges I just rolled them together so Ray can process one patch. Normally I wouldn't recommend this, but the pending patch is so small it just makes sense.

Password is ofanvme123.  I can schedule a call to walk through any of this if anyone would like.

Learning Mode:  Pretty easy, it's only enabled if we're in an optimal config wrt cores/queues/vectors. Assume we have N processors; it works like this: on startup, the first N IOs are sent to queues 1..N sequentially. Each queue is created with a matching MSI ID, so we are assured of hitting every queue and every message ID by incrementing the queue number for the first N IOs regardless of the submitting core. On the completion side we simply look at the core that we completed on and update the table for the completing core, so that the correct queues are used the next time an IO is submitted on that core.
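
To make the submission side concrete, here is a minimal sketch; the table layout and names are simplified placeholders, not the driver's actual structures.

/* Learning-mode sketch, submission side: for the first N IOs ignore the
 * per-core table and walk queues 1..N so every queue/vector fires exactly
 * once. */
#define SKETCH_MAX_CORES 64          /* illustrative bound */

typedef struct {
    unsigned SubQueue;               /* filled in by the completion side */
    unsigned CplQueue;
} SKETCH_CORE_ENTRY;

typedef struct {
    SKETCH_CORE_ENTRY CoreTable[SKETCH_MAX_CORES];
    unsigned NumCores;               /* N                           */
    unsigned LearnedCores;           /* completions observed so far */
} SKETCH_DEV_EXT;

static unsigned PickSubmissionQueue(SKETCH_DEV_EXT *devExt, unsigned submittingCore)
{
    if (devExt->LearnedCores < devExt->NumCores)
        return devExt->LearnedCores + 1;    /* queue numbers are core+1 by convention */
    return devExt->CoreTable[submittingCore].SubQueue;  /* learned mapping */
}

The completion-side table update (recording which core each vector actually landed on) is sketched under the IoCompletionDpcRoutine() notes further below.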

Testing: (in all cases I confirmed functionality and via iometer and xperf I confirmed core load balancing)

-         Chatham (DT with 8 cores):  Win8 Server public beta, 2008R2

-         QEMU (configured for 2 cores): Win7-64

-         Tools:  The standard stuff:  format, stress, iometer, SCSI compliance, shutdown/restart

Changes:
nvmeStd.h:

-         Comment changes as needed

-         Removed the logcalmode element from the msgTable as it's no longer used

-         Removed the DBG directive from the procNum element; it's now required for learning mode

-         Removed the proto for NVMeMsixMapCores, not used anymore (combined the MSI and MSI-X routines, more later)



nvmeIo.c:

-         Removed the DBG directive from the procNum element; it's now required for learning mode



nvmeInit.c

-         Removed NVMeSwap(),NVMeQuickSort(),NVMeMsixMapCores() as they're no longer used

-         Changes to NVMeMsiMapCores():

o   Set the initial core table MsgID to the CplQ number; this is needed so that we can determine the CplQ from the MsgID during learning mode while in the DPC/ISR

o   The other changes are just code simplifications

-         Changed NVMeCompleteResMapTbl() to use NVMeMsiMapCores() for either MSI or MSIX

-         Changed NVMeMapCore2Queue():

o   We now check if we're in learning mode; if not, we simply look up the queue number from the core table

o   If we're in learning mode (based on a simple count of how many cores we've learned vs. the total available cores), then we use the core number that we're on (that we're learning) + 1; the +1 is because all queue numbers are core+1 by our convention

-         Change in NVMeAllocIoQueues() to effectively disable learning mode if we only have 1 queue (it makes no sense and actually causes problems for learning mode if we don't do this).  We disable it by pretending that we've already learned all the cores

nvmeStd.c:

-         In NVMePassiveInitialize(), disable learning mode if we're sharing one MSI over multiple queues, for the same reasons as when we have one queue

-         In NVMeInitialize():

o   Enable the DPC perf opt per the Msft recommendation to steer storport completion DPCs back to the submitting core

-         In NVMeStartIo(): bug fix unrelated to learning mode, but I found it while debugging learning mode (via BSOD). You can't return FALSE from this function per the MSDN docs; always return TRUE

-         In IoCompletionDpcRoutine(), and the same changes in the ISR when it's enabled for completions instead:

o   Merged Alex's bugfixes in with my changes

o   Removed the DBG related code for checking core affiliation

o   Where we decide which queues to check, I set a Boolean that indicates whether we're learning; it's forced to FALSE if we're in shared mode, because we disable learning mode during init in that case

o   If we're not shared, the learning Boolean is set based on how many cores we've learned and whether the MsgId is > 0, as MsgId 0 is admin and we exclude that from learning mode

o   If we're not learning, then we only search the queue specified in the MMT

o   If we are learning, we know the queue # is the same as the MsgId because we init'd it that way

o   The 'learning' happens in a new conditional just after we determine we have an srbExt (see the sketch after this list).  It works as follows:

*  Grab the StartIo lock, as we're sharing the core table with StartIo and during learning mode we're not yet assured that start/complete are on the same core.  Note the lock is only taken on IOs during learning mode (the first few IOs)

*  Look up the CT entry for the core that we're completing on and set its queue numbers to the queue number that was associated with the IO that just completed.  This assures that the next lookup in the table for this core # will complete on this same core.

*  Increment our learning counter which will direct the next IO to the next core

-         Unrelated changes to NVMeProcessIoctl(): a few changes were made here, as the routine assumed every IOCTL we'd get would be a PT IOCTL, making it difficult for vendors to add additional private IOCTLs. I just moved things around a bit, as we had to add one for our product-specific code. No other changes here other than placing the IOCTL-specific code in the correct case block
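
Putting the completion-side steps above together, a minimal sketch of the learning branch in the DPC/ISR; the lock, table, and helper names are stand-ins for whatever the patch actually uses.

/* Sketch only: learning branch on the completion path, taken while
 * LearnedCores < NumCores and MsgId > 0 (MsgId 0 is the admin queue and
 * is excluded from learning). */
#define SKETCH_MAX_CORES 64

typedef struct {
    unsigned SubQueue;
    unsigned CplQueue;
} SKETCH_CORE_ENTRY;

typedef struct {
    SKETCH_CORE_ENTRY CoreTable[SKETCH_MAX_CORES];
    unsigned NumCores;
    unsigned LearnedCores;
} SKETCH_DEV_EXT;

extern void     AcquireStartIoLock(SKETCH_DEV_EXT *devExt);
extern void     ReleaseStartIoLock(SKETCH_DEV_EXT *devExt);
extern unsigned CurrentProcessor(void);   /* stands in for the kernel's current-CPU query */

static void LearnOnCompletion(SKETCH_DEV_EXT *devExt, unsigned msgId)
{
    unsigned completingCore;

    if (msgId == 0 || devExt->LearnedCores >= devExt->NumCores)
        return;                            /* admin queue, or learning done */

    /* During learning the queue number equals the MsgId (the core table
     * MsgIDs were initialized to the CplQ numbers), so msgId is the queue
     * whose vector just landed on this core. */
    AcquireStartIoLock(devExt);            /* core table is shared with StartIo */
    completingCore = CurrentProcessor();
    devExt->CoreTable[completingCore].SubQueue = msgId;
    devExt->CoreTable[completingCore].CplQueue = msgId;
    devExt->LearnedCores++;                /* the next IO exercises the next queue */
    ReleaseStartIoLock(devExt);
}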


____________________________________
Paul Luse
Sr. Staff Engineer
PCG Server Software Engineering
Desk: 480.554.3688, Mobile: 480.334.4630
