[nvmewin] Issues Found

Chang, Alex Alex.Chang at idt.com
Mon Sep 24 14:11:05 PDT 2012


Hi Paul,

As for the 2nd item I brought up, I believe we have set up a mapping between cores and queues before learning:
Core#   Queue#
0       1
1       2
2       3
...
While learning, we decide which queue to use based on the above mapping. When commands complete, however, the value of MsgID depends on the APIC settings, which is exactly what learning is meant to discover. In other words, MsgID is not necessarily equal to QID. After fixing Item #1, I have seen the driver state machine fail with the Dbg build due to a timeout in the learning state.
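
Just to illustrate what I mean (the structure and names below are made up for the example, not the actual driver code):

    /* Before learning: queue (core + 1) is simply paired with core, per the
     * table above. Nothing here determines which MSI-X message ID the APIC
     * will actually deliver when that queue's interrupt fires. */
    #define NUM_CORES 4

    typedef struct _CORE_MAP {
        unsigned short SubQueueId;   /* submission queue assigned to this core */
        unsigned short CplQueueId;   /* completion queue assigned to this core */
        unsigned short MsiMsgId;     /* learned MSI-X message ID, unknown so far */
    } CORE_MAP;

    static CORE_MAP CoreTbl[NUM_CORES];

    static void SetupPreLearningMapping(void)
    {
        unsigned short core;
        for (core = 0; core < NUM_CORES; core++) {
            CoreTbl[core].SubQueueId = (unsigned short)(core + 1);
            CoreTbl[core].CplQueueId = (unsigned short)(core + 1);
            CoreTbl[core].MsiMsgId   = 0;  /* filled in only after learning */
        }
    }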

Alex
 

-----Original Message-----
From: Luse, Paul E [mailto:paul.e.luse at intel.com] 
Sent: Monday, September 24, 2012 1:46 PM
To: Luse, Paul E; Chang, Alex
Cc: nvmewin at lists.openfabrics.org
Subject: RE: [nvmewin] Issues Found

So I found myself with some time here and was able to prepare my next patch. I won't send it out until I have a chance to test it, but I wanted to run through all the changes real quick before responding.

The changes I'll be sending out are primarily focused on sharing the admin queue's MSI-X vector with one other queue (which one depends on learning).  This came about because of a bug report from someone running on a 32-core system: in that case we ask for 33 vectors, and when we don't get them we end up dropping to 1 and sharing everything.  This works, of course, but under heavy IO it causes DPC watchdog timeouts simply because of the amount of time we spend looking through all the queues processing IOs.  The load in question was 32 workers (iometer) at 64 IO depth with 512B reads.  There are several ways we could address this, but the one I'm suggesting as a generic improvement is to have the admin queue share with another queue, so that we require an even number of vectors and can readily support 32 cores, which is a pretty common config.
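
To put rough numbers on it (illustrative only, not the actual patch):

    /* Illustrative only: with the admin queue sharing an MSI-X vector with
     * one I/O queue, a 32-core system needs exactly 32 vectors instead of 33. */
    static unsigned long VectorsNeeded(unsigned long numCores, int adminShares)
    {
        if (adminShares) {
            return numCores;      /* admin CQ piggybacks on one I/O vector */
        }
        return numCores + 1;      /* one per core plus a dedicated admin vector */
    }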

In the process of putting this together I ran into the item you mention below, Alex, so I have already fixed that.  I had not previously tested on a system with multiple NUMA nodes, but clearly with that LOC in there we don't init enough queues: we set up 32 all right, but we set up the same 16 twice.  So this is fixed in my patch.
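
Roughly, the change is along these lines (sketch only, not the literal patch):

    QueueID = 0;                                  /* initialize once, not per node */
    for (Node = 0; Node < pRMT->NumNumaNodes; Node++) {
        pNNT = pRMT->pNumaNodeTbl + Node;
        for (Core = pNNT->FirstCoreNum; Core <= pNNT->LastCoreNum; Core++) {
            /* ... allocate one I/O queue pair per core, consuming QueueID++,
             * so IDs keep increasing across NUMA nodes instead of restarting ... */
        }
    }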

On your second question (good question, BTW), this is one of the reasons why learning mode works.  We know which queue to look in only because we are still in learning mode and we set the queues up so that we can count on QID == MsgId.  Remember, we're learning the association between MSI-X vector and completing core, then updating the tables and deleting/recreating the CQs; once learning is done we use the table, but until then we count on how we set things up.
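
In other words, the learning flow looks roughly like this (simplified sketch with illustrative names, not the driver's actual code):

    /* During learning every CQ was created so that its QID equals its MSI-X
     * message ID, so the CQ drained for a given MsgID is known, and the first
     * completion on that vector tells us which core the vector lands on. */
    #define MAX_CORES 64
    static unsigned short LearnedMsgId[MAX_CORES];

    static void LearnOneCompletion(unsigned long MsgID, unsigned long completingCore)
    {
        LearnedMsgId[completingCore] = (unsigned short)MsgID;

        /* Once every core has been seen, the CQs are deleted and recreated so
         * that each core's queue uses the vector that actually interrupts that
         * core; from then on the DPC uses the table rather than QID == MsgId. */
    }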

On your 3rd question: we didn't write or test that code.  I forget who added it, but I would consider it untested and a prime candidate for anyone wanting to contribute :)  We at Intel will be looking more closely at that code in the coming months.

Anyway, hope that answers your questions.  I'll send out my patch tonight; it includes the fix for the first item below, some additional debug prints (via a compile switch) to dump our PRP info as you go, a few additional asserts, etc.  It's not very big.  After that we'll be coming with a series of AER fixes.

Thx
Paul


-----Original Message-----
From: nvmewin-bounces at lists.openfabrics.org [mailto:nvmewin-bounces at lists.openfabrics.org] On Behalf Of Luse, Paul E
Sent: Monday, September 24, 2012 11:59 AM
To: Chang, Alex
Cc: nvmewin at lists.openfabrics.org
Subject: Re: [nvmewin] Issues Found

Alex

I'll cover your questions this evening; plus I have some non-bug-fix changes in some of these areas anyway.

Thx
Paul

Sent from my iPhone

On Sep 24, 2012, at 9:46 AM, "Chang, Alex" <Alex.Chang at idt.com> wrote:

Hi Paul,

When testing the latest patch I added, I came across a couple of issues in the driver:
1. In the patch you sent out on July 13 (later tagged as misc_bug_fixes_and_enum), within the NVMeAllocIoQueues function, you reset QueueID at the top of each NUMA node loop iteration, as below:
        for (Node = 0; Node < pRMT->NumNumaNodes; Node++) {
            pNNT = pRMT->pNumaNodeTbl + Node;
            QueueID = 0;
            for (Core = pNNT->FirstCoreNum; Core <= pNNT->LastCoreNum; Core++) {
As a result, only as many I/O queues as there are cores in a single NUMA node get allocated for the entire system. I wonder why?

2. When the driver is in the learning phase, where it tries to find out the mappings between cores and MSI vectors, IoCompletionDpcRoutine limits the pending-completion-entry check to a single queue chosen by MsgID:
        if (!learning) {
            firstCheckQueue = lastCheckQueue = pMMT->CplQueueNum;
        } else {
            firstCheckQueue = lastCheckQueue = (USHORT)MsgID;
        }
Since it's still in the learning phase, shouldn't it look through every created completion queue to find out the mapping?
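
Something along these lines is what I have in mind for the learning phase (untested sketch; NumCreatedIoQueues stands for whatever the driver uses to track the number of created I/O CQs):

        if (!learning) {
            firstCheckQueue = lastCheckQueue = pMMT->CplQueueNum;
        } else {
            /* Still learning: the MsgID that fires depends on APIC routing, so
             * scan every created I/O completion queue rather than indexing by it. */
            firstCheckQueue = 1;
            lastCheckQueue  = (USHORT)NumCreatedIoQueues;
        }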

3. In NVMeInitialize, the driver calls StorPortInitializePerfOpts in both the normal and the Crashdump/Hibernation cases. The routine returns failure, and I wonder whether it makes sense to call it in the Crashdump/Hibernation case at all.
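
For example, a guard like this is what I was thinking of (sketch only; pAE and PerfConfigData stand for the existing locals the routine already uses, and ntldrDump for whatever flag marks the dump/hibernate path):

        /* Skip the perf-opts query when running in the Crashdump/Hibernation
         * path, where it is not expected to succeed. */
        if (pAE->ntldrDump == FALSE) {
            Status = StorPortInitializePerfOpts(pAE, TRUE, &PerfConfigData);
            if (Status != STOR_STATUS_SUCCESS) {
                /* Perf options are an optimization; log it and keep going. */
            }
        }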

Thanks,
Alex
