[nvmewin] Full system lockup w/ all NVMe drivers

Mahmoud Al-Qudsi mqudsi at neosmart.net
Fri Mar 24 14:50:02 PDT 2017


Hello list,

I’m writing here after attempting to use the openfabrics nvmewin driver on two different 7th-gen Intel machines (CM238 chipsets) under multiple, clean installations of Windows 10. In each case, I end up at a point where all disk access locks up and the machine slowly grinds to a halt (without a BSOD) as requests for unpaged data from the disk pile up.

Generous use of !storagekd.* in WinDbg reveals that the last of the pending requests to the disk is a RESET LUN SRB; previous commands failed with a nondescript SRB failure code 0x04, indicating a generic HBA failure without a specific error code. The sense data for the failed requests is all 0s.
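For anyone reading along, here is a minimal sketch of how an SRB status byte like the 0x04 above breaks down. The numeric values are the SRB_STATUS_* constants as I understand them from the Windows storage headers (srb.h); the table is only a subset, and the helper name decode_srb_status is my own and not part of any driver or debugger API.

```python
# Subset of SRB_STATUS_* codes (values as defined in the Windows srb.h header).
SRB_STATUS_NAMES = {
    0x00: "SRB_STATUS_PENDING",
    0x01: "SRB_STATUS_SUCCESS",
    0x02: "SRB_STATUS_ABORTED",
    0x04: "SRB_STATUS_ERROR",
    0x05: "SRB_STATUS_BUSY",
    0x06: "SRB_STATUS_INVALID_REQUEST",
    0x08: "SRB_STATUS_NO_DEVICE",
    0x09: "SRB_STATUS_TIMEOUT",
    0x0A: "SRB_STATUS_SELECTION_TIMEOUT",
    0x0E: "SRB_STATUS_BUS_RESET",
}

# The two high bits are flags, not part of the status code itself.
SRB_STATUS_QUEUE_FROZEN = 0x40
SRB_STATUS_AUTOSENSE_VALID = 0x80


def decode_srb_status(raw):
    """Split a raw SrbStatus byte into its code name and flag bits.

    Hypothetical helper for illustration only.
    """
    flags = SRB_STATUS_QUEUE_FROZEN | SRB_STATUS_AUTOSENSE_VALID
    code = raw & 0xFF & ~flags  # same masking the SRB_STATUS() macro performs
    return {
        "code": SRB_STATUS_NAMES.get(code, "unknown (0x%02X)" % code),
        "queue_frozen": bool(raw & SRB_STATUS_QUEUE_FROZEN),
        "autosense_valid": bool(raw & SRB_STATUS_AUTOSENSE_VALID),
    }


if __name__ == "__main__":
    # The status seen on the failed requests in this report.
    print(decode_srb_status(0x04))
```

Under that reading, 0x04 is SRB_STATUS_ERROR with neither flag set. The autosense-valid bit being clear is at least consistent with the sense buffers being all zeros.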

The same physical disk (a Samsung 960 Pro) work(s/ed) just fine in a different machine (6th-gen Xeon, CM236 chipset). The hang occurs without fail under random write stress testing, but it also happens when the machine is left unattended for a few hours.

I’ve attempted to disable PCI-E link management power savings, automatic shutdown of the hard disk, etc. in the power savings options, but all to no avail.

I’m really not sure what to try next. The testing has been primarily under Windows 10 RS2 betas. A clean install of build 15011 did not trigger the failure case, but perhaps it was not tested long enough.

This system lockup occurs with the Microsoft, Samsung, Intel, and now OF nvmewin drivers. No 3rd party upper/lower filters are installed.

I’ve attempted to track down the problem by logging all storport.sys/miniport commands via the performance monitor, but unfortunately it absolutely refuses to use unbuffered writes (the smallest buffer size option is 1 KB, and the shortest flush interval that can be configured is 1 second). I can visibly see the USB drive it is logging to blink as writes are flushed after disk access locks up, but the resulting ETL still does not contain the very last requests to the disk: it shows no timeouts, no retries, and no final LUN reset.

This error occurs even in safe mode.

I am genuinely at my wits’ end with this one. I initially thought it was a very odd hardware-related error, but that seems to be ruled out by the fact that it occurs in multiple machines with the same drive, yet that same drive works flawlessly in other machines.

I’d appreciate any insight or suggestions anyone has.

Thank you,

Mahmoud Al-Qudsi
NeoSmart Technologies



