Samsung SM961 NVMe SSD randomly unmounts/loses connection/unavailable

Bug #1737934 reported by Lode Lesage
This bug affects 4 people

Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Invalid  High        Unassigned

Bug Description

Seems related to these bugs:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184
https://bugs.launchpad.net/ubuntu/+source/linux-signed/+bug/1682704
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705748

Problem:
At seemingly random times my computer (a brand-new Lenovo ThinkPad T470) loses access to its internal Samsung SM961 256GB SSD. When this happens the whole OS freezes, and when I try to power down I see a black terminal-like screen that prints errors like the following:

EXT4-fs error (device nvme0n1p2): ext4_find_entry:1431: inode #7471275 (or #741278): comm gmain (or systemd-journal or ...): reading directory iblock 0

This error seems to be repeated endlessly, though I've only let it go for a few minutes. No other errors are printed.

This is the only drive it has.
I don't know if this occurs in Windows too, since I removed Windows and installed Ubuntu immediately after updating the BIOS.

Info:
Distro: Ubuntu MATE 17.10

sudo uname -r
4.13.0-19-generic

sudo nvme get-feature -f 0x0c -H /dev/nvme0 (with latency set to 250)
get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
 Autonomous Power State Transition Enable (APSTE): Enabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[21]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[22]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[23]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[24]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[25]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[26]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[27]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[28]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[29]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[30]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[31]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................

sudo smartctl -a /dev/nvme0
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.13.0-19-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLW256HEHP-000L7
Serial Number: S35ENX0JA13385
Firmware Version: 4L7QCXB7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 256.060.514.304 [256 GB]
Unallocated NVM Capacity: 0
Controller ID: 2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 256.060.514.304 [256 GB]
Namespace 1 Utilization: 17.834.708.992 [17,8 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Dec 13 10:52:39 2017 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Warning Comp. Temp. Threshold: 69 Celsius
Critical Comp. Temp. Threshold: 72 Celsius

Supported Power States
St Op       Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +      7.60W        -        -    0  0  0  0        0       0
 1 +      6.00W        -        -    1  1  1  1        0       0
 2 +      5.10W        -        -    2  2  2  2        0       0
 3 -    0.0400W        -        -    3  3  3  3      210    1500
 4 -    0.0050W        -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 151.517 [77,5 GB]
Data Units Written: 160.733 [82,2 GB]
Host Read Commands: 1.874.938
Host Write Commands: 1.650.810
Controller Busy Time: 10
Power Cycles: 96
Power On Hours: 14
Unsafe Shutdowns: 78
Media and Data Integrity Errors: 0
Error Information Log Entries: 38
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 35 Celsius
Temperature Sensor 2: 61 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
  0         38     0  0x0018  0x4004  0x02c    0     0   -
  1         37     0  0x0017  0x4004  0x02c    0     0   -
  2         36     0  0x0018  0x4004  0x02c    0     0   -
  3         35     0  0x0017  0x4004  0x02c    0     0   -
  4         34     0  0x0018  0x4004  0x02c    0     0   -
  5         33     0  0x0017  0x4004  0x02c    0     0   -
  6         32     0  0x0018  0x4004  0x02c    0     0   -
  7         31     0  0x0017  0x4004  0x02c    0     0   -
  8         30     0  0x0018  0x4004  0x02c    0     0   -
  9         29     0  0x0017  0x4004  0x02c    0     0   -
 10         28     0  0x0018  0x4004  0x02c    0     0   -
 11         27     0  0x0017  0x4004  0x02c    0     0   -
 12         26     0  0x0018  0x4004  0x02c    0     0   -
 13         25     0  0x0017  0x4004  0x02c    0     0   -
 14         24     0  0x0018  0x4004  0x02c    0     0   -
 15         23     0  0x0017  0x4004  0x02c    0     0   -
... (22 entries not shown)
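
The Status and PELoc columns above can be unpacked by hand. The following is a minimal sketch, assuming the 16-bit status value follows the NVMe completion-status layout (bit 0 phase tag, bits 1-8 status code, bits 9-11 status code type, bit 14 More, bit 15 Do Not Retry) and that PELoc holds a byte offset in bits 0-7 and a bit offset in bits 8-10. The function names are hypothetical and are not part of smartctl or nvme-cli.

```shell
#!/bin/sh
# Hypothetical helpers; the bit layout used here is an assumption based on
# the NVMe completion-status format.

# Split a 16-bit status value (e.g. 0x4004) into its fields.
decode_nvme_status() {
    status=$(( $1 ))
    printf 'sc=0x%02x sct=%d more=%d dnr=%d\n' \
        "$(( (status >> 1) & 0xff ))" \
        "$(( (status >> 9) & 0x7 ))" \
        "$(( (status >> 14) & 0x1 ))" \
        "$(( (status >> 15) & 0x1 ))"
}

# Split a parameter error location (e.g. 0x02c) into byte offset, bit offset,
# and the command dword containing that byte (4 bytes per dword).
decode_nvme_peloc() {
    loc=$(( $1 ))
    byte=$(( loc & 0xff ))
    printf 'byte=%d bit=%d cdw=%d\n' \
        "$byte" "$(( (loc >> 8) & 0x7 ))" "$(( byte / 4 ))"
}

decode_nvme_status 0x4004   # prints: sc=0x02 sct=0 more=1 dnr=0
decode_nvme_peloc 0x02c     # prints: byte=44 bit=0 cdw=11
```

Under these assumptions, every logged entry is a generic status code 0x02 on queue 0 (the admin queue), with the rejected field located in command dword 11. This says the drive rejected a command parameter; it does not by itself explain the disconnects.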

lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers [8086:5904] (rev 02)
00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 620 [8086:5916] (rev 02)
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller [8086:9d2f] (rev 21)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-LP Thermal subsystem [8086:9d31] (rev 21)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-LP CSME HECI #1 [8086:9d3a] (rev 21)
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port [8086:9d10] (rev f1)
00:1c.6 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #7 [8086:9d16] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 [8086:9d18] (rev f1)
00:1d.2 PCI bridge [0604]: Intel Corporation Device [8086:9d1a] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Sunrise Point-LP LPC Controller [8086:9d58] (rev 21)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-LP PMC [8086:9d21] (rev 21)
00:1f.3 Audio device [0403]: Intel Corporation Device [8086:9d71] (rev 21)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-LP SMBus [8086:9d23] (rev 21)
00:1f.6 Ethernet controller [0200]: Intel Corporation Ethernet Connection (4) I219-V [8086:15d8] (rev 21)
04:00.0 Network controller [0280]: Intel Corporation Wireless 8265 / 8275 [8086:24fd] (rev 78)
3e:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961 [144d:a804]

sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S35ENX0JA13385       SAMSUNG MZVLW256HEHP-000L7               1         17,83 GB / 256,06 GB       512 B + 0 B      4L7QCXB7

I tried looking for kernel errors with dmesg | grep -i nvme and dmesg | grep -i EXT4-fs, but nothing of value shows up (only that the drive was mounted).
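
The two searches can be combined into one filter. This is a minimal sketch with a hypothetical function name. The previous-boot form assumes the journal is stored persistently, and if the disk has already dropped out, the messages may never have been written to it.

```shell
#!/bin/sh
# Hypothetical helper: keep only NVMe- or ext4-related lines from a kernel
# log supplied on stdin (case-insensitive match).
filter_storage_errors() {
    grep -iE 'nvme|EXT4-fs'
}

# Usage:
#   dmesg | filter_storage_errors                 # current boot
#   journalctl -k -b -1 | filter_storage_errors   # previous boot, if persisted
```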

What have I tried:
According to the bug reports mentioned above, my problem should already be fixed since I'm on kernel 4.13.
Since I still have the problem, I should be able to work around it temporarily by setting
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0"
and running sudo update-grub.
This doesn't help, however: I still randomly lose the connection to the drive. Sometimes this happens minutes after booting, sometimes after hours, but the system has never stayed stable for a full day.
I have tried every latency value I found in various bug reports online: 0, 250, 5500, 6000 and 11000. It seems most stable with 250 and least stable with 0, where the error happens seconds to minutes after boot.
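
Whether a given value actually reached the kernel can be checked after a reboot. The following is a minimal sketch. It assumes the parameter was added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub as described above, and that the nvme_core module exposes the setting under /sys/module. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the last nvme_core.default_ps_max_latency_us
# value found in a kernel command line string (prints nothing if absent).
ps_max_latency_from_cmdline() {
    printf '%s\n' "$1" | tr ' ' '\n' |
        sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' | tail -n 1
}

# Value the running kernel was booted with, if /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    echo "cmdline: $(ps_max_latency_from_cmdline "$(cat /proc/cmdline)")"
fi

# Value the nvme_core module reports (assumes the sysfs file is present).
param=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$param" ]; then
    echo "module:  $(cat "$param")"
fi
```

An empty "cmdline:" value would suggest that update-grub was not run or that a different boot entry was used.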

I'm at my wits' end here. I could just stick a normal SATA SSD in there but then this brand new one would be a waste. Any help would be greatly appreciated.
If more info is needed I'll do my best to provide it.

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: linux-image-4.13.0-19-generic 4.13.0-19.22
ProcVersionSignature: Ubuntu 4.13.0-19.22-generic 4.13.13
Uname: Linux 4.13.0-19-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.5
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: musilitar 1334 F.... pulseaudio
CurrentDesktop: MATE
Date: Wed Dec 13 11:11:54 2017
InstallationDate: Installed on 2017-12-07 (5 days ago)
InstallationMedia: Ubuntu-MATE 17.10 "Artful Aardvark" - Release amd64 (20171018)
MachineType: LENOVO 20HD0001MB
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.13.0-19-generic.efi.signed root=UUID=99902074-7315-4905-8a7d-97b65a09bb74 ro quiet splash nvme_core.default_ps_max_latency_us=250 vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.13.0-19-generic N/A
 linux-backports-modules-4.13.0-19-generic N/A
 linux-firmware 1.169.1
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/10/2017
dmi.bios.vendor: LENOVO
dmi.bios.version: N1QET68W (1.43 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20HD0001MB
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN1QET68W(1.43):bd11/10/2017:svnLENOVO:pn20HD0001MB:pvrThinkPadT470:rvnLENOVO:rn20HD0001MB:rvrSDK0J40697WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad T470
dmi.product.name: 20HD0001MB
dmi.product.version: ThinkPad T470
dmi.sys.vendor: LENOVO

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15-rc3

tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
Kai-Heng Feng (kaihengfeng) wrote :

Does the issue happen after a system suspend?

Lode Lesage (musilitar) wrote :

@ #3:

I installed the latest upstream kernel and will report back with the results.

uname -r
4.15.0-041500rc3-generic

I assumed it would also make the most sense to remove the GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=250" parameter to see if the upstream kernel fixes the problem completely. Is this assumption correct?

@ #4:

I normally never suspend my laptop, but I tried it a few times (~4) and it does not seem to affect the problem. It definitely didn't make it appear immediately and also does not seem to make it happen sooner.

Lode Lesage (musilitar) wrote :

Well...
Seconds after my previous message it happened again.
So it seems the upstream kernel does not fix the problem.
I'm adding GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=250" again.

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Kai-Heng Feng (kaihengfeng) wrote :

Could you please attach the output of `nvme id-ctrl /dev/nvme0`?

Lode Lesage (musilitar) wrote :

Here you go:

sudo nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S35ENX0JA13385
mn : SAMSUNG MZVLW256HEHP-000L7
fr : 4L7QCXB7
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 2
ver : 10200
rtd3r : 186a0
rtd3e : 4c4b40
oaes : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 342
cctemp : 345
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 256060514304
unvmcap : 0
rpmbs : 0
sqes : 0x66
cqes : 0x44
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x4
vwc : 0x1
awun : 255
awupf : 0
nvscc : 1
acwu : 0
sgls : 0
subnqn :
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Kai-Heng Feng (kaihengfeng) wrote :

Can you try "nvme_core.default_ps_max_latency_us=1500"? This will disable ps4, which causes a lot of trouble.
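
As a rough illustration of how the threshold relates to the power-state table, here is a minimal sketch using the latencies from the id-ctrl output above. It assumes the driver considers a non-operational state only when its exit latency does not exceed nvme_core.default_ps_max_latency_us, and treats 0 as "APST off". The exact rule depends on the kernel version (some compare entry plus exit latency instead), so this is an illustration rather than the kernel's logic. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the non-operational power states of this drive
# that would remain eligible for a given max latency (in microseconds),
# ASSUMING eligibility means "exit latency <= max latency".
eligible_states() {
    # Columns: state, op flag (+ operational, - non-operational),
    #          entry latency (us), exit latency (us); values from id-ctrl.
    printf '%s\n' \
        '0 + 0 0' '1 + 0 0' '2 + 0 0' '3 - 210 1500' '4 - 2200 6000' |
    awk -v max="$1" '
        # field 2: "-" marks a non-operational state; field 4: exit latency
        max + 0 > 0 && $2 == "-" && $4 + 0 <= max + 0 {
            printf "%sps%s", sep, $1
            sep = " "
        }
        END { print "" }'
}

eligible_states 1500   # prints: ps3
eligible_states 6000   # prints: ps3 ps4
eligible_states 250    # prints an empty line
```

Under this assumption, 1500 keeps ps3 and drops ps4, while 250 leaves no non-operational state eligible, which would be consistent with the all-zero APST table reported with the latency set to 250.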

Lode Lesage (musilitar) wrote :

I tried it but the problem happened again, twice, quite quickly after booting.

Kai-Heng Feng (kaihengfeng) wrote :

Sorry, I re-read your bug description. Do you have the same issue with "nvme_core.default_ps_max_latency_us=0", which disables APST completely?

Lode Lesage (musilitar) wrote :

No problem, I am very grateful for your help!

As I said in the "What have I tried"-section (sorry if it got lost in all the info, I tried to give as much as possible), yes I have tried that and yes I have the same problem. It seems to be worse actually when I put 0 as latency.

Kai-Heng Feng (kaihengfeng) wrote :

Please try this kernel without "nvme_core.default_ps_max_latency_us=".

http://people.canonical.com/~khfeng/lp1737934/

Lode Lesage (musilitar) wrote :

I installed it and removed the kernel parameter, but it happened again less than an hour after booting.

Kai-Heng Feng (kaihengfeng) wrote :

Can you attach the output of `nvme get-feature -f 0x0c -H /dev/nvme0` with the kernel from comment #13?

Also `sudo lspci -vvv`.

Lode Lesage (musilitar) wrote :

First I want to say thanks again for all the help!

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"

sudo uname -r
4.15.0-rc4+ (kernel from #13)

sudo nvme get-feature -f 0x0c -H /dev/nvme0
get-feature:0xc (Autonomous Power State Transition), Current value:00000000
 Autonomous Power State Transition Enable (APSTE): Disabled
 Auto PST Entries .................
 Entry[ 0]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 1]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 2]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 3]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 4]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 5]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 6]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 7]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 8]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[ 9]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[10]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[11]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[12]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[13]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[14]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[15]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[16]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[17]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[18]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[19]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition Power State (ITPS): 0
 .................
 Entry[20]
 .................
 Idle Time Prior to Transition (ITPT): 0 ms
 Idle Transition P...

Kai-Heng Feng (kaihengfeng) wrote :

So when you use "nvme_core.default_ps_max_latency_us=250", the issue is completely gone?

Lode Lesage (musilitar) wrote :

No, it just had the best result out of everything I have tried. It still hangs occasionally, especially after a cold boot.
The kernel from #13 is comparable so far: ~2 hangups per day, mostly after cold boot.

Kai-Heng Feng (kaihengfeng) wrote :

Hmm, I wonder whether the kernel parameter "pcie_port_pm=off" helps?

Lode Lesage (musilitar) wrote :

I tried it and it worked for 1 day, but it just happened again.
Maybe this isn't a problem with Ubuntu/Linux after all?
I have already checked the physical connection of the drive and as far as I could find out the drive itself is healthy too, so I have no idea what the problem might be...
Any more ideas?

Lode Lesage (musilitar) wrote :

I'm starting to think it isn't a software issue.
I tried to solve it again because it was becoming impossible to work, and decided to install Ubuntu MATE 16.04.3 LTS (kernel 4.10.0-42-generic) because I read that some people had fewer problems with NVMe drives on that.

The problem persisted however.
I also noticed that sometimes when I booted and went into the BIOS the drive would even lose connection/not be visible there, which to me indicates that it might be a hardware issue? Any thoughts on that? Any way I can determine for sure that I don't have a faulty drive?

Kai-Heng Feng (kaihengfeng) wrote : Re: [Bug 1737934] Re: Samsung SM961 NVMe SSD randomly unmounts/loses connection/unavailable

> On 28 Dec 2017, at 5:12 PM, Lode Lesage <email address hidden> wrote:
>
> I'm starting to think it isn't a software issue.
> I tried to solve it again because it was becoming impossible to work, and decided to install Ubuntu MATE 16.04.3 LTS (kernel 4.10.0-42-generic) because I read that some people had fewer problems with NVMe drives on that.

This issue is distro-agnostic.

>
> The problem persisted however.
> I also noticed that sometimes when I booted and went into the BIOS the drive would even lose connection/not be visible there, which to me indicates that it might be a hardware issue? Any thoughts on that? Any way I can determine for sure that I don't have a faulty drive?

There are three things worth trying:
- Check whether Windows also has this problem.
- Update the system BIOS to the latest version.
- Update the NVMe firmware to the latest version; the updater is probably only available under Windows.

IIRC, Samsung NVMe drives also had this problem under Windows, and a firmware update solved the issue.


Lode Lesage (musilitar) wrote :

Had the same problem in Windows (I think, harder to diagnose on Windows), even after installing updates for BIOS, firmware & drivers, so this is probably a hardware problem.
Lenovo support should visit me soon to fix the problem (with new hardware), I'll give an update after that.

Lode Lesage (musilitar) wrote :

To add some closure to this ticket:
Lenovo support had to visit me 3 times:
- first they replaced the SSD -> same problem
- then they replaced the SSD adapter and cable -> same problem
- lastly they just replaced the whole motherboard -> fixed the problem for Windows
So it seems I definitely had a hardware problem.
After all the problems I decided not to install Linux anymore, since I discovered you can now run Linux natively on Windows (it's called the Windows Subsystem for Linux or WSL, cool stuff!).
So in the end I'm not sure if I ALSO had a software problem, but I'm not gonna risk it by going back to Linux...

Changed in linux (Ubuntu):
status: Confirmed → Invalid
walter (walter-mollica) wrote :

Same thing here.
Dell XPS 9560, Samsung SM961 512GB NVMe SSD, Ubuntu 18.04.2, Linux 4.18.0-16-generic.

Brad Figg (brad-figg)
tags: added: ubuntu-certified
tags: added: cscc
Sebastian (szm) wrote :

Same problem with T470s and SAMSUNG MZVLW1T0HMLH-000L7

sqid : 0
cmdid : 0x18
status_field : 0x4004(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
parm_err_loc : 0x2c
lba : 0
nsid : 0
vs : 0
cs : 0
.................
