[ kernel 6.5 regression ] nvme not working on some laptops

Bug #2039601 reported by Luis Alberto Pabón
This bug affects 4 people
Affects          Status         Importance   Assigned to   Milestone
Linux            Fix Released   High
linux (Ubuntu)   Confirmed      Undecided    Unassigned

Bug Description

With the update to Ubuntu 23.10 my nvme drive ceased to function. Booting the old 6.2 kernel from 23.04 works, but not the newer 6.5.0-9.

This is a kernel bug that has been fixed upstream in 6.5.6. Any chance the fix could be backported?

More information: https://bugzilla.kernel.org/show_bug.cgi?id=217802
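
A rough way to compare a kernel's upstream base version against 6.5.6 is sketched below. It assumes the Ubuntu signature format shown in ProcVersionSignature in this report (third field is the upstream base, readable on a live system from /proc/version_signature) and GNU `sort -V`. The helper names `version_ge` and `upstream_base` are hypothetical. The comparison is only meaningful within the 6.5.y series, and a distro kernel may carry the backported fix while still reporting an older base, so an "older" result is not conclusive.

```shell
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# upstream_base SIG: print the third field of an Ubuntu signature line.
upstream_base() {
    printf '%s\n' "$1" | awk '{ print $3 }'
}

# Sample value taken from this report; on a live Ubuntu system use:
#   sig=$(cat /proc/version_signature)
sig="Ubuntu 6.5.0-9.9-generic 6.5.3"
base=$(upstream_base "$sig")

if version_ge "$base" 6.5.6; then
    echo "base $base: at or past 6.5.6"
else
    echo "base $base: older than 6.5.6"
fi
```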

ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: linux-image-6.5.0-9-generic 6.5.0-9.9
ProcVersionSignature: Ubuntu 6.5.0-9.9-generic 6.5.3
Uname: Linux 6.5.0-9-generic x86_64
NonfreeKernelModules: zfs
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: luis 3036 F.... wireplumber
 /dev/snd/controlC1: luis 3036 F.... wireplumber
 /dev/snd/seq: luis 3029 F.... pipewire
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Tue Oct 17 21:05:29 2023
InstallationDate: Installed on 2023-10-16 (1 days ago)
InstallationMedia: Ubuntu Legacy 23.10 "Mantic Minotaur" - Release amd64 (20231010)
MachineType: Dell Inc. XPS 15 9560
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-6.5.0-9-generic root=UUID=014fd29c-595e-4b8b-aedc-34e1eb9ab082 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-6.5.0-9-generic N/A
 linux-backports-modules-6.5.0-9-generic N/A
 linux-firmware 20230919.git3672ccab-0ubuntu2.1
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/17/2019
dmi.bios.release: 1.18
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.18.0
dmi.board.name: 05FFDN
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.18.0:bd11/17/2019:br1.18:svnDellInc.:pnXPS159560:pvr:rvnDellInc.:rn05FFDN:rvrA00:cvnDellInc.:ct10:cvr:sku07BE:
dmi.product.family: XPS
dmi.product.name: XPS 15 9560
dmi.product.sku: 07BE
dmi.sys.vendor: Dell Inc.

Revision history for this message
In , gjunk2 (gjunk2-linux-kernel-bugs) wrote :

Failure manually transcribed:

kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
kernel: nvme nvme0: Does your device have a faulty power saving mode enabled?
kernel: nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
kernel: nvme 0000:04:00.0: Unable to change power state from D3cold to D0, device inaccessible
kernel: nvme nvme0: Disabling device after reset failure: -19
mount[353]: mount /sysroot: can't read superblock on /dev/nvme0n1p5.
mount[353]: dmesg(1) may have more information after failed mount system call.
kernel: nvme0n1: detected capacity change from 2000409264 to 0
kernel: EXT4-fs (nvme0n1p5): unable to read superblock
systemd[1]: sysroot.mount: Mount process exited, code=exited, status=32/n/a
...

All kernels are upstream, untainted and compiled on Arch linux using:

 gcc version 13.2.1

Kernels Tested:
 - 6.4.10 - works fine
 - 6.5-rc6 - fails
 - 6.4.11 with 1 revert also fails

    Revert "nvme-pci: add NVME_QUIRK_BOGUS_NID for Samsung PM9B1 256G and 512G"

    This reverts commit 061fbf64825fb47367bbb6e0a528611f08119473.

Hardware:
  model name : Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
  stepping : 9
  microcode : 0xf4

nvme:
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961/SM963

All tests were run on a Dell laptop running Arch.

Revision history for this message
In , gjunk2 (gjunk2-linux-kernel-bugs) wrote :

Also I did try 6.4.11 with the suggested options :
   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Also did not boot.
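
When testing these options, it can help to confirm that they actually reached the kernel, since a boot loader entry that was not regenerated would silently drop them. The sketch below checks /proc/cmdline for each option; the helper name `has_param` is hypothetical, and the check says nothing about whether the options are effective on a given machine.

```shell
# has_param CMDLINE PARAM: succeed if PARAM appears as a whole word in CMDLINE.
has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line of the running kernel (empty if unavailable).
cmdline=$(cat /proc/cmdline 2>/dev/null || true)

for p in nvme_core.default_ps_max_latency_us=0 pcie_aspm=off; do
    if has_param "$cmdline" "$p"; then
        echo "$p: present"
    else
        echo "$p: missing"
    fi
done
```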

Revision history for this message
In , gjunk2 (gjunk2-linux-kernel-bugs) wrote :

git bisect results on lkml

https://lkml.org/lkml/2023/8/16/1363

Revision history for this message
In , gjunk2 (gjunk2-linux-kernel-bugs) wrote :

Just FYI in case of interest to anyone.

I can confirm that after blacklisting the drivers (rtsx_pci_sdmmc and rtsx_pci) and rebuilding the initramfs, rebooting works fine for both 6.4.11 and 6.5-rc6.
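
A minimal sketch of that workaround, reading the two module names as rtsx_pci_sdmmc and rtsx_pci. The helper `blacklist_conf` and the file name blacklist-rtsx.conf are arbitrary choices for illustration, not something stated in this thread. The commented commands need root, and the initramfs tool differs by distribution.

```shell
# blacklist_conf MODULE...: print one modprobe.d "blacklist" line per module.
blacklist_conf() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Preview the file contents.
blacklist_conf rtsx_pci_sdmmc rtsx_pci

# To apply (as root), then rebuild the initramfs and reboot:
#   blacklist_conf rtsx_pci_sdmmc rtsx_pci > /etc/modprobe.d/blacklist-rtsx.conf
#   mkinitcpio -P          # Arch
#   update-initramfs -u    # Debian/Ubuntu
```

This disables the Realtek card reader, so it is a workaround rather than a fix.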

Revision history for this message
In , jastxakajasmineteax (jastxakajasmineteax-linux-kernel-bugs) wrote :

(In reply to Gene from comment #1)
> Also I did try 6.4.11 with the suggested options :
> nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
>
> Also did not boot.

Hello,
I'm facing this same problem with linux-mainline-6.5rc6-1 (built by Chaotic-AUR), linux-zen-6.4.12 and linux-lts-6.1.47-1. OS is Garuda Linux. I understand that here, support is not given for downstream kernels like Zen and LTS.

In my case, adding
    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
did fix it for me and some others facing similar issues (they didn't get thrown into an emergency shell after failing to switch root though - they got stuck on a black screen instead). None of us tried blacklisting the kernels, as these boot params suggested by the error worked.
Everyone affected by this used NVMe devices, a lot of them from Samsung.

I use a Dell XPS 15 9560 (Toshiba KXG50ZNV512G NVMe 512GB). It has the problematic Realtek card reader.
I'm unsure if I should make a new report since the problem is only slightly different, with newer kernels. Reporting kernel bugs is very new to me so please let me know the right course of action for reporting this :) (not just Gene, but anyone here).

Revision history for this message
In , jastxakajasmineteax (jastxakajasmineteax-linux-kernel-bugs) wrote :

(In reply to Jasmine T from comment #4)
> None of us tried blacklisting the kernels

Sorry, typo... modules, not kernels. Need sleep.

Revision history for this message
In , bronecki.damian (bronecki.damian-linux-kernel-bugs) wrote :

I have had the same issue since 6.4.11 on my Dell XPS 15 9560 laptop running Fedora 38.

Revision history for this message
In , info (info-linux-kernel-bugs) wrote :

Same issue here on 6.4.11 or higher

Dell Precision 5520
sn: X7AS11Z7TYAT
model: KXG50ZNV1T02 NVMe TOSHIBA 1024GB
lspci: Toshiba Corporation XG5 NVMe SSD Controller (prog-if 02 [NVM Express])

Revision history for this message
In , fergalmt (fergalmt-linux-kernel-bugs) wrote :

Hi !

I have the same issue here, with a DELL XPS 15 9560, like Damian B.

Revision history for this message
In , timkniel (timkniel-linux-kernel-bugs) wrote :

Same issue, Dell Precision 5520.

Revision history for this message
In , carlon.luca (carlon.luca-linux-kernel-bugs) wrote :

I also experienced this issue. All kernels suddenly stopped booting: 6.5, 6.4, 6.1 and 5.15. The 6.1 series stopped working between 6.1.45 and 6.1.46.

By bisection I can say that boot of 6.1 fails after this commit: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8ee39ec479147e29af704639f8e55fce246ed2d9. It is the same change already mentioned.

Machine is:
Dell Precision 5520
SSD: PM961 NVMe SED Samsung 512GB

I also noticed another unrelated issue, so I decided to replace my SSD with a Samsung SSD 970 EVO Plus 1TB. This seems to solve both issues and I can now boot whatever kernel version I tested.

Revision history for this message
In , nazar (nazar-linux-kernel-bugs) wrote :

So happy I'm not alone, suffering for over a week due to this with a Samsung 980 Pro 2TB:

02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO (prog-if 02 [NVM Express])
 Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM9A1/PM9A3/980PRO
 Flags: bus master, fast devsel, latency 0, IRQ 16, NUMA node 0, IOMMU group 17
 Memory at 85000000 (64-bit, non-prefetchable) [size=16K]
 Capabilities: <access denied>
 Kernel driver in use: nvme
 Kernel modules: nvme

Tried different motherboard, bought another CPU to try, will try older kernel now.

Here are kernel logs:
[ 2762.189019] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[ 2762.189022] nvme nvme0: Does your device have a faulty power saving mode enabled?
[ 2762.189022] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[ 2762.254958] nvme 0000:02:00.0: enabling device (0000 -> 0002)
[ 2762.255161] nvme nvme0: Disabling device after reset failure: -19
[ 2762.271015] I/O error, dev nvme0n1, sector 178296536 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
[ 2762.271044] kworker/u64:12: attempt to access beyond end of device

Revision history for this message
In , nazar (nazar-linux-kernel-bugs) wrote :

My issue must be different, having it with kernels down to 6.4.3 on ASUS PRIME Z690-P D4.

Revision history for this message
In , regressions (regressions-linux-kernel-bugs) wrote :

Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?

Revision history for this message
In , nikof.06 (nikof.06-linux-kernel-bugs) wrote :

Same here with a Dell XPS 9560.

The issue originally manifested with the stock Toshiba SSD (THNSN5256GPUK 256GB). I tried replacing it with a WD_BLACK SN770 1TB, same behaviour.

With the new WD SSD installed, adding the kernel parameters "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" does not seem to help.

I tried the following Kernels:
- 6.1.52-1 (lts): doesn't work
- 6.4.8: works out of the box
- 6.5.2: doesn't work

Revision history for this message
In , miso (miso-linux-kernel-bugs) wrote :

openSUSE Tumbleweed already applied patch to kernel 6.4.12 and I can confirm, it works on my XPS 15 9560

Revision history for this message
In , regressions (regressions-linux-kernel-bugs) wrote :

(In reply to Michal Hlavac from comment #15)
> openSUSE Tumbleweed already applied patch

What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set PETXCFG Reg")?

Revision history for this message
In , miso (miso-linux-kernel-bugs) wrote :

Sorry, I don't know, downstream ticket is https://bugzilla.suse.com/show_bug.cgi?id=1214428
Maybe it will help you

Revision history for this message
In , kernel_bugzilla (kernelbugzilla-linux-kernel-bugs) wrote :

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #16)
> What patch? A revert of 101bd907b4244a ("misc: rtsx: judge ASPM Mode to set
> PETXCFG Reg")?
Yes https://github.com/openSUSE/kernel-source/commit/1b02b1528a26f4e9b577e215c114d8c5e773ee10

It is reported as still present on 6.5.2 in https://bugs.archlinux.org/task/79439#comment221866

Revision history for this message
In , jastxakajasmineteax (jastxakajasmineteax-linux-kernel-bugs) wrote :

(In reply to The Linux kernel's regression tracker (Thorsten Leemhuis) from comment #13)
> Just to clarify: is the latest 6.5.y version and/or 6.6-rc1 broken as well?

Bug is still present on 6.6rc1-1 (using build from Chaotic-AUR).
Errors are exactly the same as what I previously experienced. More details including errors and system specs here:
https://forum.garudalinux.org/t/dumped-into-emergency-shell-after-update-failed-to-start-switch-root-btrfs-errors/30440

Dell XPS 9560 i7-7700HQ with Toshiba KXG50ZNV512G NVMe 512GB (completely stock model)

Revision history for this message
In , jade (jade-linux-kernel-bugs) wrote :

Hi all,

There are several people hitting this, also on the 9560, downstream at NixOS. I have confirmed that the revert on 6.4 fixes my machine booting.

This is our bug: https://github.com/NixOS/nixpkgs/issues/253418, and there is a bunch of troubleshooting here: https://discourse.nixos.org/t/nvme-drive-not-detecting-after-calameres-initiates/32108

My plan is to submit a change to revert the patch on all supported kernels in NixOS, following openSUSE's lead.

Revision history for this message
In , aros (aros-linux-kernel-bugs) wrote :

The issue has been known for over a month now, yet the bad commit has still not been reverted in either mainline or stable. No idea what's going on.

Revision history for this message
In , hi (hi-linux-kernel-bugs) wrote :

It looks like the fix is still waiting for Tested-by tags from people affected by this issue: https://<email address hidden>/

You could test it and submit one. ;)

Revision history for this message
In , regressions (regressions-linux-kernel-bugs) wrote :

Yeah, tested-by would likely help; FWIW, I was and still am unhappy about how this regression is handled, but CCing Linus ~two weeks ago and pointing him to the discussion yesterday[1] didn't lead to any visible action from his side. :-/

[1] https://<email address hidden>/

Revision history for this message
In , regressions (regressions-linux-kernel-bugs) wrote :

FWIW, testing is always helpful in cases like this, but not needed anymore, as things will likely proceed soon anyway:
https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/

Revision history for this message
In , eva.ko.878 (eva.ko.878-linux-kernel-bugs) wrote :

Ah I was just about to test the patch... awesome to hear ^^ thank you
everyone for your hard work on this regression.

On 25/9/23 7:13 pm, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=217802
>
> --- Comment #24 from The Linux kernel's regression tracker (Thorsten
> Leemhuis) (<email address hidden>) ---
> FWIW, testing is always helpful in cases like this, but not needed anymore,
> as things will likely proceed soon anyway:
> https://lore.kernel.org/all/2023092522-climatic-commend-8c99@gregkh/
>

Revision history for this message
In , gjunk2 (gjunk2-linux-kernel-bugs) wrote :

Resolved in 6.6-rc4
Should be in 6.5.6 stable as well.

Revision history for this message
Luis Alberto Pabón (copong) wrote :
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
description: updated
summary: - [ kernel regression ] nvme not working on some laptops
+ [ kernel 6.5 regression ] nvme not working on some laptops
Changed in linux:
importance: Unknown → High
status: Unknown → Fix Released
Revision history for this message
Eric Rouleau (xblitz) wrote :

The issue seems fixed with the latest kernel, 6.5.0-12.

Revision history for this message
Luis Alberto Pabón (copong) wrote :

The newest kernel available for mantic, including "proposed", is 6.5.0-10. Where did you see -12?

Revision history for this message
Luis Alberto Pabón (copong) wrote :

Found it, thank you. I was tripped up by the fact that it doesn't show as an update, but of course it doesn't.

The kernel does seem to fix the nvme problem, but it potentially introduces issues with the graphics stack (half-size Plymouth spinner and Ubuntu logo on a 4K display, and GDM hangs forever).
