amdgpu reset during usage of firefox

Bug #2039868 reported by Pirouette Cacahuète
This bug affects 3 people
Affects          Status     Importance  Assigned to  Milestone
Linux            Unknown    Unknown
linux (Ubuntu)   Confirmed  Undecided   Unassigned
mesa (Ubuntu)    Confirmed  Undecided   Unassigned

Bug Description

Running Firefox Nightly on Ubuntu 23.10 (since Monday), I have experienced a few amdgpu resets in the past few hours.

ProblemType: Bug
DistroRelease: Ubuntu 23.10
Package: linux-image-6.5.0-9-generic 6.5.0-9.9
ProcVersionSignature: Ubuntu 6.5.0-9.9-generic 6.5.3
Uname: Linux 6.5.0-9-generic x86_64
ApportVersion: 2.27.0-0ubuntu5
Architecture: amd64
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Date: Thu Oct 19 18:26:43 2023
HibernationDevice: RESUME=/dev/mapper/vg--ubuntu-lv--ubuntu--swap
InstallationDate: Installed on 2022-07-04 (472 days ago)
InstallationMedia: Ubuntu 22.04 LTS "Jammy Jellyfish" - Release amd64 (20220419)
MachineType: LENOVO 21A0CTO1WW
ProcEnviron:
 LANG=fr_FR.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.5.0-9-generic root=/dev/mapper/vg--ubuntu-lv--ubuntu--root ro rootflags=subvol=@ quiet splash resume=/dev/mapper/vg--ubuntu-lv--ubuntu--swap vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-6.5.0-9-generic N/A
 linux-backports-modules-6.5.0-9-generic N/A
 linux-firmware 20230919.git3672ccab-0ubuntu2.1
SourcePackage: linux
UpgradeStatus: Upgraded to mantic on 2023-10-16 (3 days ago)
dmi.bios.date: 05/15/2023
dmi.bios.release: 1.24
dmi.bios.vendor: LENOVO
dmi.bios.version: R1MET54W (1.24 )
dmi.board.asset.tag: Not Available
dmi.board.name: 21A0CTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.ec.firmware.release: 1.24
dmi.modalias: dmi:bvnLENOVO:bvrR1MET54W(1.24):bd05/15/2023:br1.24:efr1.24:svnLENOVO:pn21A0CTO1WW:pvrThinkPadP14sGen2a:rvnLENOVO:rn21A0CTO1WW:rvrNotDefined:cvnLENOVO:ct10:cvrNone:skuLENOVO_MT_21A0_BU_Think_FM_ThinkPadP14sGen2a:
dmi.product.family: ThinkPad P14s Gen 2a
dmi.product.name: 21A0CTO1WW
dmi.product.sku: LENOVO_MT_21A0_BU_Think_FM_ThinkPad P14s Gen 2a
dmi.product.version: ThinkPad P14s Gen 2a
dmi.sys.vendor: LENOVO

Revision history for this message
In , felix.adrianto (felix.adrianto-linux-kernel-bugs) wrote :

Error message:
[Dec 5 22:08] amdgpu 0000:23:00.0: GPU fault detected: 146 0x0000480c for process yuzu pid 2920 thread yuzu:cs0 pid 2935
[ +0.000005] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00000000
[ +0.000002] amdgpu 0000:23:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0604800C
[ +0.000003] amdgpu 0000:23:00.0: VM fault (0x0c, vmid 3, pasid 32770) at page 0, read from 'TC4' (0x54433400) (72)
[ +10.053011] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=37241, emitted seq=37244
[ +0.000007] [drm] GPU recovery disabled.
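
To compare timeout lines across the reports in this thread, the ring name and the two fence sequence numbers can be pulled out of the `amdgpu_job_timedout` message. This is a minimal sketch, not part of the original report. The helper name `parse_job_timeout` is hypothetical, and reading `emitted - signaled` as the number of submitted but unfinished jobs is an assumption.

```python
import re

# Matches the job-timeout message format seen in the logs of this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\w+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_job_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: the gap is the count of fences emitted but not yet signaled.
    return m.group("ring"), signaled, emitted, emitted - signaled


sample_line = (
    "[ +10.053011] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=37241, emitted seq=37244"
)
print(parse_job_timeout(sample_line))
```

On the line above this reports the `gfx` ring with a gap of 3 between emitted and signaled sequence numbers.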

How to reproduce the issue:
1. Playing with yuzu-emulator
2. Load Super Mario Odyssey
3. Start new game
4. When Mario is about to jump for the first time after being woken up by Cappy, the bug reliably occurs.

During the issue, the following occurred:
1. Graphics locked up.
2. The system could still be accessed through SSH.

System specification:
Debian Sid
Radeon RX 580

I have tried the following combinations:
1. Kernel 4.17, 4.18, 4.19, 4.20, drm-next-4.21.wip
2. Mesa 18.2, 18.3, 19.0-development branch

None of these combinations fixes the issue. Let me know if you need more information or more testing from me.

In , alexdeucher (alexdeucher-linux-kernel-bugs) wrote :

This is more likely a mesa issue than a kernel issue.

In , felix.adrianto (felix.adrianto-linux-kernel-bugs) wrote :

I will try to test with amdgpu-pro sometime this week with the kernels that I mentioned above. If the application works as expected, it could be a Mesa OpenGL bug.

In , anode.dev (anode.dev-linux-kernel-bugs) wrote :

(In reply to Alex Deucher from comment #1)
> This is more likely a mesa issue than a kernel issue.

No, the 4.14 kernel with the latest Mesa libs works very well without any hangs.
But from 4.20.4 on, in all later kernels (including 5.0), the OS freezes for about 30 s every 30 s to 1 min when browsing YouTube with HW acceleration (UVD) enabled or when playing a game. RX550, Arch, vanilla kernel.

[ 365.021164] amdgpu: [powerplay]
                last message was failed ret is 0
[ 365.045198] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 365.570667] amdgpu: [powerplay]
                failed to send message 133 ret is 0
[ 366.115228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=9365, emitted seq=9365
[ 366.115377] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 366.115388] [drm] Timeout, but no hardware hang detected.
[ 366.689407] amdgpu: [powerplay]
                last message was failed ret is 0
[ 367.232287] amdgpu: [powerplay]
                failed to send message 306 ret is 0
[ 367.787043] amdgpu: [powerplay]
                last message was failed ret is 0
[ 368.320138] amdgpu: [powerplay]
                failed to send message 5e ret is 0
[ 369.367739] amdgpu: [powerplay]
                last message was failed ret is 0
[ 369.907559] amdgpu: [powerplay]
                failed to send message 145 ret is 0
[ 370.994478] amdgpu: [powerplay]
                last message was failed ret is 0
[ 371.538753] amdgpu: [powerplay]
                failed to send message 146 ret is 0
[ 372.075079] amdgpu: [powerplay]
                last message was failed ret is 0
[ 372.598565] amdgpu: [powerplay]
                failed to send message 148 ret is 0
[ 373.657188] amdgpu: [powerplay]
                last message was failed ret is 0
[ 374.198637] amdgpu: [powerplay]
                failed to send message 145 ret is 0
[ 375.075076] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 375.284948] amdgpu: [powerplay]
                last message was failed ret is 0
[ 375.830347] amdgpu: [powerplay]
                failed to send message 146 ret is 0
[ 376.138428] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=10113, emitted seq=10113
[ 376.138783] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
[ 376.138797] [drm] IP block:sdma_v3_0 is hung!
[ 376.138809] [drm] GPU recovery disabled.
[ 376.394657] amdgpu: [powerplay]
                last message was failed ret is 0
[ 376.934375] amdgpu: [powerplay]
                failed to send message 16a ret is 0
[ 377.463230] amdgpu: [powerplay]
                last message was failed ret is 0
[ 377.977725] amdgpu: [powerplay]
                failed to send message 186 ret is 0
[ 378.518406] amdgpu: [powerplay]
                last message was failed ret is 0
[ 379.060098] amdgpu: [powerplay]
                failed to send message 54 ret is 0
[ 379.556880] amdgpu: [powerplay]
                last message was failed ret is 0
[ 380.075217] amdgpu: [powerp...
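
A long powerplay excerpt like the one above is easier to read once the failed message IDs are tallied. The following is only a sketch under assumptions: `tally_failed_messages` is a hypothetical helper, the IDs are kept as opaque strings because the log does not say how to interpret them, and the sample is a few lines copied from the excerpt.

```python
import re
from collections import Counter

# "failed to send message <id> ret is <value>", as printed in the excerpt.
FAILED_SEND_RE = re.compile(r"failed to send message (\w+) ret is (\w+)")


def tally_failed_messages(log_text):
    """Count how often each message ID appears in 'failed to send' lines."""
    return Counter(m.group(1) for m in FAILED_SEND_RE.finditer(log_text))


sample_log = """\
[ 365.570667] amdgpu: [powerplay]
                failed to send message 133 ret is 0
[ 367.232287] amdgpu: [powerplay]
                failed to send message 306 ret is 0
[ 369.907559] amdgpu: [powerplay]
                failed to send message 145 ret is 0
[ 374.198637] amdgpu: [powerplay]
                failed to send message 145 ret is 0
"""

for message_id, count in tally_failed_messages(sample_log).most_common():
    print(message_id, count)
```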


In , alexdeucher (alexdeucher-linux-kernel-bugs) wrote :

Can you bisect?

In , kernel (kernel-linux-kernel-bugs) wrote :

I'm having a very similar issue, running Linux Mint 19.1. The issue has persisted since at least 4.15; I'm currently running 5.0.1 and the issue remains.

Here is the latest syslog of the error:

[37258.615599] gmc_v9_0_process_interrupt: 10 callbacks suppressed
[37258.615608] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615615] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
[37258.615619] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
[37258.615629] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615633] amdgpu 0000:06:00.0: in page starting at address 0x0000800107807000 from 27
[37258.615636] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615645] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615648] amdgpu 0000:06:00.0: in page starting at address 0x0000800107801000 from 27
[37258.615651] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615660] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615663] amdgpu 0000:06:00.0: in page starting at address 0x0000800107803000 from 27
[37258.615666] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615675] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615678] amdgpu 0000:06:00.0: in page starting at address 0x0000800107809000 from 27
[37258.615681] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615689] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615692] amdgpu 0000:06:00.0: in page starting at address 0x000080010780b000 from 27
[37258.615695] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615704] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615707] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
[37258.615710] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615740] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615743] amdgpu 0000:06:00.0: in page starting at address 0x0000800107807000 from 27
[37258.615746] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615756] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615759] amdgpu 0000:06:00.0: in page starting at address 0x0000800107801000 from 27
[37258.615762] amdgpu 0000:06:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[37258.615771] amdgpu 0000:06:00.0: [gfxhub] VMC page fau...
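
The excerpt above repeats the same few nearby page addresses for a single process. A short sketch can make that visible by collecting the faulting processes and the distinct page addresses. The helper name `summarize_faults` is hypothetical, and the regular expressions assume the exact message format shown in this comment.

```python
import re

# Process name and pid from the "VMC page fault (...)" line.
FAULT_RE = re.compile(r"page fault \(.*?for process (?P<proc>\S+) pid (?P<pid>\d+)")
# Faulting page from the following "in page starting at address ..." line.
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")


def summarize_faults(log_text):
    """Return (set of (process, pid), sorted list of distinct page addresses)."""
    procs = {(m.group("proc"), int(m.group("pid"))) for m in FAULT_RE.finditer(log_text)}
    pages = sorted({int(addr, 16) for addr in ADDR_RE.findall(log_text)})
    return procs, pages


sample_log = """\
[37258.615608] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615615] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
[37258.615629] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615633] amdgpu 0000:06:00.0: in page starting at address 0x0000800107807000 from 27
[37258.615704] amdgpu 0000:06:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1287 thread Xorg:cs0 pid 1317)
[37258.615707] amdgpu 0000:06:00.0: in page starting at address 0x0000800107805000 from 27
"""

procs, pages = summarize_faults(sample_log)
print(procs)
print([hex(p) for p in pages])
```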

In , anode.dev (anode.dev-linux-kernel-bugs) wrote :

Tried linux-amd-staging-drm-next-git-5.1.811103.2acb851ad43b and dmesg still has a lot of warnings. Also tested YouTube in Chrome with UVD and got a minor freeze and a long (~30 s) freeze of the system.

Apr 01 21:01:03 kernel: amdgpu 0000:03:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring uvd_enc0 test failed (-110)
Apr 01 21:01:03 kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <uvd_v6_0> failed -110
Apr 01 21:01:03 kernel: [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_device_ip_resume failed (-110).

Apr 01 20:26:59 kernel: [drm] amdgpu kernel modesetting enabled.
Apr 01 20:26:59 kernel: vga_switcheroo: detected switching method \_SB_.PCI0.VGA_.ATPX handle
Apr 01 20:26:59 kernel: [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x1025:0x1201 0xCA).
Apr 01 20:26:59 kernel: [drm] register mmio base: 0xD1500000
Apr 01 20:26:59 kernel: [drm] register mmio size: 262144
Apr 01 20:26:59 kernel: [drm] add ip block number 0 <vi_common>
Apr 01 20:26:59 kernel: [drm] add ip block number 1 <gmc_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 2 <cz_ih>
Apr 01 20:26:59 kernel: [drm] add ip block number 3 <gfx_v8_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 4 <sdma_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 5 <powerplay>
Apr 01 20:26:59 kernel: [drm] add ip block number 6 <dm>
Apr 01 20:26:59 kernel: [drm] add ip block number 7 <uvd_v6_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 8 <vce_v3_0>
Apr 01 20:26:59 kernel: [drm] add ip block number 9 <acp_ip>
Apr 01 20:26:59 kernel: [drm] UVD is enabled in physical mode
Apr 01 20:26:59 kernel: [drm] VCE enabled in physical mode
Apr 01 20:26:59 kernel: ATOM BIOS: 113-C91400-007
Apr 01 20:26:59 kernel: [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Apr 01 20:26:59 kernel: [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Apr 01 20:26:59 kernel: amdgpu 0000:00:01.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Apr 01 20:26:59 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Apr 01 20:26:59 kernel: [drm] RAM width 64bits UNKNOWN
Apr 01 20:26:59 kernel: [TTM] Zone kernel: Available graphics memory: 3804974 KiB
Apr 01 20:26:59 kernel: [TTM] Zone dma32: Available graphics memory: 2097152 KiB
Apr 01 20:26:59 kernel: [TTM] Initializing pool allocator
Apr 01 20:26:59 kernel: [TTM] Initializing DMA pool allocator
Apr 01 20:26:59 kernel: [drm] amdgpu: 512M of VRAM memory ready
Apr 01 20:26:59 kernel: [drm] amdgpu: 3072M of GTT memory ready.
Apr 01 20:26:59 kernel: [drm] GART: num cpu pages 262144, num gpu pages 262144
Apr 01 20:26:59 kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F4007E9000).
Apr 01 20:26:59 kernel: [drm] Found UVD firmware Version: 1.91 Family ID: 11
Apr 01 20:26:59 kernel: [drm] UVD ENC is disabled
Apr 01 20:26:59 kernel: [drm] Found VCE firmware Version: 52.4 Binary ID: 3
Apr 01 20:26:59 kernel: smu version 27.17.00
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: values for Engine clock
Apr 01 20:26:59 kernel: [drm] DM_PPLIB: 30000...

In , anode.dev (anode.dev-linux-kernel-bugs) wrote :

(In reply to Alex Deucher from comment #4)
> Can you bisect?

Unfortunately this is not possible, as all recent kernels ship with Display Core (DC) enabled by default. As I said, the vanilla 4.14 kernel works like a charm on the same hardware with the same Mesa libs: no lags, no hangs or freezes, and none of the warnings listed above. So a "git bisect" makes no sense, because it is not a single commit that misbehaves with the GPU. DC is completely new functionality that replaces the old amdgpu code.
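
For context on the disagreement: bisection is a binary search between a known-good and a known-bad version, and it only gives a meaningful answer if the history flips from good to bad exactly once. A minimal sketch of the idea follows. The function `first_bad` and the predicate `is_bad` are hypothetical stand-ins for building and testing a kernel, not an actual git interface.

```python
def first_bad(versions, is_bad):
    """Binary search for the first bad entry.

    Assumes versions are ordered oldest to newest, versions[0] is good,
    versions[-1] is bad, and there is a single good-to-bad transition.
    """
    lo, hi = 0, len(versions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Illustrative only: pretend everything from index 6 onward shows the hang.
commits = list(range(10))
print(first_bad(commits, lambda c: c >= 6))
```

Each step halves the remaining range, so even a few thousand commits need only about a dozen build-and-test rounds. If the regression comes from several interacting changes rather than one transition, as argued above, the search can still end on a commit without identifying a single culprit.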

In , au1064 (au1064-linux-kernel-bugs) wrote :

Hi, I have a very similar problem. My system works with 4.15 and with 5.1.16, but not with other 5.x kernels:

The system does not boot with the other 5.x kernels. With 5.1.16 the GUI sometimes freezes, but sshd and the mouse still work.

CPU: Ryzen 5 2400g, BOARD: AORUS B450 I PRO WIFI, X Server 1.19.6

Kernel 5.0.x not working (blank screen after boot)
Kernel 5.2.x ( x <= 9 ) is not working (blank screen after boot)

but Kernel 5.1.16 is working (mostly)!

Error LOG with 5.1.16:
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: [gfxhub] no-retry page fault (src_id:0 ring:24 vmid:3 pasid:32768, for process Xorg pid 1848 thread Xorg:cs0 pid 1849)
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: in page starting at address 0x000080010c205000 from 27
[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=840738, emitted seq=840740
[Mi Aug 14 14:22:31 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1848 thread Xorg:cs0 pid 1849
[Mi Aug 14 14:22:31 2019] [drm] GPU recovery disabled.
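
In this log the page fault and the ring timeout are 10 seconds apart, close to the +10.053011 s gap in the original report at the top of the upstream thread. The sketch below computes that gap from the human-readable timestamps. It is an illustration only: `seconds_between` is a hypothetical helper that assumes both lines fall on the same day, and whether 10 s is the configured job timeout on these systems is not stated in the thread.

```python
import re
from datetime import datetime

# Matches stamps such as "[Mi Aug 14 14:22:21 2019]" and captures the time of day.
STAMP_RE = re.compile(r"^\[\w+ \w+ +\d+ (\d{2}:\d{2}:\d{2}) \d{4}\]")


def time_of_day(line):
    """Parse the HH:MM:SS part of the leading timestamp."""
    m = STAMP_RE.match(line)
    if m is None:
        raise ValueError("no human-readable timestamp at start of line")
    return datetime.strptime(m.group(1), "%H:%M:%S")


def seconds_between(earlier_line, later_line):
    """Seconds elapsed between two log lines stamped on the same day."""
    return (time_of_day(later_line) - time_of_day(earlier_line)).total_seconds()


fault_line = "[Mi Aug 14 14:22:21 2019] amdgpu 0000:09:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000"
timeout_line = (
    "[Mi Aug 14 14:22:31 2019] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx timeout, signaled seq=840738, emitted seq=840740"
)
print(seconds_between(fault_line, timeout_line))
```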

In , ungu_93 (ungu93-linux-kernel-bugs) wrote :

Just got something similar while playing Left 4 Dead. The system simply froze with altered colors on the screen and the sound just looping over the last second or so. Cannot confirm SSH access.

journalctl -b -1 ends with

[drm:gfx_v8_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2225992, emitted seq=2225993
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process hl2_linux pid 12532 thread hl2_

OS: Ubuntu 19.04
Kernel: 5.0.0-27-generic
GPU: Radeon RX580
CPU: Ryzen 5 1600x

Thanks!

In , anode.dev (anode.dev-linux-kernel-bugs) wrote :

(In reply to Ungureanu Alexandru from comment #9)
> Just got something similar while playing Left 4 Dead. The system simply
> froze with altered colors on the screen and the sound just looping over the
> last second or so. Cannot confirm SSH access.

> Kernel: 5.0.0-27-generic
> GPU: Radeon RX580
> CPU: Ryzen 5 1600x

5.0 is a very outdated kernel; use the latest from kernel.org.

For me everything works perfectly in 5.3 (Polaris chip, RX540).
Finally I no longer get any errors like these:
- ERROR* resume of IP block <uvd_v6_0> failed -110
- [drm] Fence fallback timer expired on ring sdma0
- last message was failed ret is **
- [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq...
- IP block:sdma_v3_0 is hung!
- Timeout, but no hardware hang detected.

Tested on YouTube with HW-accelerated video and in several games.
Thank you a lot, guys from AMD. I had to wait over a year to get these bugs fixed.

In , lekto (lekto-linux-kernel-bugs) wrote :

Same problem here. It happens when I run looking-glass [1], but not every time. I tried downgrading my kernel from 5.3.1 to 5.2.11 (I'm pretty sure it worked then), downgrading Mesa from 19.2.0 to 19.1.7 (I'm sure it worked with 19.2.0-rc), and downgrading my firmware to 2019-09-23 (the oldest in the repo).

When it happens, looking-glass starts blinking and sometimes my other monitor gets stuck so that I can only move the cursor on it.

Spec:
Gentoo ~amd64
Ryzen 1600 (others have Ryzen too; coincidence?)
Linux GPU: R7 240 (with radeon driver)
Windows GPU: RX580
ASRock X370 Gaming X

[1] https://looking-glass.hostfission.com/

In , mh (mh-linux-kernel-bugs) wrote :

Hi,

I think I have the same bug and opened https://bugzilla.kernel.org/show_bug.cgi?id=204683.

At first it looked a bit different, because in newer kernels the error message has changed. But as you can see I did some testing and this seems to go way back. Sadly I couldn't test a 4.18 kernel.

Can somebody mark my report as duplicate? Because I think it is.

And would more debug info help?

In , mh (mh-linux-kernel-bugs) wrote :

*** Bug 204683 has been marked as a duplicate of this bug. ***

In , perk11 (perk11-linux-kernel-bugs) wrote :

Also experiencing this with Radeon RX 5700 XT and amdgpu 19.1.0+git1910111930.b467d2~oibaf~b

Didn't have any heavy load for the GPU to do.

First, some artifacts appeared on the Plasma Hard Disk Monitor widget and CPU Load widget (here is a screenshot: https://i.perk11.info/20191024_193152_kernel.png) while the PC was idle and the screen was locked, but everything else continued to work fine.

I checked the logs for the period when this could've happened, but the only logs from that period are from KScreen that start like this:

Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Output: 88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Property: EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: State (newValue, Deleted): 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Output: 88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Property: EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: State (newValue, Deleted): 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Output: 88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Mode: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Rotation: "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Connection: "Disconnected"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Subpixel Order: 0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRScreenChangeNotify
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Window: 18874373
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Root: 1744
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Rotation: "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Size ID: 65535
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Size: 7280 1440
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: SizeMM: 1926 381
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Output: 88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Mode: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: Rotation: "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen...


In , perk11 (perk11-linux-kernel-bugs) wrote :

My kernel version is 5.3.7-050307-generic running KDE Neon User edition with latest updates.

In , shallowaloe (shallowaloe-linux-kernel-bugs) wrote :

Created attachment 285665
5 second video clip that triggers a crash

Hi,

I think I'm having the same problem as you guys. I run a mythbackend where I record cable television and those recordings often crash my system when hardware decoding is enabled. Usually it's just the screen that freezes and I can still ssh to it.

Kernel 5.1.6 was an exception for me too, with that kernel I'm able to restart the display manager and recover without having to reboot.

Attached is a short video that crashes my system. I can trigger the alert by running:

mpv --vo=vaapi out.ts

I'm wondering if it crashes your systems too and if it's related.
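
For anyone retrying this, the command above can be wrapped so that a hung player does not block the shell forever. This is a rough sketch only. The helper names and the 30-second limit are hypothetical, `out.ts` is the attached clip, and since the display itself may freeze, it is probably best run from an SSH session.

```python
import subprocess


def build_repro_command(clip_path):
    """Build the mpv invocation quoted in this comment."""
    return ["mpv", "--vo=vaapi", clip_path]


def run_repro(clip_path, timeout_s=30):
    """Run the reproduction; return mpv's exit code, or None if it timed out.

    Raises FileNotFoundError if mpv is not installed.
    """
    try:
        result = subprocess.run(build_repro_command(clip_path), timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return None
    return result.returncode


print(" ".join(build_repro_command("out.ts")))
```

Calling `run_repro("out.ts")` and then checking the kernel log for new amdgpu messages gives one data point per run.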

In , jmstylr (jmstylr-linux-kernel-bugs) wrote :

(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
>
> Hi,
>
> I think I'm having the same problem as you guys. I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled. Usually it's just the screen that freezes and
> I can still ssh to it.
>
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
>
> Attached is a short video that crashes my system. I can trigger the alert
> by running:
>
> mpv --vo=vaapi out.ts
>
> I'm wondering if it crashes your systems too and if it's related.

Just to add a data point, I tried running `mpv --vo=vaapi out.ts` against your file, and while it crashed the application, it did not freeze the system.

My hardware is a Ryzen 3700X with a Radeon RX 5700, running Ubuntu 19.10 with default kernel (5.3.0-19-generic).

The command did result in the following lines in /var/log/syslog repeated every 5 seconds:

Nov 10 07:04:23 redacted kernel: [ 2266.802162] gmc_v10_0_process_interrupt: 23900 callbacks suppressed
Nov 10 07:04:23 redacted kernel: [ 2266.802166] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802170] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802171] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802176] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802178] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802179] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.802566] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802568] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802569] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802573] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802575] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802576] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.802984] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802985] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802987] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802993] amdgpu 0000:0b:00.0: [mmhub] VMC page fault (src_id:0 ring:158 vmid:0 pasid:0)
Nov 10 07:04:23 redacted kernel: [ 2266.802994] amdgpu 0000:0b:00.0: at page 0x0000000000000000 from 18
Nov 10 07:04:23 redacted kernel: [ 2266.802995] amdg...
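
Two things stand out in the excerpt above: the fault status alternates between two values, and the kernel reports that many more callbacks were suppressed than printed. A small sketch can summarize both. `summarize_mmhub_faults` is a hypothetical helper, and the status values are only counted here, not decoded.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"VM_L2_PROTECTION_FAULT_STATUS:(0x[0-9A-Fa-f]+)")
SUPPRESSED_RE = re.compile(r"(\d+) callbacks suppressed")


def summarize_mmhub_faults(log_text):
    """Return (Counter of fault-status values, total suppressed callbacks)."""
    statuses = Counter(int(s, 16) for s in STATUS_RE.findall(log_text))
    suppressed = sum(int(n) for n in SUPPRESSED_RE.findall(log_text))
    return statuses, suppressed


sample_log = """\
Nov 10 07:04:23 redacted kernel: [ 2266.802162] gmc_v10_0_process_interrupt: 23900 callbacks suppressed
Nov 10 07:04:23 redacted kernel: [ 2266.802171] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
Nov 10 07:04:23 redacted kernel: [ 2266.802179] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Nov 10 07:04:23 redacted kernel: [ 2266.802569] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x0000213D
"""

statuses, suppressed = summarize_mmhub_faults(sample_log)
for value, count in statuses.most_common():
    print(hex(value), count)
print("suppressed:", suppressed)
```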


In , mh (mh-linux-kernel-bugs) wrote :

Hi,

I recently built a 5.4.0-rc7 from drm-next (my HEAD was 17eee668b3cad423a47c090fe2275733c55db910) and also updated Mesa to 19.3.0-RC1.

Since then I didn't get any crashes. I have tested this for a few hours now, but it's entirely possible that I just didn't run into the bug for some reason, although it usually appeared after half an hour.

If possible please try this setup and see if it is fixed.

In , j.cordoba (j.cordoba-linux-kernel-bugs) wrote :

Hi,

This issue is still present in the latest kernels:

5.4.1, 5.4, 5.3.14

Last usable kernel for me is 4.20.17

System Specs

- Gigabyte b450-ds3h
- Ryzen 5 3400G (with RX Vega 11)
- Mesa 19.1.2 - padoka PPA (Stable)
- Ubuntu 18.04.3 LTS

In , mh (mh-linux-kernel-bugs) wrote :

Dear j.cordoba,

Could you try building 5.4.0-rc7 from drm-next and test it, as I mentioned in comment 18?

I have been running this for some time now and the bug should have appeared by now, so I'm getting more confident that it is fixed.

Best regards
Matthias

In , lukasz (lukasz-linux-kernel-bugs) wrote :

Same is happening to me on 5.4.1. No issue with 4.9.

[ 44.172714] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 49.292694] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 58.469316] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 63.586055] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[ 156.606591] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

In , pierre-eric.pelloux-prayer (pierre-eric.pelloux-prayer-linux-kernel-bugs) wrote :

(In reply to shallowaloe from comment #16)
> Created attachment 285665 [details]
> 5 second video clip that triggers a crash
>
> Hi,
>
> I think I'm having the same problem as you guys. I run a mythbackend where
> I record cable television and those recordings often crash my system when
> hardware decoding is enabled. Usually it's just the screen that freezes and
> I can still ssh to it.
>
> Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> restart the display manager and recover without having to reboot.
>
> Attached is a short video that crashes my system. I can trigger the alert
> by running:
>
> mpv --vo=vaapi out.ts
>
> I'm wondering if it crashes your systems too and if it's related.

This one is probably a Mesa issue, see https://gitlab.freedesktop.org/mesa/mesa/issues/2177

What Mesa version are you using?

In , shallowaloe (shallowaloe-linux-kernel-bugs) wrote :

Created attachment 286227
attachment-25111-0.html

Thanks for the link to the bug. I'm running an ubuntu based system and am
using the oibaf ppa. The current version is 20.0.

On Wed, Dec 4, 2019 at 1:54 AM <email address hidden> wrote:

> https://bugzilla.kernel.org/show_bug.cgi?id=201957
>
> --- Comment #22 from Pierre-Eric Pelloux-Prayer (
> <email address hidden>) ---
> (In reply to shallowaloe from comment #16)
> > Created attachment 285665 [details]
> > 5 second video clip that triggers a crash
> >
> > Hi,
> >
> > I think I'm having the same problem as you guys. I run a mythbackend where
> > I record cable television and those recordings often crash my system when
> > hardware decoding is enabled. Usually it's just the screen that freezes and
> > I can still ssh to it.
> >
> > Kernel 5.1.6 was an exception for me too, with that kernel I'm able to
> > restart the display manager and recover without having to reboot.
> >
> > Attached is a short video that crashes my system. I can trigger the alert
> > by running:
> >
> > mpv --vo=vaapi out.ts
> >
> > I'm wondering if it crashes your systems too and if it's related.
>
>
> This one is probably a Mesa issue, see
> https://gitlab.freedesktop.org/mesa/mesa/issues/2177
>
> What Mesa version are you using?

In , janpieter.sollie (janpieter.sollie-linux-kernel-bugs) wrote :

Hi everyone,

I have the same issue with a Fiji Nano GPU: UVD6 and VCE3 timeout in ring buffer test @ boot with the AMDGPU driver. Other rings seem to work correctly.
To make sure the hardware functions like it should, and it's not a HW error, where (in the amdgpu driver) can I increase the timeout value?

In , janpieter.sollie (janpieter.sollie-linux-kernel-bugs) wrote :

Created attachment 286575
kernel config 5.4.7 Fiji

Some additional info for my case:
- Running kernel 5.4.7 (vanilla), firmware 20191108 on gentoo
- dmesg | grep -E "(drm)|(amdgpu)":
[ 3.930023] [drm] amdgpu kernel modesetting enabled.
[ 3.930217] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[ 3.930219] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[ 3.930221] amdgpu 0000:0a:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfce00000 -> 0xfce3ffff
[ 3.930224] fb0: switching to amdgpudrmfb from EFI VGA
[ 3.930475] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xCA).
[ 3.930486] [drm] register mmio base: 0xFCE00000
[ 3.930486] [drm] register mmio size: 262144
[ 3.930495] [drm] add ip block number 0 <vi_common>
[ 3.930495] [drm] add ip block number 1 <gmc_v8_0>
[ 3.930496] [drm] add ip block number 2 <tonga_ih>
[ 3.930497] [drm] add ip block number 3 <gfx_v8_0>
[ 3.930498] [drm] add ip block number 4 <sdma_v3_0>
[ 3.930498] [drm] add ip block number 5 <powerplay>
[ 3.930499] [drm] add ip block number 6 <dm>
[ 3.930500] [drm] add ip block number 7 <uvd_v6_0>
[ 3.930500] [drm] add ip block number 8 <vce_v3_0>
[ 3.930715] [drm] UVD is enabled in physical mode
[ 3.930715] [drm] VCE enabled in physical mode
[ 3.930743] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 3.930751] amdgpu 0000:0a:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
[ 3.930753] amdgpu 0000:0a:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 3.930758] [drm] Detected VRAM RAM=4096M, BAR=256M
[ 3.930759] [drm] RAM width 512bits HBM
[ 3.930838] [drm] amdgpu: 4096M of VRAM memory ready
[ 3.930841] [drm] amdgpu: 4096M of GTT memory ready.
[ 3.930860] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 3.930928] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
[ 3.934174] [drm] Chained IB support enabled!
[ 3.940198] amdgpu: [powerplay] hwmgr_sw_init smu backed is fiji_smu
[ 3.941748] [drm] Found UVD firmware Version: 1.91 Family ID: 12
[ 3.941752] [drm] UVD ENC is disabled
[ 3.943542] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
[ 4.009146] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[ 4.040084] [drm] Display Core initialized with v3.2.48!
[ 4.040542] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 4.040543] [drm] Driver supports precise vblank timestamp query.
[ 4.067774] [drm] UVD initialized successfully.
[ 4.168780] [drm] VCE initialized successfully.
[ 4.170163] [drm] Cannot find any crtc or sizes
[ 4.171948] [drm] Initialized amdgpu 3.35.0 20150101 for 0000:0a:00.0 on minor 0
[ 7.280062] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on uvd (-110).
[ 8.400365] amdgpu 0000:0a:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vce0 (-110).
[ 8.400370] [drm:process_one_work] *ERROR* ib ring test failed (-110).

In , delentef (delentef-linux-kernel-bugs) wrote :

Hello, I have the same problem on a Huawei Matebook D laptop; the processor is an AMD Ryzen 5 with an integrated Radeon Vega Mobile GPU.

I use Fedora 31. The problem appeared when upgrading from the 5.3.16 kernel to the 5.4.6 kernel. Reverting to 5.3.16 solved the issue.

At some moments the UI (XFCE) freezes for about 5 seconds; I can move the mouse cursor but I can't get any keyboard input (not in X, not by switching console). Each time the freeze occurs dmesg shows the messages

[ 45.530374] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[ 50.139408] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I include /proc/cpuinfo and lspci outputs.

In , delentef (delentef-linux-kernel-bugs) wrote :

Created attachment 286899
/proc/cpuinfo

In , delentef (delentef-linux-kernel-bugs) wrote :

Created attachment 286901
lspci output

In , mh (mh-linux-kernel-bugs) wrote :

Hi. I have already reported this bug here: https://gitlab.freedesktop.org/drm/amd/issues/953

If possible try a 5.5-rc kernel and see if it's fixed there. It's fixed - at least for me - in the drm-tree.

Best regards
Matthias

In , sellis (sellis-linux-kernel-bugs) wrote :

I'm seeing the same issue on Ubuntu 18.04 with

Upstream PPA "sudo add-apt-repository ppa:oibaf/graphics-drivers"

[ 321.412530] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 326.286306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4447, emitted seq=4449
[ 326.286395] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mythfrontend.re pid 2410 thread mythfronte:cs0 pid 2880

AMDGPUPRO driver 19.50-967956

[20913.330563] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20918.450513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
[20923.570306] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20928.690699] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
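
The `amdgpu_job_timedout` lines quoted in this report come in two shapes: one with `signaled seq=` / `emitted seq=` counters and one ending in `but soft recovered`. When comparing logs from different machines, a small parser can help. The following is only a sketch based on the line formats pasted in this thread; the function name `parse_timeouts` and the `pending` field (emitted minus signaled, read here as a rough count of outstanding fences) are illustrative assumptions, not something the driver defines.

```python
import re

# Pattern derived from the log lines quoted in this thread, e.g.
#   "... *ERROR* ring gfx timeout, signaled seq=4447, emitted seq=4449"
#   "... *ERROR* ring gfx timeout, but soft recovered"
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"(?:signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
    r"|(?P<soft>but soft recovered))"
)


def parse_timeouts(lines):
    """Return one dict per ring-timeout line found in `lines`."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        event = {
            "ring": match.group("ring"),
            "soft_recovered": match.group("soft") is not None,
        }
        if match.group("signaled") is not None:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            # "pending" is a hypothetical convenience field, not a driver term.
            event.update(signaled=signaled, emitted=emitted,
                         pending=emitted - signaled)
        events.append(event)
    return events


# Two lines copied from the comment above.
SAMPLE = [
    "[  326.286306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=4447, emitted seq=4449",
    "[20918.450513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered",
]

if __name__ == "__main__":
    for event in parse_timeouts(SAMPLE):
        print(event)
```

Feeding it the output of `dmesg` or a syslog file line by line should work the same way, as long as the messages keep the format shown here.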

In , mh (mh-linux-kernel-bugs) wrote :

Hi,

For me this bug is fixed with a 5.5 kernel, and I'm wondering if it is fixed for all of you, too.

Best
Matthias

In , j.cordoba (j.cordoba-linux-kernel-bugs) wrote :

I agree. Fixed for me too

In , udovdh (udovdh-linux-kernel-bugs) wrote :

I still see them on 5.6.13:

[191571.372560] sd 11:0:0:0: [sde] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
[205796.424607] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4518280, emitted seq=4518282
[205796.424637] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process mpv pid 488243 thread mpv:cs0 pid 488257
[205796.424640] amdgpu 0000:0a:00.0: GPU reset begin!
[205800.840504] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[205800.937565] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205800.938060] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[205800.938849] [drm] PSP is resuming...
[205800.958729] [drm] reserve 0x400000 from 0xf47f800000 for PSP TMR
[205800.972414] [drm] psp command (0x5) failed and response status is (0xFFFF0007)
[205801.176411] amdgpu 0000:0a:00.0: RAS: ras ta ucode is not available
[205801.460775] [drm] kiq ring mec 2 pipe 1 q 0
[205801.460986] amdgpu 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0002 address=0x800002300 flags=0x0000]
[205801.516698] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[205801.516709] amdgpu 0000:0a:00.0: ring gfx uses VM inv eng 0 on hub 0
[205801.516713] amdgpu 0000:0a:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[205801.516717] amdgpu 0000:0a:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[205801.516720] amdgpu 0000:0a:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[205801.516724] amdgpu 0000:0a:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[205801.516727] amdgpu 0000:0a:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[205801.516730] amdgpu 0000:0a:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[205801.516733] amdgpu 0000:0a:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[205801.516736] amdgpu 0000:0a:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[205801.516740] amdgpu 0000:0a:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[205801.516743] amdgpu 0000:0a:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[205801.516746] amdgpu 0000:0a:00.0: ring vcn_dec uses VM inv eng 1 on hub 1
[205801.516749] amdgpu 0000:0a:00.0: ring vcn_enc0 uses VM inv eng 4 on hub 1
[205801.516752] amdgpu 0000:0a:00.0: ring vcn_enc1 uses VM inv eng 5 on hub 1
[205801.516755] amdgpu 0000:0a:00.0: ring jpeg_dec uses VM inv eng 6 on hub 1
[205801.525996] [drm] recover vram bo from shadow start
[205801.525998] [drm] recover vram bo from shadow done
[205801.526008] [drm] Skip scheduling IBs!
[205801.526051] amdgpu 0000:0a:00.0: GPU reset(1) succeeded!
[205802.536444] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4518342, emitted seq=4518344
[205802.536523] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 3825 thread gnome-shel:cs0 pid 3834
[205802.536531] amdgpu 0000:0a:00.0: GPU reset begin!
[205806.728558] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[205806.821326] amdgpu 0000:0a:00.0: GPU reset succeeded, trying to resume
[205806.821578] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[205806.821899] [drm] PSP is...

In , panospolychronis (panospolychronis-linux-kernel-bugs) wrote :

The problem still exists with Linux Kernel 5.8-rc1 from git. (My graphics card is Radeon 5600XT)

[20581.087159] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2768656, emitted seq=2768658
[20581.087212] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process DOOMEternalx64v pid 8875 thread DOOMEternalx64v pid 8875
[20581.087217] amdgpu 0000:29:00.0: amdgpu: GPU reset begin!
[20583.381257] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[20585.087232] amdgpu 0000:29:00.0: amdgpu: failed to suspend display audio
[20585.156036] snd_hda_codec_hdmi hdaudioC0D0: HDMI: ELD buf size is 0, force 128
[20585.156052] snd_hda_codec_hdmi hdaudioC0D0: HDMI: invalid ELD data byte 0
[20585.463157] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20585.463205] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[20585.694999] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[20585.695047] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[20585.926951] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[20588.045497] amdgpu 0000:29:00.0: amdgpu: GPU reset succeeded, trying to resume
[20588.045605] [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
[20588.045682] [drm] VRAM is lost due to GPU reset!
[20588.048023] [drm] PSP is resuming...
[20588.218089] [drm] reserve 0x900000 from 0x817e400000 for PSP TMR
[20588.287093] amdgpu 0000:29:00.0: amdgpu: RAS: optional ras ta ucode is not available
[20588.293101] amdgpu: SMU is resuming...
[20588.295088] amdgpu: SMU is resumed successfully!
[20588.413155] [drm] kiq ring mec 2 pipe 1 q 0
[20588.417493] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[20588.417632] [drm] JPEG decode initialized successfully.
[20588.417690] amdgpu 0000:29:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[20588.417693] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[20588.417697] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[20588.417700] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[20588.417703] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[20588.417707] amdgpu 0000:29:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[20588.417709] amdgpu 0000:29:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[20588.417713] amdgpu 0000:29:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[20588.417716] amdgpu 0000:29:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[20588.417719] amdgpu 0000:29:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[20588.417721] amdgpu 0000:29:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[20588.417724] amdgpu 0000:29:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[20588.417726] amdgpu 0000:29:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[20588.417728] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[20588.417730] amdgpu 0000:29:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on h...

In , randyk161 (randyk161-linux-kernel-bugs) wrote :

I've been getting "ring gfx timeout" errors for some time, mostly when the computer has not had any input for a while (while I'm away from it). When it freezes I can SSH into it, but when I run "shutdown -h now" it boots me out of SSH as it should, yet the computer never seems to actually shut down. The screen stays frozen with whatever was on the display when it froze. Any help would be greatly appreciated; here is my info:

Mobo: AsRock AB350 Pro4 UEFI: 5.80
Video card: Sapphire Nitro+ RX580 (8GB)
Distro: Manjaro
Kernel: 5.7.9-1-MANJARO

Aug 09 21:33:06.054857 kernel: pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
Aug 09 21:33:06.068305 kernel: pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
Aug 09 21:33:06.068636 kernel: pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/00000000
Aug 09 21:33:06.068863 kernel: pcieport 0000:00:03.1: AER: [21] ACSViol (First)
Aug 09 21:33:06.069137 kernel: amdgpu 0000:0a:00.0: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069421 kernel: snd_hda_intel 0000:0a:00.1: AER: can't recover (no error_detected callback)
Aug 09 21:33:06.069633 kernel: pcieport 0000:00:03.1: AER: device recovery failed
Aug 09 21:33:16.258283 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=9087, emitted seq=9089
Aug 09 21:33:16.258412 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Aug 09 21:33:16.258446 kernel: amdgpu 0000:0a:00.0: GPU reset begin!
Aug 09 21:33:16.258741 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Aug 09 21:33:16.258773 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.258803 kernel: amdgpu: [powerplay]
                                failed to send message 261 ret is 65535
Aug 09 21:33:16.258835 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.258869 kernel: amdgpu: [powerplay]
                                failed to send message 261 ret is 65535
Aug 09 21:33:16.258896 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.258925 kernel: amdgpu: [powerplay]
                                failed to send message 261 ret is 65535
Aug 09 21:33:16.258951 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.258977 kernel: amdgpu: [powerplay]
                                failed to send message 261 ret is 65535
Aug 09 21:33:16.259009 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.259035 kernel: amdgpu: [powerplay]
                                failed to send message 261 ret is 65535
Aug 09 21:33:16.259060 kernel: amdgpu: [powerplay]
                                last message was failed ret is 65535
Aug 09 21:33:16.259084 kernel: amdgpu: [powerplay]
                            ...

In , inferrna (inferrna-linux-kernel-bugs) wrote :

I have the same bug with Firefox (it has happened about once a day, starting about a week ago).

[ 4409.071226] BUG: unable to handle page fault for address: fffffffffffffff8
[ 4409.071234] #PF: supervisor read access in kernel mode
[ 4409.071235] #PF: error_code(0x0000) - not-present page
[ 4409.071237] PGD 427e12067 P4D 427e12067 PUD 427e14067 PMD 0
[ 4409.071240] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 4409.071242] CPU: 18 PID: 191 Comm: uvd Tainted: G OE 5.16.8uksm #1
[ 4409.071245] Hardware name: Hewlett-Packard HP Z420 Workstation/1589, BIOS J61 v03.96 10/29/2019
[ 4409.071246] RIP: 0010:swake_up_locked+0x17/0x40
[ 4409.071251] Code: ff ff ff eb ad 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 0f 1f 44 00 00 48 8b 57 08 48 8d 47 08 48 39 c2 74 25 53 48 8b 5f 08 <48> 8b 7b f8 e8 80 7f fe ff 48 8b 13 48 8b 43 08 48 89 42 08 48 89
[ 4409.071253] RSP: 0018:ffffbbdf012b7e70 EFLAGS: 00010007
[ 4409.071254] RAX: ffff9719549270b0 RBX: 0000000000000000 RCX: 0000000000000000
[ 4409.071256] RDX: 0000000000000000 RSI: ffff97185d547250 RDI: ffff9719549270a8
[ 4409.071257] RBP: ffff9719549270a8 R08: ffff9716473efec0 R09: ffff9716473efed8
[ 4409.071258] R10: ffff971646cc3000 R11: ffff971646cc3000 R12: 0000000000000286
[ 4409.071259] R13: ffff9716473eebe0 R14: ffff9716ee901bc0 R15: ffff9719549270a0
[ 4409.071260] FS: 0000000000000000(0000) GS:ffff97213fc80000(0000) knlGS:0000000000000000
[ 4409.071262] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4409.071263] CR2: fffffffffffffff8 CR3: 0000000427e10006 CR4: 00000000001706e0
[ 4409.071264] Call Trace:
[ 4409.071267] <TASK>
[ 4409.071269] complete+0x2f/0x40
[ 4409.071271] drm_sched_main+0x24b/0x450
[ 4409.071274] ? wait_woken+0x70/0x70
[ 4409.071289] ? drm_sched_job_done.isra.0+0x130/0x130
[ 4409.071290] kthread+0x169/0x190
[ 4409.071294] ? set_kthread_struct+0x40/0x40
[ 4409.071297] ret_from_fork+0x1f/0x30
[ 4409.071301] </TASK>
[ 4409.071302] Modules linked in: xt_conntrack nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter cmac rfcomm vboxnetadp(OE) vboxnetflt(OE) iptable_mangle xt_CHECKSUM xt_tcpudp iptable_nat xt_comment xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc overlay iptable_filter vboxdrv(OE) bnep cpufreq_powersave zram binfmt_misc squashfs snd_emu10k1_synth snd_hda_codec_realtek snd_emux_synth snd_seq_midi_emul snd_seq_virmidi snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel intel_rapl_msr snd_intel_dspcfg intel_rapl_common snd_emu10k1 snd_hda_codec snd_util_mem snd_ac97_codec snd_hda_core nls_iso8859_1 hp_wmi nls_cp866 ac97_bus platform_profile sparse_keymap snd_hwdep wmi_bmof btusb snd_pcm sb_edac btrtl x86_pkg_temp_thermal intel_powerclamp snd_seq_midi btbcm snd_seq_midi_event btintel snd_rawmidi kvm_intel bluetooth input_leds snd_seq kvm ecdh_generic snd_seq_device snd_timer irqbypass emu10k1_gp serio_raw snd gameport ioatdma soundcore dca
[ 4409.071342] wmi mac_hid xpad ff_memless coretemp mei_me mei hwmon_vid i5500_temp msr ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid crc32_pclmul ghash_clmulni_intel aesni_intel e1000e psmou...

In , randyk161 (randyk161-linux-kernel-bugs) wrote :

So I've been running for about 2.5 weeks now using the amdgpu.runpm=0 kernel parameter and I've had no crashes or freezes so far. I'm cautiously optimistic that for me at least this may have solved the problem. So far I haven't noticed any side effects (performance degradation etc.).

I understand that amdgpu.runpm=0 is related to power management but I don't know the specifics. Possibly Alex Deucher can chime in and specify exactly what this parameter does?

See my previous comments for some context:
comment 35
comment 62
comment 63

In , alexdeucher (alexdeucher-linux-kernel-bugs) wrote :

(In reply to Randune from comment #66)
>
> I understand that amdgpu.runpm=0 is related to power management but I don't
> know the specifics. Possibly Alex Deucher can chime in and specify exactly
> what this parameter does?

The runpm parameter allows you to disable runtime power management which powers down dGPUs at runtime if they are not being used (e.g., hybrid graphics laptops or desktop systems with multiple GPUs) to save power. It does not affect dynamic power management while the chip is powered up. Disabling it will increase idle power usage.
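
To confirm which value of `runpm` (or `dpm`, mentioned elsewhere in this thread) the loaded driver is actually using, the module parameters can usually be read back from sysfs. The sketch below assumes the parameters are exported under `/sys/module/amdgpu/parameters/`, which may not hold on every kernel or configuration; `read_module_params` is a hypothetical helper name.

```python
from pathlib import Path

# Assumption: a loaded module exposes its readable parameters as files under
# /sys/module/<module>/parameters/<name>.
DEFAULT_PARAM_DIR = Path("/sys/module/amdgpu/parameters")


def read_module_params(names, param_dir=DEFAULT_PARAM_DIR):
    """Return {name: value, or None if unreadable} for each requested parameter."""
    values = {}
    for name in names:
        try:
            values[name] = (Path(param_dir) / name).read_text().strip()
        except OSError:
            # Module not loaded, parameter not exported, or no permission.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_module_params(["runpm", "dpm"]).items():
        print(f"{name} = {value if value is not None else 'unavailable'}")
```

The directory is a parameter so the helper can be pointed at a copy of the files when the affected machine is not at hand.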

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

I had this problem with a Ryzen 3 3200 CPU (integrated Vega 8) on an A320M-DVS R4.0 motherboard.
microcode: CPU: patch_level=0x08108109
microcode: Microcode Update Driver: v2.2.

I had a scenario that triggered the freeze 100% of the time:
1. Play a video (in a web browser or video player; it should stay visible, so don't hide the tab or minimize the window).
2. Open the Shadertoy website (any shader; keep it rendering and keep the window visible).
3. Open any OpenGL or Vulkan application (that uses the integrated GPU).
4. Repeatedly press the fullscreen/un-fullscreen button on the Shadertoy shader (about 5 times is enough to trigger the bug; the system then slows down gradually over the next 10-20 minutes until it freezes, which is visible on the Shadertoy FPS counter).
... and freeze

I have used this PC for 2 years, and every Linux kernel had this "freeze" when the integrated GPU was in use. The current kernel is openSUSE 5.17.4-1-default.
(My solution all this time was the obvious one: disable the integrated GPU in the BIOS and use only the discrete one, and everything works.)

Today I checked the motherboard website, https://asrock.com/MB/AMD/A320M-DVS%20R4.0/index.asp#BIOS ; they have BIOS 7.00 and 7.10, and I was on BIOS 4.00.
So I updated the BIOS to 7.00 and then 7.10 (current)... and everything works, no freezes anymore.
So it was a firmware problem (at least for me) that was fixed by the BIOS update.

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

Edit: I got a freeze after using the PC for 4 hours. Before the update, 20 minutes was the longest I could use the integrated GPU, so it looks like it is not fixed completely, just improved somewhat (or I just got lucky)... I'm back to using the discrete GPU.

In , martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote :

My Ubuntu 20.04 desktop is crashing several times per day due to this bug since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in both computers, so I assume this is probably more related to the mainboard/CPU than to the graphics card.

The crashes from today:

```
martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed to initialize parser'
Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603
Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread firefox:cs0 pid 5123
Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750605, emitted seq=1750608
Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread firefox:cs0 pid 5123
Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216293, emitted seq=216295
Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 5237 thread firefox:cs0 pid 5302
Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216297, emitted seq=216300
Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264406, emitted seq=264408
Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread firefox:cs0 pid 10633
Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264410, emitted seq=264413
Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread firefox:cs0 pid 10633
Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390985, emitted seq=2390987
Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 4922 thread firefox:cs0 pid 4989
Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390989, emitted seq=2390992
Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu: WALKER_ERROR: 0x0
Jun 11 23:29:51 martin kernel: [25591.333485] amdgpu 0000:2d:00.0: amdgpu: MAPPING_ERROR: 0x0
Jun 11 23:30:01 martin kernel: [25601.412838] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring uvd_...
```

In , panospolychronis (panospolychronis-linux-kernel-bugs) wrote :

(In reply to Martin von Wittich from comment #70)
> My Ubuntu 20.04 desktop is crashing several times per day due to this bug
> since I've upgraded my computer from an old Intel Xeon to an AMD Ryzen 9
> 5900X on a B550 mainboard. I've had the same AMD RX Vega 56 graphics card in
> both computers, so I assume this is probably more related to the
> mainboard/CPU than to the graphics card.
>
> The crashes from today:
>
> ```
> martin@martin ~ % grep amdgpu /var/log/syslog | grep ERROR | grep -v 'Failed
> to initialize parser'
> Jun 11 03:15:33 martin kernel: [21494.642889] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750601, emitted seq=1750603
> Jun 11 03:15:33 martin kernel: [21494.643055] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 03:15:50 martin kernel: [21511.795007] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1750605, emitted seq=1750608
> Jun 11 03:15:50 martin kernel: [21511.795174] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5037 thread
> firefox:cs0 pid 5123
> Jun 11 15:56:07 martin kernel: [ 1477.069969] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216293, emitted seq=216295
> Jun 11 15:56:07 martin kernel: [ 1477.070140] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 5237 thread
> firefox:cs0 pid 5302
> Jun 11 15:56:22 martin kernel: [ 1492.174077] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=216297, emitted seq=216300
> Jun 11 15:56:22 martin kernel: [ 1492.174248] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
> Jun 11 16:03:28 martin kernel: [ 1918.161101] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264406, emitted seq=264408
> Jun 11 16:03:28 martin kernel: [ 1918.161271] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 16:03:49 martin kernel: [ 1938.385307] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=264410, emitted seq=264413
> Jun 11 16:03:49 martin kernel: [ 1938.385479] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 10569 thread
> firefox:cs0 pid 10633
> Jun 11 23:28:12 martin kernel: [25491.854294] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390985, emitted seq=2390987
> Jun 11 23:28:12 martin kernel: [25491.854460] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process firefox pid 4922 thread
> firefox:cs0 pid 4989
> Jun 11 23:28:28 martin kernel: [25507.982446] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2390989, emitted seq=2390992
> Jun 11 23:28:28 martin kernel: [25507.982613] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
> Jun 11 23:29:51 martin kernel: [25591.333483] amdgpu 0000:2d:00.0: amdgpu:
> WALKER_ERROR: 0x0
> Jun 11 23:29:51 martin kernel: [25591.333485] am...

In , martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote :

I can confirm that adding "amdgpu.dpm=0" to the kernel command line seems to resolve this issue - I enabled that option on 2022-06-12 13:24, and my system didn't crash at all on 2022-06-12 - 2022-06-14 (I was on vacation from 2022-06-15 on and didn't use my computer from then on).

I don't use Linux for gaming and therefore can't comment on how badly this affects gaming performance, but I did notice that mpv could no longer play 1080p x264 video without stuttering when it defaults to --vo=gpu. Using another --vo such as sdl seems to be a viable workaround.

> Did you try with the latest Linux kernel? I had a lot of GPU lockups like this. Also try these kernel parameters: "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" (you might also try amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff)

I'll try these next.
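
The three `amdgpu.ppfeaturemask` values suggested above differ only in which bits are cleared. The thread does not say which power feature each bit controls (that mapping lives in the kernel's amdgpu headers and is not reproduced here), but the bit positions themselves are easy to compare. A minimal sketch, with `cleared_bits` as an illustrative helper name and a 32-bit mask width assumed:

```python
def cleared_bits(mask, width=32):
    """Return the bit positions (0 = least significant) that are 0 in `mask`."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


if __name__ == "__main__":
    # The three values suggested in the quoted comment.
    for text in ("0xffffbffb", "0xfffd7fff", "0xffffffff"):
        print(text, "clears bits", cleared_bits(int(text, 16)))
```

Listing the cleared positions makes it clear that the suggested masks each disable a different pair of features, so results from one mask do not necessarily carry over to another.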

In , martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote :

Sorry, forgot to mention in my last post and now can't edit: interestingly enough, the attached video "5 second video clip that triggers a crash" still successfully triggers the crash.

Seems to me like the root issue isn't actually in the dynamic power management code, but somewhere else, and the DPM is just one of several things that can trigger it?

In , martin.von.wittich (martin.von.wittich-linux-kernel-bugs) wrote :

> Did you try with the latest Linux kernel? I had a lot of GPU lockups like this. Also try these kernel parameters: "amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" (you might also try amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff)

I can confirm that at least on the current Ubuntu linux-image-oem-20.04d kernel, these options do not resolve the issue:

```
martin@martin ~ % uname -a
Linux martin 5.14.0-1042-oem #47-Ubuntu SMP Fri Jun 3 18:17:11 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
martin@martin ~ % cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-5.14.0-1042-oem root=UUID=1bd000ac-1487-4457-be1a-5ea901ded9e9 ro amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt quiet
martin@martin ~ % dmesg -T | grep 'ring gfx timeout'
[Mi Jun 22 14:48:07 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820983, emitted seq=1820985
[Mi Jun 22 14:48:18 2022] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1820987, emitted seq=1820990
```

I enabled these options on 2022-06-20 14:14 UTC+2; this is the first crash I've encountered since then.

I have no idea how to build the latest kernel and therefore haven't tested that yet.

I'll now revert back to amdgpu.dpm=0.
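
When switching between sets of kernel parameters like this, it is easy to lose track of what the running kernel was actually booted with. The following sketch extracts the `amdgpu.*` options from a command line such as the `/proc/cmdline` output shown above. The helper name `amdgpu_cmdline_options` is illustrative, and the simple whitespace split assumes no quoted values containing spaces.

```python
def amdgpu_cmdline_options(cmdline):
    """Return {option: value} for every amdgpu.* token on a kernel command line."""
    prefix = "amdgpu."
    options = {}
    for token in cmdline.split():
        if not token.startswith(prefix):
            continue
        # "amdgpu.dpm=0" -> name "dpm", value "0"; a bare flag gets value "".
        name, _, value = token[len(prefix):].partition("=")
        options[name] = value
    return options


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            cmdline = fh.read()
    except OSError:
        cmdline = ""  # not on Linux, or /proc is unavailable
    for name, value in sorted(amdgpu_cmdline_options(cmdline).items()):
        print(f"amdgpu.{name}={value}")
```

Non-amdgpu options such as `amd_iommu=on` and `iommu=pt` are deliberately ignored here; widening the prefix check would include them.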

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

> Did you try with the latest Linux Kernel? I had a lot of gpu lockups like
> this. Also try these kernel parameters : "amdgpu.ppfeaturemask=0xffffbffb
> amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1
> amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt" (you might also
> try amdgpu.ppfeaturemask=0xfffd7fff or amdgpu.ppfeaturemask=0xffffffff)

I tried.

my kernel:
"Linux 5.17.4-1-default #1 SMP PREEMPT Wed Apr 20 07:43:03 UTC 2022 (75e9961) x86_64 x86_64 x86_64 GNU/Linux"

(The video linked above was not able to freeze the integrated AMD GPU for me; that was in my earlier test, with no kernel parameters.)

The result is surprising: no crash or freeze for 4+ hours already, and I launched lots of apps that used to cause the freeze for me.

As I described above (https://bugzilla.kernel.org/show_bug.cgi?id=201957#c68), for me this freeze happened only when I used OpenGL/Vulkan with video in the background (everything on the integrated GPU). From the user's point of view, when the bug triggered (randomly), everything slowly dropped to lower and lower FPS: apps that ran at 60 FPS in fullscreen dropped to 5 FPS, and video also dropped to 5-10 FPS (the UI was still responsive)... followed by a freeze within the next few minutes or seconds.

Full kernel boot options now: "splash=silent quiet amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 amdgpu.deep_color=1 amd_iommu=on iommu=pt"
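
To see at a glance what such a command line sets, here is a minimal Python sketch that collects the `amdgpu.*` options and lists which bits each suggested `ppfeaturemask` value leaves cleared. `amdgpu_params` and `cleared_bits` are hypothetical helper names; the sketch deliberately reports only bit positions and does not try to name the power features behind them.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            name, value = token[len("amdgpu."):].split("=", 1)
            params[name] = value
    return params


def cleared_bits(mask, width=32):
    """Return the bit positions that are 0 in `mask` (bit 0 = least significant)."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]


cmdline = (
    "splash=silent quiet amdgpu.ppfeaturemask=0xffffbffb amdgpu.noretry=0 "
    "amdgpu.lockup_timeout=0 amdgpu.gpu_recovery=1 amdgpu.audio=0 "
    "amdgpu.deep_color=1 amd_iommu=on iommu=pt"
)
print(amdgpu_params(cmdline))

# The three mask values mentioned in the quoted suggestion.
for text in ("0xffffbffb", "0xfffd7fff", "0xffffffff"):
    print(text, "leaves bits cleared:", cleared_bits(int(text, 16)))
```

On a running system the same function could be fed the contents of `/proc/cmdline` instead of a pasted string.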

Now, after booting with these options, I see the following:

Just after boot everything works (OpenGL/Vulkan acceleration on the integrated GPU) with the expected performance.

After trying to "trigger the bug" (opening multiple OpenGL apps with Vulkan and WebGL and playing many videos), OpenGL and Vulkan drop to 20 FPS (constant, even for a single fullscreen triangle) and WebGL2 no longer works in the web browser (even after a browser restart), but video still plays at 60 FPS with no lag, and the system UI does not lag either.

So it looks like GPU graphics acceleration just drops into a very low performance mode, while everything else works fine. (Launching graphics apps (native only) on the Nvidia GPU also works at 60 FPS as expected.)

Interesting: since the FPS dropped to 20 I can no longer launch anything in Wine (any version, including Proton), although it worked right after boot. I launched a few apps after boot and checked them again once the GPU FPS had dropped; Wine always crashes with:
"wine: Unhandled page fault on execute access to 00007F894E200460 at address 00007F894E200460 (thread 0070), starting debugger..."
(Not being able to use Wine is a big disadvantage.)

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

Wine problem: this happened because the '/usr/share/vulkan/icd.d/nvidia_icd.json' file was deleted. I have no idea how or why this happened when the AMD GPU dropped its FPS (the file obviously exists when I use just the Nvidia GPU with the integrated AMD GPU disabled).

So the fix for Wine is going to be: "VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.x86_64.json winecfg"

Super weird, but I think the Wine problem is fixed.

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

But even creating nvidia_icd.json with this content:
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/usr/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.3.211"
    }
}

does not help Wine; it still crashes with the same error when trying to use or initialize Nvidia.
But I can use Nvidia outside of Wine from native apps (and Vulkan works), so it must be related to the AMD GPU driver somehow. (This did not happen before; it is the first time I have seen Wine crash this way, compared with my previous tests of the integrated AMD GPU.)

P.S. I have a second PC with the same AMD Vega 8 integrated GPU, and there it works fine (it has never crashed or frozen even once). The other PC has a different motherboard, which is why I originally thought this was a motherboard problem, but the current boot options help make the integrated GPU stable on this PC.

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

(I made a small mistake in organizing my files; creating nvidia_icd.json with the content listed above is enough to fix Wine for me. Everything works now.)
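
Since a missing or malformed manifest was the culprit here, a small Python sketch can regenerate the file content and run a basic sanity check on it. `build_icd_manifest` and `check_icd_manifest` are hypothetical helpers, the library path and API version are simply the values quoted above, and the check only covers the keys shown in this thread, not the full manifest format.

```python
import json


def build_icd_manifest(library_path, api_version):
    """Build a dictionary shaped like the nvidia_icd.json quoted above."""
    return {
        "file_format_version": "1.0.0",
        "ICD": {
            "library_path": library_path,
            "api_version": api_version,
        },
    }


def check_icd_manifest(text):
    """Parse manifest text and return a list of problems (empty if none found)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    icd = data.get("ICD")
    if not isinstance(icd, dict):
        return ["missing 'ICD' object"]
    problems = []
    for key in ("library_path", "api_version"):
        if not isinstance(icd.get(key), str):
            problems.append(f"missing or non-string ICD.{key}")
    return problems


manifest = build_icd_manifest("/usr/lib64/libGLX_nvidia.so.0", "1.3.211")
text = json.dumps(manifest, indent=4)
print(text)
print(check_icd_manifest(text))
```

The printed JSON could then be saved as /usr/share/vulkan/icd.d/nvidia_icd.json; the library path may differ on other distributions.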

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

Updated to kernel 5.18.4-1-default #1 SMP PREEMPT_DYNAMIC Wed Jun 15 06:00:33 UTC 2022 (ed6345d) x86_64 x86_64 x86_64 GNU/Linux (the latest openSUSE kernel for now).

The freeze on my integrated AMD GPU seems completely fixed, even without the previous boot options (on 5.17 it froze without them). The integrated GPU also no longer goes into "low performance mode forever" (as it did with the boot options before); it keeps working for hours at max performance, without the earlier slowdown.

... but now the Nvidia GPU no longer works alongside AMD (when the integrated GPU is the main GPU). This is with the Nvidia 515.48.07 driver (the latest now), in both X11 and Wayland. The Nvidia driver is correctly installed and the device is visible (nvidia-smi works and vulkaninfo --summary lists the Nvidia GPU correctly), but any application crashes when creating a Vulkan surface on the Nvidia device. (Just tested: with the AMD integrated GPU disabled and booting on Nvidia, everything works, Vulkan included.)

So fixing the integrated AMD GPU results in Nvidia no longer working... okay (I'm back to using the discrete Nvidia GPU only).

In , jrch2k10 (jrch2k10-linux-kernel-bugs) wrote :

Same issue here (on the LTS kernel as well), with:

Linux archlinux 5.18.7-262-tkg-pds #1 TKG SMP PREEMPT_DYNAMIC Mon, 27 Jun 2022 15:50:06 +0000 x86_64 GNU/Linux

[11090.086287] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.086296] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.086302] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.195133] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.195139] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.195143] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.195150] [drm] Cannot get clockgating state when UVD is powergated.
[11090.195152] [drm] Cannot get clockgating state when VCE is powergated.
[11090.695288] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11090.699331] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11091.194893] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11091.194898] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11091.194901] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11091.194908] [drm] Cannot get clockgating state when UVD is powergated.
[11091.194909] [drm] Cannot get clockgating state when VCE is powergated.
[11091.695473] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11092.194965] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11092.194969] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11092.194973] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11092.194979] [drm] Cannot get clockgating state when UVD is powergated.
[11092.194980] [drm] Cannot get clockgating state when VCE is powergated.
[11092.695749] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11093.195046] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11093.195050] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11093.195053] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11093.195060] [drm] Cannot get clockgating state when UVD is powergated.
[11093.195061] [drm] Cannot get clockgating state when VCE is powergated.
[11093.695004] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11094.195065] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11094.195070] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11094.195074] amdgpu 0000:02:00.0: amdgpu:
               last message was failed ret is 65535
[11094.195082] [drm] Cannot get clockgating state when UVD is powergated.
[11094.195083] [drm] Cannot get clockgating state when VCE is powergated.
[11094.695286] amdgpu 0000:02:00.0: amdgpu:
               last mess...

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

Nvidia released the 515.57 drivers, which fix "Nvidia being broken when used as second GPU in Linux" (my bug above).
The Nvidia GPU works again when the AMD GPU is the main GPU.

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

After using this PC for a few days with the AMD Vega 8 (integrated) as the main GPU I see no freezes at all. (Back in 2021 it froze every 10-20 minutes, so I had to use Nvidia as the main GPU.)
(It works with and without the kernel boot options listed above.)

I use the openSUSE kernel 5.18.4-1-default (not going to update for some time, because it works).

Maybe it is only fixed for my motherboard + CPU combination. My hardware:
Ryzen3 3200 CPU (Vega8 integrated) on A320M-DVS R4.0 motherboard.
microcode: CPU: patch_level=0x08108109
microcode: Microcode Update Driver: v2.2.

Wayland and X11 work, with Nvidia as the second GPU.
Wayland slows down (the whole UI drops to about 1-2 FPS) once after a few hours of use, but this is fixed just by switching to the system terminal (Ctrl+Alt+F1) and back; nothing crashes, and video apps and graphics keep working.

Integrated GPU performance still goes down (randomly, after 2-6 hours of PC use) and never comes back, but that's fine (since I have the Nvidia second GPU for complex graphics). Vega 8 performance goes down only in complex shaders: FPS drops from 60 in fullscreen (1080p) to 10-20 on complex raymarching shaders, but for the system UI (Wayland/X11 GNOME 42) this is not noticeable, and video plays at 60 FPS as expected. (Sleep mode also works; not every time (because of Nvidia) but most of the time, the same as when I used Nvidia as the main GPU.)

In , s48gs.w (s48gs.w-linux-kernel-bugs) wrote :

Log for what I described above ("fixed just by switching to the system terminal (Ctrl+Alt+F1)"): nothing crashes and even GPU apps keep working; there is just a huge mouse + UI freeze, and switching to the F1 terminal and back fixes it (Wayland).
Logs:

Jul 17 22:54:04 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:09 home-danil kernel: amdgpu 0000:07:00.0: amdgpu: Failed to send Message 7.
Jul 17 22:54:12 home-danil kernel: ------------[ cut here ]------------
Jul 17 22:54:12 home-danil kernel: WARNING: CPU: 1 PID: 1100 at drivers/gpu/drm/amd/amdgpu/../display/dc/clk_mgr/dcn10/rv1_clk_mgr_vbios_smu.c:120 rv1_vbios_smu_send_msg_with_param+0xa3/0xb0 [amdgpu]
Jul 17 22:54:12 home-danil kernel: Modules linked in: dm_crypt essiv authenc trusted asn1_encoder tee nvidia_uvm(POE) nvidia_modeset(POE) nvidia(POE) snd_seq_dummy snd_hrtimer snd_seq snd_seq_device af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set iscsi_ibft iscsi_boot_sysfs nfnetlink rfkill ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter qrtr vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs joydev intel_rapl_msr intel_rapl_common snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio edac_mce_amd snd_hda_intel snd_intel_dspcfg kvm_amd snd_intel_sdw_acpi snd_hda_codec r8169 pcspkr snd_hda_core kvm realtek snd_hwdep snd_pcm wmi_bmof mdio_devres snd_timer
Jul 17 22:54:12 home-danil kernel: libphy irqbypass snd soundcore efi_pstore i2c_piix4 gpio_amdpt gpio_generic acpi_cpufreq k10temp tiny_power_button nls_iso8859_1 squashfs nls_cp437 loop ext4 mbcache vfat jbd2 fat fuse configfs ip_tables x_tables hid_generic usbhid uas usb_storage amdgpu crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_dp_helper drm_kms_helper aesni_intel crypto_simd syscopyarea sysfillrect sysimgblt fb_sys_fops cryptd drm cec xhci_pci xhci_pci_renesas sp5100_tco ccp rc_core xhci_hcd usbcore wmi video button btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
Jul 17 22:54:12 home-danil kernel: CPU: 1 PID: 1100 Comm: systemd-logind Tainted: P OE 5.18.4-1-default #1 openSUSE Tumbleweed 59778fa2462c9ee971468464596d3fbe14e51d2e
Jul 17 22:54:12 home-danil kernel: Hardware name: To Be Filled By O.E.M. A320M-DVS R4.0/A320M-DVS R4.0, BIOS P7.10 12/23/2021
Jul 17 22:54:12 home-danil kernel: RIP: 0010:rv1_vbios_smu_send_msg_with_param+0xa3/0xb0 [amdgpu]
Jul 17 22:54:12 home-danil kernel: Code: 62 01 00 e8 8f 4e f5 ff 85 c0 74 d8 83 f8 01 75 19 48 8b 7d 00 5b be 93 62 01 00 48 c7 c2 00 99 cd c0 5d 41 5c e9 6d 4e f5 ff <0f> 0b eb e3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 81 c6 e7 03
Jul 17 22:54:12 home-danil kernel: RSP: 0018:ffff9f0a00b1f580 EFLAGS: 00010246
Jul 17 22:54:12 home-danil kernel: RAX: 00007570227d95d8 RBX: 00000000000000...

In , 291765088 (291765088-linux-kernel-bugs) wrote :

This is an AMD driver problem. You can contact me and I'll give you the final solution; email <email address hidden>. Communication may be more efficient in China.

In , hcarter1112 (hcarter1112-linux-kernel-bugs) wrote :

[67760.805903] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=19820784, emitted seq=19820786
[67760.806285] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process valheim.x86_64 pid 464107 thread valheim.x8:cs0 pid 464109
[67760.806667] amdgpu 0000:0d:00.0: amdgpu: GPU reset begin!
[67761.257012] amdgpu 0000:0d:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[67761.257232] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[67761.307862] [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:80:crtc-1] hw_done or flip_done timed out
[67761.516374] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* failed to halt cp gfx
[67761.542980] [drm] free PSP TMR buffer
[67761.587266] amdgpu 0000:0d:00.0: amdgpu: MODE1 reset
[67761.587269] amdgpu 0000:0d:00.0: amdgpu: GPU mode1 reset
[67761.587329] amdgpu 0000:0d:00.0: amdgpu: GPU smu mode1 reset
[67762.091974] amdgpu 0000:0d:00.0: amdgpu: GPU reset succeeded, trying to resume
[67762.092156] [drm] PCIE GART of 512M enabled (table at 0x0000008000300000).
[67762.092219] [drm] VRAM is lost due to GPU reset!
[67762.092220] [drm] PSP is resuming...
[67762.168492] [drm] reserve 0xa00000 from 0x8001000000 for PSP TMR
[67762.269801] amdgpu 0000:0d:00.0: amdgpu: RAS: optional ras ta ucode is not available
[67762.283510] amdgpu 0000:0d:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[67762.283513] amdgpu 0000:0d:00.0: amdgpu: SMU is resuming...
[67762.283516] amdgpu 0000:0d:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413900 (65.57.0)
[67762.283519] amdgpu 0000:0d:00.0: amdgpu: SMU driver if version not matched
[67762.283549] amdgpu 0000:0d:00.0: amdgpu: use vbios provided pptable
[67762.343739] amdgpu 0000:0d:00.0: amdgpu: SMU is resumed successfully!
[67762.345104] [drm] DMUB hardware initialized: version=0x02020017
[67762.615558] [drm] kiq ring mec 2 pipe 1 q 0
[67762.618728] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[67762.618910] [drm] JPEG decode initialized successfully.
[67762.618918] amdgpu 0000:0d:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[67762.618921] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[67762.618922] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[67762.618924] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[67762.618925] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[67762.618926] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[67762.618927] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[67762.618929] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[67762.618930] amdgpu 0000:0d:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[67762.618931] amdgpu 0000:0d:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[67762.618933] amdgpu 0000:0d:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[67762.618934] amdgpu 0000:0d:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[67762.618936] amd...

In , smf-linux (smf-linux-linux-kernel-bugs) wrote :

Created attachment 304307
Started testing kernel 6.4-rc3 and got the same problem

In , smf-linux (smf-linux-linux-kernel-bugs) wrote :

Is it worth the effort of bisecting this, given that it seems to affect a lot of kernel versions?

thanks

In , kernel.org (kernel.org-linux-kernel-bugs) wrote :

Status = NEW after nearly 5 years?
I have the same problem

Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3442457, emitted seq=3442459
Aug 15 14:18:19 nb-tz kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2628 thread gnome-shel:cs0 pid 2679

In , priit (priit-linux-kernel-bugs) wrote :

AMD Vega 64 (vega10 chip)
kernel: 6.4.9

linux-firmware: 20230724

# graphical session died and I had to log in again; the computer didn't reboot though...
aug 20 02:11:06 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=368426139, emitted seq=368426141
aug 20 02:11:06 Zen kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 414636 thread firefox:cs0 pid 414712

linux-firmware: 20230810 (upgraded it... although there were no "vega10" changes in between)

# just a freeze for about 30 s, and then it got unstuck again.
aug 23 23:09:24 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:60:crtc-0] hw_done or flip_done timed out
aug 23 23:09:34 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:63:crtc-1] hw_done or flip_done timed out
aug 23 23:09:44 Zen kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:66:crtc-2] hw_done or flip_done timed out

In , graham.oconnor (graham.oconnor-linux-kernel-bugs) wrote :

AMD Ryzen 3700U APU (Vega 10)

This issue has recently started happening, mostly when firing up games or graphically intensive tasks. One case of lockup during normal desktop use.

Worked fine on 6.4.X series (currently running on 6.4.12). However, all kernels in the 6.5 series cause the following:

[ 112.727138] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=9861, emitted seq=9863
[ 112.728214] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xwayland pid 919 thread Xwayland:cs0 pid 928
[ 112.729270] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!
[ 112.885652] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
[ 112.885709] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 112.886024] [drm] PCIE GART of 1024M enabled.
[ 112.886027] [drm] PTB located at 0x000000F400A00000
[ 112.886143] [drm] PSP is resuming...
[ 112.906168] [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
[ 112.985033] amdgpu 0000:04:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 112.992320] amdgpu 0000:04:00.0: amdgpu: RAP: optional rap ta ucode is not available
[ 113.733685] [drm] kiq ring mec 2 pipe 1 q 0
[ 113.998619] amdgpu 0000:04:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
[ 113.999249] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
[ 113.999957] amdgpu 0000:04:00.0: amdgpu: GPU reset(2) failed
[ 114.000006] amdgpu 0000:04:00.0: amdgpu: GPU reset end with ret = -110
[ 114.000010] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
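
The log above shows a reset that reports success but then fails during resume. A rough Python sketch for summarizing such sequences follows; `summarize_resets` is a hypothetical helper that only looks for the message strings visible in this report. For reference, errno 110 on Linux x86 is ETIMEDOUT, which is consistent with the "-110" in these messages.

```python
import re

RESET_BEGIN = "GPU reset begin!"
RESET_OK = "GPU reset succeeded"
RESET_MODE_RE = re.compile(r"amdgpu: (MODE\d) reset")
RECOVERY_FAILED_RE = re.compile(r"GPU Recovery Failed: (-?\d+)")


def summarize_resets(lines):
    """Group log lines into reset attempts, one dict per 'GPU reset begin!'."""
    resets = []
    current = None
    for line in lines:
        if RESET_BEGIN in line:
            current = {"mode": None, "reset_ok": False, "recovery_error": None}
            resets.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first reset begins
        mode = RESET_MODE_RE.search(line)
        failed = RECOVERY_FAILED_RE.search(line)
        if mode:
            current["mode"] = mode.group(1)
        elif RESET_OK in line:
            current["reset_ok"] = True
        elif failed:
            current["recovery_error"] = int(failed.group(1))
    return resets


sample_log = [
    "[ 112.729270] amdgpu 0000:04:00.0: amdgpu: GPU reset begin!",
    "[ 112.885652] amdgpu 0000:04:00.0: amdgpu: MODE2 reset",
    "[ 112.885709] amdgpu 0000:04:00.0: amdgpu: GPU reset succeeded, trying to resume",
    "[ 113.999957] amdgpu 0000:04:00.0: amdgpu: GPU reset(2) failed",
    "[ 114.000010] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110",
]
for reset in summarize_resets(sample_log):
    print(reset)
```

A reset with `reset_ok` set and a non-empty `recovery_error` corresponds to the pattern seen here: the reset itself went through, but bringing the hardware back up did not.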

In , kcohar (kcohar-linux-kernel-bugs) wrote :

I can confirm this bug

Experiencing it on an AMD Ryzen 5 3500U (Vega 8), Fedora 39 beta, kernel 6.5.2.
Also on Arch (kernel 6.5.2).
No problems on Fedora 38 (kernel 6.2.x).

In my case it happens frequently with normal desktop use on Fedora and Arch.

Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_low timeout, signaled seq=10067, emitted seq=10069
Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process nautilus pid 5981 thread nautilus:cs0 pid 6173
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: MODE2 reset
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset succeeded, trying to resume
Sep 23 03:39:34 jackdaw kernel: [drm] PCIE GART of 1024M enabled.
Sep 23 03:39:34 jackdaw kernel: [drm] PTB located at 0x000000F400A00000
Sep 23 03:39:34 jackdaw kernel: [drm] PSP is resuming...
Sep 23 03:39:34 jackdaw kernel: [drm] reserve 0x400000 from 0xf47fc00000 for PSP TMR
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAS: optional ras ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: RAP: optional rap ta ucode is not available
Sep 23 03:39:34 jackdaw kernel: [drm] kiq ring mec 2 pipe 1 q 0
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset(2) failed
Sep 23 03:39:35 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset end with ret = -110
Sep 23 03:39:35 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
Sep 23 03:39:35 jackdaw kernel: [drm] Skip scheduling IBs!
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_high timeout, signaled seq=9114, emitted seq=9116
Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2206 thread gnome-shel:cs0 pid 2258
Sep 23 03:39:45 jackdaw kernel: amdgpu 0000:05:00.0: amdgpu: GPU reset begin!
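
Several reports in this thread name different processes (firefox, gnome-shell, Xwayland, nautilus) in the "Process information" line. A short Python sketch for tallying them across a journal is given here; `processes_in_timeouts` is a hypothetical name, and the named process is only the one whose job timed out, which does not by itself prove it caused the hang.

```python
import re
from collections import Counter

PROCESS_RE = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)"
)


def processes_in_timeouts(lines):
    """Count how often each process name appears in timeout reports."""
    counts = Counter()
    for line in lines:
        match = PROCESS_RE.search(line)
        if match:
            counts[match.group("process")] += 1
    return counts


journal = [
    "Sep 23 03:39:34 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "Process information: process nautilus pid 5981 thread nautilus:cs0 pid 6173",
    "Sep 23 03:39:45 jackdaw kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "Process information: process gnome-shell pid 2206 thread gnome-shel:cs0 pid 2258",
]
for name, count in processes_in_timeouts(journal).most_common():
    print(name, count)
```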

In , aros (aros-linux-kernel-bugs) wrote :

AMDGPU development is on its own bug tracker:

https://gitlab.freedesktop.org/drm/amd/-/issues

If you're still affected, check for existing bug reports and if there are none, please repost over there.

In , aspicer (aspicer-linux-kernel-bugs) wrote :

I have also been having this issue. It started occurring recently (last 2-3 months). No other changes.

Mostly lockups while gaming (yuzu), one lockup because of chrome.

I was able to fix this issue by switching from HDMI to DP or DVI.

In , kcohar (kcohar-linux-kernel-bugs) wrote :

Created attachment 305165
attachment-27613-0.html

In my case the fix was adding amdgpu.mcbp=0 to the kernel parameters.


In , aspicer (aspicer-linux-kernel-bugs) wrote :

(In reply to KC from comment #94)

Did you have it set to 1 previously? If not, I'm not sure if that was the silver bullet, because it looks like it defaults to 0. https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html

mcbp (int)

It is used to enable mid command buffer preemption. (0 = disabled (default), 1 = enabled)

In , kcohar (kcohar-linux-kernel-bugs) wrote :

Created attachment 305166
attachment-16816-0.html

The default is now -1.
https://unix.stackexchange.com/questions/756281/kernel-6-5-2-seems-to-have-amdgpu-crash-on-no-retry-page-fault
https://www.kernel.org/doc/html/v6.5/gpu/amdgpu/module-parameters.html

I set it to zero and I haven't had a single crash since (Fedora 39 beta,
Linux 6.5.5).
The change to this one parameter's default had made my system entirely
unusable (it would crash very quickly after booting).
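
To check which value is in effect on a given machine, the parameter can usually be read back from sysfs. The Python sketch below assumes the conventional `/sys/module/<module>/parameters/<name>` layout and that the file is readable; `read_module_param` and `describe_mcbp` are hypothetical helpers, and the wording for -1 is an interpretation rather than a quote from this thread.

```python
from pathlib import Path


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return the parameter value as a string, or None if it cannot be read."""
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


def describe_mcbp(value):
    """Translate an mcbp value into the meanings discussed in this thread."""
    meanings = {
        "0": "disabled",
        "1": "enabled",
        # Assumption: -1 is reported above as the new default; describing it
        # as "driver decides" is an interpretation, not quoted documentation.
        "-1": "default (driver decides)",
    }
    if value is None:
        return "amdgpu parameter not readable"
    return meanings.get(value, f"unrecognized value {value!r}")


value = read_module_param("amdgpu", "mcbp")
print("amdgpu.mcbp =", value, "->", describe_mcbp(value))
```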


Pirouette Cacahuète (lissyx) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Pirouette Cacahuète (lissyx) wrote :
Erich Eickmeyer (eeickmeyer) wrote (last edit ):

Working with Pirouette on IRC, we determined this may be related to https://bugzilla.kernel.org/show_bug.cgi?id=201957#c94 in which the solution, sadly, was to add amdgpu.mcbp=0 to the kernel boot parameters. Per that bug report, it does appear as though this might be the result of a regression in the 6.5 kernel as they did not experience this issue in prior kernels or Ubuntu 23.04.

They also found mentions of https://gitlab.freedesktop.org/drm/amd/-/issues/2848 where Kernel 6.6 has a fix which we could pull a patch from, and we might have a patch for mesa at https://gitlab.freedesktop.org/drm/amd/-/issues/2848#note_2095536.
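
For those applying the workaround on Ubuntu, the parameter normally goes into `GRUB_CMDLINE_LINUX_DEFAULT` in /etc/default/grub, followed by `sudo update-grub` and a reboot. The Python sketch below only transforms the file text and writes nothing; `with_kernel_param` is a hypothetical helper, the sample file content is an assumption, and only the double-quoted single-line form of the variable is handled.

```python
import re

GRUB_LINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")\s*$')


def with_kernel_param(grub_text, param):
    """Return grub_text with `param` set exactly once in the default cmdline."""
    key = param.split("=", 1)[0]
    out = []
    for line in grub_text.splitlines():
        match = GRUB_LINE_RE.match(line)
        if match:
            # Drop any earlier setting of the same key, then append the new one.
            options = [
                opt for opt in match.group(2).split()
                if opt.split("=", 1)[0] != key
            ]
            options.append(param)
            line = match.group(1) + " ".join(options) + match.group(3)
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed sample content; a real /etc/default/grub has more settings.
sample_grub = 'GRUB_DEFAULT=0\nGRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(with_kernel_param(sample_grub, "amdgpu.mcbp=0"))
```

Because an existing setting of the same key is replaced rather than duplicated, running the function twice yields the same text.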

Mario Limonciello (superm1) wrote :

6.5.6 has the fix for the preemption issue; it should get fixed when stable updates land in Mantic.

Pirouette Cacahuète (lissyx) wrote :

Thanks, I'll try and keep you updated, however I am also facing bug 2039958 (probably a dupe of bug 2034619), so I might still need GNOME 45.1 to be released.

In , jer (jer-linux-kernel-bugs) wrote :

Hello, I'm having this same issue with my ThinkPad Z16 laptop, with a Ryzen 6850H and a Radeon RX 6500M graphics card.

I do not use the laptop for gaming but for audio and video editing. I have not had trouble with any video editing software but I can easily reproduce the issue by loading up Ardour or Mixbus32C and either leaving it alone or working. After 15 minutes the screen freezes although audio will continue for a time. At this point Ardour or Mixbus will close and I can continue using the machine. If I load up either program again it will fail again, usually within a couple minutes and the whole laptop will freeze up until I ctrl-alt-F2 to get to a terminal prompt.

The issue always happens when I'm recording audio with an HDMI device attached, and 90% of the time without HDMI.

I will attempt to set this kernel parameter amdgpu.mcbp=0 and report back.

In , jer (jer-linux-kernel-bugs) wrote :

(In reply to jeremy boyd from comment #97)

I can confirm that this did not solve my problem. I tested my system for several hours with no issue and thought that perhaps it had been solved, but while doing a LibreOffice presentation with my audio software running it happened again. Here is the error from journalctl:

Oct 22 09:40:01 fedora kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=433823, emitted seq=433825
Oct 22 09:40:01 fedora kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2189 thread Xorg:cs0 pid 2319
Oct 22 09:40:01 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset begin!
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: MODE2 reset
Oct 22 09:40:02 fedora kernel: amdgpu 0000:67:00.0: amdgpu: GPU reset succeeded, trying to resume

In , mario.limonciello (mario.limonciello-linux-kernel-bugs) wrote :

#98

amdgpu.mcbp=0 will only help GFX9 products. On GFX10 this is a different problem; please open an issue at the AMD GitLab.
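
One rough way to tell which case a log belongs to is to look for the `gfx_v<major>_<minor>` names that appear in some of the messages in this report (for example `gfx_v9_0` and `gfx_v10_0_hw_fini`). The Python sketch below is only a heuristic under that assumption; `gfx_generations` and `mcbp_workaround_applies` are hypothetical helpers, and a log without such names yields no answer.

```python
import re

GFX_IP_RE = re.compile(r"gfx_v(\d+)_\d+")


def gfx_generations(lines):
    """Return the sorted GFX major versions mentioned anywhere in `lines`."""
    found = set()
    for line in lines:
        for major in GFX_IP_RE.findall(line):
            found.add(int(major))
    return sorted(found)


def mcbp_workaround_applies(lines):
    """True for GFX9-only logs, False otherwise, None when nothing is found."""
    generations = gfx_generations(lines)
    if not generations:
        return None
    return generations == [9]


gfx9_log = [
    "[drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* "
    "resume of IP block <gfx_v9_0> failed -110",
]
gfx10_log = ["[drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed"]

print(mcbp_workaround_applies(gfx9_log))
print(mcbp_workaround_applies(gfx10_log))
```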

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mesa (Ubuntu):
status: New → Confirmed
Pirouette Cacahuète (lissyx) wrote :

There's a 6.5.0-15 package incoming in mantic-updates; does it contain the fix?

Timo Aaltonen (tjaalton) wrote :

No, -17 does.
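
To check a running Mantic kernel against that answer, the ABI number in the release string can be compared with 6.5.0-17. This Python sketch makes a few assumptions: `release_key` and `has_fix` are hypothetical helpers, the release string follows the `6.5.0-N-flavour` pattern seen in this report, and the comparison is only meaningful within the same Ubuntu kernel series.

```python
import re

RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)")


def release_key(release):
    """Turn '6.5.0-9-generic' into a comparable tuple such as (6, 5, 0, 9)."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unexpected kernel release string: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release, first_fixed="6.5.0-17"):
    """True if `release` is at or past the first build said to carry the fix."""
    return release_key(release) >= release_key(first_fixed)


for release in ("6.5.0-9-generic", "6.5.0-15-generic", "6.5.0-17-generic"):
    print(release, has_fix(release))
```

On a live system the release string could come from `platform.release()`, which reports the same value as `uname -r`.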
