RX 5500 XT unstable on Ubuntu 20.10, crashes ([drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!)

I recently built an all-AMD system with a Ryzen 7 3700X CPU and an RX 5500 XT Phantom Gaming D GPU. I have an Aorus Pro Wifi motherboard and 32 GB of Trident Z Neo RAM, with XMP enabled.

I am running Ubuntu 20.10 with the 5.6.13-050613-generic kernel.

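I keep hitting a recurring problem where the amdgpu driver freezes GNOME and every window on screen, but not the mouse cursor. It takes a power cycle to recover, even though SSHing into the machine still works fine (so the kernel itself has not hung).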

Here is an excerpt of the kernel log from one such crash:

635:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
636:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:   in page starting at address 0x0000000000888000 from client 27
637:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041C50
638:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:          MORE_FAULTS: 0x0
639:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:          WALKER_ERROR: 0x0
640:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:          PERMISSION_FAULTS: 0x5
641:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:          MAPPING_ERROR: 0x0
642:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0:          RW: 0x1
645:May 17 16:29:19 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
646:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=10870, emitted seq=10872
647:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
648:May 17 16:29:19 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
649:May 17 16:29:21 arctic kernel: amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
654:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resuming...
655:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resumed successfully!
659:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
660:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
661:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
662:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
663:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
664:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
665:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
666:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
667:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
668:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
669:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 12 on hub 0
670:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 13 on hub 0
671:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
672:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
673:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
674:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
680:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
688:May 17 16:29:22 arctic /usr/lib/gdm3/gdm-x-session[2329]: amdgpu: amdgpu_cs_query_fence_status failed.
689:May 17 16:29:22 arctic gnome-shell[2678]: amdgpu: amdgpu_cs_query_fence_status failed.
709:May 17 16:33:23 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
728:May 17 16:39:00 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
852:May 17 16:49:44 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
917:May 17 20:12:32 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
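
In case it helps anyone comparing logs, here is a minimal Python sketch that pulls the ring timeout entries out of a log like the one above. The names `RING_TIMEOUT` and `ring_timeouts` are my own, and reading `emitted seq - signaled seq` as the number of fences still outstanding is my interpretation of the message, not something the driver documents in the log.

```python
import re

# Matches the "ring <name> timeout" lines printed by amdgpu_job_timedout above.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def ring_timeouts(lines):
    """Yield (ring, signaled, emitted, outstanding) for each ring timeout line."""
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m:
            signaled = int(m["signaled"])
            emitted = int(m["emitted"])
            # Assumption: the gap is the number of fences emitted but not yet signaled.
            yield m["ring"], signaled, emitted, emitted - signaled


sample = [
    "May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring sdma1 timeout, signaled seq=10870, emitted seq=10872",
    "May 17 16:29:19 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!",
]

for ring, signaled, emitted, outstanding in ring_timeouts(sample):
    print(f"{ring}: {outstanding} fence(s) emitted but not yet signaled")
```

For the excerpt above this reports sdma1 with two outstanding fences; the same function can be fed the lines of a full log file opened with `open(path)`.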

Here are some more logs (from multiple kernel versions; sorry, I'm not sure exactly which kernel each one came from):

https://pastebin.com/ADL0JvHB

https://pastebin.com/vSprRyRx

I have upgraded from kernel 5.4 to 5.5.19 and then to 5.6.13, and the problem persists.

Here is the crash log from after the monitor randomly disconnected (kernel 5.6.13):

May 18 02:30:57 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167698, emitted seq=167700
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:30:57 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
May 18 02:30:59 arctic kernel: amdgpu: [powerplay] failed send message: DisallowGfxOff (42) param: 0x00000000 response 0xffffffc2
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (II) event12 - Logitech MX Master 3000: SYN_DROPPED event - some input events have been lost.
May 18 02:31:02 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce: scheduled expiry is in the past (-194ms), your system is too slow
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
May 18 02:31:04 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:04 arctic kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
May 18 02:31:07 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:07 arctic kernel: [drm:amdgpu_device_gpu_recover.cold [amdgpu]] *ERROR* ASIC reset failed with error, -62 for drm dev, 0000:0b:00.0
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) failed
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset end with ret = -62
May 18 02:31:12 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce short: scheduled expiry is in the past (-5ms), your system is too slow
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167700, emitted seq=167700
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:31:17 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
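
The two logs differ mainly in whether the GPU reset completes, so a rough way to tally that across a long log might look like the sketch below. It is only an illustration: `reset_outcomes` is a hypothetical helper, and it assumes the reset messages keep the exact wording seen in the logs above ("GPU reset begin!", "GPU reset(1) succeeded!", "GPU reset(1) failed").

```python
import re

RESET_BEGIN = re.compile(r"GPU reset begin!")
# Final verdict lines, e.g. "GPU reset(1) succeeded!" or "GPU reset(1) failed".
RESET_DONE = re.compile(r"GPU reset\(\d+\) (?P<outcome>succeeded|failed)")


def reset_outcomes(lines):
    """Return one entry per 'GPU reset begin!' line:
    'succeeded', 'failed', or 'unknown' if no verdict followed it."""
    outcomes = []
    for line in lines:
        if RESET_BEGIN.search(line):
            outcomes.append("unknown")
            continue
        m = RESET_DONE.search(line)
        # Attach the verdict to the most recent reset that has none yet.
        if m and outcomes and outcomes[-1] == "unknown":
            outcomes[-1] = m["outcome"]
    return outcomes


sample = [
    "May 18 02:30:57 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!",
    "May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) failed",
    "May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset end with ret = -62",
    "May 18 02:31:17 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!",
]

print(reset_outcomes(sample))
```

On the lines above, the first reset is marked as failed and the second as unknown, since the excerpt ends before any verdict for it.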

I've set AMD_DEBUG=nodma,nongg, but it doesn't help. I could update the motherboard BIOS, although I'm only one version behind the latest, and that release only lists "Memory enhancements." I could also try the proprietary amdgpu-pro drivers instead of the open-source amdgpu driver, but I can't think of anything else. I've already tried three separate kernels. Does anyone have any ideas?

$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.6
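
If it is useful for comparing setups, the OpenGL and Mesa versions can be pulled out of that line with a few lines of Python. This is just a sketch: `parse_versions` is a made-up helper name, and it assumes the line has the same shape as the output above.

```python
import re

# Assumes a line shaped like the glxinfo output above:
# "OpenGL version string: <gl> (<profile>) Mesa <x.y.z>"
VERSION_LINE = re.compile(
    r"OpenGL version string: (?P<gl>\S+) .*Mesa (?P<mesa>\d+(?:\.\d+)*)"
)


def parse_versions(glxinfo_output):
    """Return (opengl_version, mesa_version_tuple), or None if no match."""
    m = VERSION_LINE.search(glxinfo_output)
    if m is None:
        return None
    mesa = tuple(int(part) for part in m["mesa"].split("."))
    return m["gl"], mesa


line = "OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.6"
print(parse_versions(line))
```

Returning the Mesa version as a tuple of integers makes it easy to compare against another release, for example `parse_versions(line)[1] >= (20, 1)`.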