Support for Linux kernel >= 6.8.0-44 #171
base: master
Switch the default mes to uni mes for gfx v12. V2: remove uni_mes set for gfx v11. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Jack Xiao <[email protected]>
Enable mmhub and athub cg on gc 12.0.0 Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Enable GFXOFF for GC v12.0.0. Signed-off-by: Likun Gao <[email protected]> Reviewed-by: Kenneth Feng <[email protected]>
add pp_dpm_dcefclk for smu 14.0.2/3 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]>
use mc address for wptr in add queue packet Signed-off-by: Frank Min <[email protected]> Reviewed-by: Jack Xiao <[email protected]> Acked-by: Alex Deucher <[email protected]>
gfx12 query video mem channel/type/width from umc_info of atom list, so fix it accordingly. Signed-off-by: Frank Min <[email protected]> Reviewed-by: Hawking Zhang <[email protected]> Acked-by: Alex Deucher <[email protected]>
disable gpo temporarily since it is not ready in fw Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]>
create a new helper function to avoid compiler 'side-effect' check about RAS_EVENT_LOG() macro. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
gpu_id needs to be unique for user space to identify GPUs via the KFD interface. In the current implementation there is a very small probability of having non-unique gpu_ids. v2: Add a check to confirm that gpu_id is unique. If not unique, find one. Changed commit header to reflect the above. v3: Use crc16 as suggested-by: Lijo Lazar <[email protected]>. Ensure that gpu_id != 0. Signed-off-by: Harish Kasiviswanathan <[email protected]> Reviewed-by: Lijo Lazar <[email protected]> Reviewed-by: Felix Kuehling <[email protected]>
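A minimal sketch of the idea in the commit message above: hash some device properties with a CRC-16, then bump the result until it is non-zero and not already taken. This is not the KFD implementation. The names `crc16_sketch`, `id_in_use`, and `pick_gpu_id`, the choice of the reflected 0xA001 polynomial with a zero initial value, and the "increment on collision" strategy are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 (reflected polynomial 0xA001, initial value 0).
 * Assumption: the variant actually used by the driver may differ. */
static uint16_t crc16_sketch(const uint8_t *data, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (uint16_t)((crc & 1) ? (crc >> 1) ^ 0xA001 : crc >> 1);
	}
	return crc;
}

/* Linear scan over the ids already handed out (hypothetical bookkeeping). */
static bool id_in_use(const uint16_t *used, size_t n_used, uint16_t id)
{
	for (size_t i = 0; i < n_used; i++)
		if (used[i] == id)
			return true;
	return false;
}

/* Returns a non-zero id that is not in 'used', or 0 if all 65535 ids are taken. */
static uint16_t pick_gpu_id(const uint8_t *props, size_t len,
			    const uint16_t *used, size_t n_used)
{
	uint16_t id = crc16_sketch(props, len);

	for (uint32_t tries = 0; tries < 0x10000; tries++) {
		if (id != 0 && !id_in_use(used, n_used, id))
			return id;
		id++; /* wraps from 0xFFFF back to 0, which is then skipped */
	}
	return 0;
}
```

With this sketch, a collision on the hashed value simply yields the next free id, and an all-zero hash is never returned as a valid id.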
Fix up parameter descriptions. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
To catch GPU mapping of system memory, TTM_PL_TT and AMDGPU_PL_PREEMPT must be checked. Fixes: 7c06cc729edc ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <[email protected]> Reviewed-by: Felix Kuehling <[email protected]>
Fix up parameter descriptions. Reviewed-by: Hawking Zhang <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
GFX1201 was missed in the commit below. Adding it in. Fixes: 7c06cc729edc ("drm/amdkfd: mark GFX12 system and peer GPU memory mappings as MTYPE_NC") Signed-off-by: Sreekant Somasekharan <[email protected]> Reviewed-by: Alex Deucher <[email protected]>
add module parameter for jpeg. this is a temporary workaround for jpeg unit test fail on vcn 5.0 now. will be removed later. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Sonny Jiang <[email protected]>
support pp_dpm_pcie on smu v14.0.2/3 Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]>
When the user sets an interval shorter than what the driver can handle, a soft lockup arises. Clear this soft lockup by adding a schedule before triggering a new host trap.
[ 2896.405488] watchdog: BUG: soft lockup - CPU#22 stuck for 26s! [pcs_130:38057]
[ 2896.405676] Supported: No, Unsupported modules are loaded
[ 2896.405678] CPU: 22 PID: 38057 Comm: pcs_130 Kdump: loaded Tainted: G OE X N 5.14.21-150500.55.59-default #1 SLE15-SP5 3a8569df5696e57cdcb648c7e890af33bdc23f85
[ 2896.405683] Hardware name: Dell Inc. PowerEdge R7525/0590KW, BIOS 2.6.6 01/13/2022
[ 2896.405684] RIP: 0010:amdgpu_device_rreg.part.42+0x57/0x1d0 [amdgpu]
[ 2896.405978] Code: 6f 4c 9c 00 4c 8b 83 b8 08 00 00 4d 01 e0 85 c9 74 15 65 48 8b 04 25 00 1c 02 00 3b 88 b8 09 00 00 0f 85 52 01 00 00 41 8b 28 <8b> 05 43 4c 9c 00 85 c0 74 56 65 48 8b 14 25 00 1c 02 00 39 82 b8
[ 2896.405981] RSP: 0018:ffffb7a6ecc33e30 EFLAGS: 00000246
[ 2896.405984] RAX: ffff949389f18000 RBX: ffff94d3d1100000 RCX: 00000000000094a9
[ 2896.405985] RDX: 0000000000000000 RSI: 0000000000002376 RDI: ffff94d3d1100000
[ 2896.405987] RBP: 0000000000000000 R08: ffffb7a6e2b88dd8 R09: ffff94d30e3b1f14
[ 2896.405989] R10: ffffb7a6c0427d88 R11: ffffb7a6ecc33c80 R12: 0000000000008dd8
[ 2896.405990] R13: 0000000000002376 R14: ffff94d30e3b1f14 R15: ffff94d3d1100000
[ 2896.405992] FS: 0000000000000000(0000) GS:ffff9512ff580000(0000) knlGS:0000000000000000
[ 2896.405994] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2896.405996] CR2: 00007f5b2b732000 CR3: 0000006a1ee10003 CR4: 0000000000770ee0
[ 2896.405998] PKRU: 55555554
[ 2896.405999] Call Trace:
[ 2896.406004] <TASK>
[ 2896.406007] kgd_gfx_v9_trigger_pc_sample_trap+0x1d6/0x4f0 [amdgpu 75bb93fc913928fc00917a1c71d5c2dca258175d]
Signed-off-by: James Zhu <[email protected]>
Tested-by: Vladimir Indic <[email protected]>
Reviewed-by: Vladimir Indic <[email protected]>
When host trap pc sampling is activated, the command bus from SPI/SQG to SQ may conflict with SQ internal clock gating, so a large number of host trap commands will trigger a qcm fence timeout. Signed-off-by: James Zhu <[email protected]> Tested-by: Vladimir Indic <[email protected]> Reviewed-by: Vladimir Indic <[email protected]>
Signed-off-by: Asher Song <[email protected]>
Signed-off-by: Asher Song <[email protected]>
The is_smca_umc_v2 function never appears in the upstream kernel, so the macro HAVE_SMCA_UMC_V2 is always undefined, which causes MCE notifications to go unhandled on the MI200 A+A platform. So we drop the macro HAVE_SMCA_UMC_V2. On the other hand, on CentOS 7.9, SMCA_UMC_V2 is not defined in arch/x86/include/asm/mce.h; we don't care about umc_v2 error notifications on CentOS 7.9. Signed-off-by: Asher Song <[email protected]> Reviewed-by: Flora Cui <[email protected]> Reviewed-by: Bob Zhou <[email protected]>
When redefining HAVE_SMCA_UMC_V2, the fake function smca_get_bank_type is called by amdgpu_bad_page_notifier. However, the original fake function cannot be referenced in an in-tree build, as it is defined in the amdkcl modules. So we make a macro for the fake function in backport/kcl_mce.h. Signed-off-by: Asher Song <[email protected]> Reviewed-by: Flora Cui <[email protected]> Reviewed-by: Bob Zhou <[email protected]>
There is a typo in patch drm/amdkcl: fake smca_get_bank_type, fix it Signed-off-by: Asher Song <[email protected]> Reviewed-by: Lijo Lazar <[email protected]>
The parameters segment_width and last_segment_width are used to control the configuration of the Output Plane Processor (OPP), specifically the width of each segment that the display is divided into and the width of the last segment Fixes the below with gcc W=1: drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Function parameter or struct member 'last_segment_width' not described in 'optc35_set_odm_combine' drivers/gpu/drm/amd/amdgpu/../display/dc/optc/dcn35/dcn35_optc.c:59: warning: Excess function parameter 'timing' description in 'optc35_set_odm_combine' Cc: Tom Chung <[email protected]> Cc: Rodrigo Siqueira <[email protected]> Cc: Roman Li <[email protected]> Cc: Aurabindo Pillai <[email protected]> Signed-off-by: Srinivasan Shanmugam <[email protected]> Reviewed-by: Tom Chung <[email protected]>
Align with new port same as smu 13.x. Signed-off-by: Kenneth Feng <[email protected]> Reviewed-by: Jack Gui <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
Update the capabilities for supporting 8k encoding. Reviewed-by: David (Ming Qiang) Wu <[email protected]> Acked-by: Alex Deucher <[email protected]> Signed-off-by: Ruijing Dong <[email protected]>
The following commit updated gmc->noretry from 0 to 1 for GC HW IP 9.3.0: commit 5f3854f ("drm/amdgpu: add more cases to noretry=1") This causes the device to hang when a page fault occurs, until the device is rebooted. Instead, revert back to gmc->noretry=0 so the device is still responsive. Fixes: 5f3854f ("drm/amdgpu: add more cases to noretry=1") Signed-off-by: Tim Van Patten <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
./drivers/gpu/drm/amd/amdgpu/amdgpu.h: amdgpu_umsch_mm.h is included more than once. Reported-by: Abaci Robot <[email protected]> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9063 Signed-off-by: Jiapeng Chong <[email protected]> Signed-off-by: Alex Deucher <[email protected]>
…ing_set_wptr
This commit removes a duplicate check for *is_queue_unmap in the sdma_v7_0_ring_set_wptr function. The check at line 171 was considered dead code because at this point in the code, we already know that *is_queue_unmap is false due to the check at line 161. Removing this unnecessary check improves the readability of the code.
Fixes the below:
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c:171 sdma_v7_0_ring_set_wptr() warn: duplicate check '*is_queue_unmap' (previous on line 161)
drivers/gpu/drm/amd/amdgpu/sdma_v7_0.c
    140 static void sdma_v7_0_ring_set_wptr(struct amdgpu_ring *ring)
    141 {
    142         struct amdgpu_device *adev = ring->adev;
    143         uint32_t *wptr_saved;
    144         uint32_t *is_queue_unmap;
    145         uint64_t aggregated_db_index;
    146         uint32_t mqd_size = adev->mqds[AMDGPU_HW_IP_DMA].mqd_size;
    147
    148         DRM_DEBUG("Setting write pointer\n");
    149
    150         if (ring->is_mes_queue) {
    151                 wptr_saved = (uint32_t *)(ring->mqd_ptr + mqd_size);
    152                 is_queue_unmap = (uint32_t *)(ring->mqd_ptr + mqd_size +
                        ^^^^^^^^^^^^^^^^ Set here
    153                                               sizeof(uint32_t));
    154                 aggregated_db_index =
    155                         amdgpu_mes_get_aggregated_doorbell_index(adev,
    156                                                          ring->hw_prio);
    157
    158                 atomic64_set((atomic64_t *)ring->wptr_cpu_addr,
    159                              ring->wptr << 2);
    160                 *wptr_saved = ring->wptr << 2;
    161                 if (*is_queue_unmap) {
                            ^^^^^^^^^^^^^^^ Checked here
    162                         WDOORBELL64(aggregated_db_index, ring->wptr << 2);
    163                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    164                                         ring->doorbell_index, ring->wptr << 2);
    165                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    166                 } else {
    167                         DRM_DEBUG("calling WDOORBELL64(0x%08x, 0x%016llx)\n",
    168                                         ring->doorbell_index, ring->wptr << 2);
    169                         WDOORBELL64(ring->doorbell_index, ring->wptr << 2);
    170
--> 171                         if (*is_queue_unmap)
                                    ^^^^^^^^^^^^^^^ This is dead code. We know it's false.
    172                                 WDOORBELL64(aggregated_db_index,
    173                                             ring->wptr << 2);
    174                 }
    175         } else {
    176                 if (ring->use_doorbell) {
    177                         DRM_DEBUG("Using doorbell -- "
    178                                   "wptr_offs == 0x%08x "
Fixes: 6d9c711786e6 ("drm/amdgpu: Add sdma v7_0 ip block support (v7)")
Cc: Likun Gao <[email protected]>
Cc: Hawking Zhang <[email protected]>
Cc: Christian König <[email protected]>
Cc: Alex Deucher <[email protected]>
Reported-by: Dan Carpenter <[email protected]>
Signed-off-by: Srinivasan Shanmugam <[email protected]>
Reviewed-by: Likun Gao <[email protected]>
Reviewed-by: Asad Kamal <[email protected]>
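The control flow that remains once the duplicate check is dropped can be sketched as follows. This is a simplified model, not the driver code: `struct doorbell_log` and `set_wptr_mes_sketch` are hypothetical stand-ins that count doorbell writes instead of calling WDOORBELL64, and the flag is passed by value rather than read through a pointer.

```c
#include <stdint.h>

/* Hypothetical stand-in that counts doorbell writes instead of touching hardware. */
struct doorbell_log {
	int aggregated; /* writes to the aggregated doorbell */
	int ring;       /* writes to the ring's own doorbell */
};

/* Models only the MES-queue branch of the function quoted above. */
static void set_wptr_mes_sketch(uint32_t is_queue_unmap, struct doorbell_log *log)
{
	if (is_queue_unmap) {
		log->aggregated++;
		log->ring++;
	} else {
		/* The flag is known to be zero on this path, so a second
		 * test of it could never succeed; only the ring doorbell
		 * is written. */
		log->ring++;
	}
}
```

In this model the aggregated doorbell is written only when the flag is set, which matches the behavior of the listing both before and after the dead check is removed.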
modify the lock type to 'spinlock' to avoid schedule issue in interrupt context. Signed-off-by: Yang Wang <[email protected]> Reviewed-by: Tao Zhou <[email protected]>
Add support to set/get information about different DPM policies. The support is only available on SOCs which use swsmu architecture. A DPM policy type may be defined with different levels. For example, a policy may be defined to select Pstate preference and then later a pstate preference may be chosen. Signed-off-by: Lijo Lazar <[email protected]> Acked-by: Alex Deucher <[email protected]> Reviewed-by: Asad Kamal <[email protected]>
Per firmware's requirement, replace mode2 with mode1. Signed-off-by: Tao Zhou <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
GFX v9.4.3 uses mode1 reset, other ASICs choose mode2. Signed-off-by: Tao Zhou <[email protected]> Acked-by: Lijo Lazar <[email protected]>
Since it is not stable on stress test. Signed-off-by: James Zhu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Vladimir Indic <[email protected]> Tested-by: Vladimir Indic <[email protected]>
…rarily not for upstream. -v2: fix typo -v3: rename kfd_ioctl_pc_sample_args "reserved" to "version" Signed-off-by: James Zhu <[email protected]> Reviewed-by: Felix Kuehling <[email protected]> Reviewed-by: Vladimir Indic <[email protected]> Tested-by: Vladimir Indic <[email protected]>
This reverts commit 6ac6a32. The fixed issue has disappeared, so revert the workaround. Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Jingwen Chen <[email protected]>
This reverts commit 4ff45ec. The fixed issue has disappeared, so revert the workaround. Signed-off-by: Bob Zhou <[email protected]> Reviewed-by: Jingwen Chen <[email protected]>
We send back the ready to reset message before we stop anything. This is wrong. Move it to when we are actually ready for the FLR to happen. In the current state since we take tens of seconds to stop everything, it is very likely that host would give up waiting and reset the GPU before we send ready, so it would be the same as before. But this gets rid of the hack with reset_domain locking and also let us tell how slow ready to reset actually is from the host. The ready to reset speed can be improved later. Signed-off-by: Yunxiang Li <[email protected]> Acked-by: Christian König <[email protected]> Reviewed-by: Emily Deng <[email protected]>
…dapter Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang Luo <[email protected]>
For RAS error scenario, VF guest driver will check mailbox and set fed flag to avoid unnecessary HW accesses. additionally, poll for reset completion message first to avoid accidentally spamming multiple reset requests to host. v2: add another mailbox check for handling case where kfd detects timeout first v3: set host_flr bit and use wait_for_reset Signed-off-by: Vignesh Chander <[email protected]> Reviewed-by: Zhigang Luo <[email protected]>
Flag "mes.ring.sched.ready" is set to true after mes hw init and set to false on mes hw fini to avoid duplicate initialization. But hw fini is not called on function level reset, which causes mes hw init to be skipped during FLR, which in turn makes mapping the legacy queue fail. Setting this flag to false on post reset fixes this issue. Signed-off-by: Lin.Cao <[email protected]> Acked-by: Alex Deucher <[email protected]>
Accessing registers via host is missing the check for skip_hw_access and the lockdep check that comes with it. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]>
is_hws_hang and is_resetting serve pretty much the same purpose and both duplicate the work of the reset_domain lock, so just check that directly instead. This also eliminates a few bugs listed below and gets rid of dqm->ops.pre_reset. kfd_hws_hang did not need to avoid scheduling another reset. If the ongoing reset decided to skip GPU reset we have a bad time, otherwise the extra reset will get cancelled anyway. remove_queue_mes forgot to check the is_resetting flag compared to the pre-MES path unmap_queue_cpsch, so it did not block hw access during reset correctly. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Felix Kuehling <[email protected]>
At this point the gart is not set up, there's no point to invalidate tlb here and it could even be harmful. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]>
When amdgpu_gart_invalidate_tlb helper is introduced this part was left out of the conversion. Avoid the code duplication here. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]>
Which method is used to flush tlb does not depend on whether a reset is in progress or not. We should skip the flush altogether if the GPU will get reset. So put both paths under the reset_domain read lock. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]> CC: [email protected]
We need to take the reset domain lock before flush hdp. We can't put the lock inside amdgpu_device_flush_hdp itself because it is used during reset where we already take the write side lock. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]>
We need to take the reset domain lock before talking to MES. While in this case we can take the lock inside the mes helper. We can't do so for most other mes helpers since they are used during reset. So for consistency sake we add the lock here. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Felix Kuehling <[email protected]>
Here we are in reset and have already taken the reset_domain write side lock, so we can't use the flush tlb helper, which tries to take the read side. Signed-off-by: Yunxiang Li <[email protected]> Reviewed-by: Christian König <[email protected]>
This reverts commit d409c20. The commit is a partial revert that left things broken, also this was never ported back to drm-next. This revert is needed by patch series https://gerrit-git.amd.com/c/brahma/ec/linux/+/1068977
Add support to tune phase detect parameters. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Add debugfs nodes for enabling/disabling and tuning parameters used in phase detect. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Add support for enabling phase detect and tuning params for SMUv13.0.6 SOCs. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
Phase detect controls are only available for SMUv13.0.6 dGPUs. Create control object only on those. Signed-off-by: Lijo Lazar <[email protected]> Reviewed-by: Feifei Xu <[email protected]> Reviewed-by: Hawking Zhang <[email protected]>
…or Linux kernel >= 6.8.0-44
I wonder if this is an Ubuntu-specific problem: they chose the 6.8 kernel (sadly not an LTS one), picked up some patches in -44, and thus broke their partner's code! The bug is also reported on Ubuntu, as it is their change that caused the bug in 6.8.0 (but this will need to be fixed for future kernels): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2080823
Thanks for submitting this, though I wonder if this will still work with distributions outside of Ubuntu. I admit I don't know much about kernel development.
torvalds/linux@5a507b7 is where the change happened. It is additionally a security issue, see CVE-2024-39498, which makes me think this will be backported to older kernels. The real question is what value of
We've got a KCL-based solution coming in the next release. I'll leave this open for now in case people want it as a workaround.
Related ROCm/ROCm#3701 and thanks to @kswit for the reference and @alain-bkr for the solution.
I have guarded the code so it doesn't break older kernel versions.
Checking https://packages.ubuntu.com/noble/all/linux-headers-6.8.0-41/download, particularly the Makefile, it is not clear what the version is. I can add a runtime uname usage if you like. On my 6.8.0-41, for example, /usr/include/linux/version.h contains #define LINUX_VERSION_CODE 395276. In Alpine in Docker with 6.6-r0 of the linux-headers apk on that same base, I get #define LINUX_VERSION_CODE 394752. Pretty sure this version restriction in the PR is sufficient, but keep this comment in mind; happy to change the version.
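To make the two LINUX_VERSION_CODE values above easier to read, here is a small sketch that unpacks them. `KVER`, `struct kver`, `kver_decode`, and `kver_print` are hypothetical helper names; `KVER` mirrors the usual `(a << 16) + (b << 8) + c` packing of the kernel's KERNEL_VERSION() macro (recent headers additionally clamp the third field at 255, which this sketch ignores).

```c
#include <stdio.h>

/* Same packing as KERNEL_VERSION() for patch levels up to 255. */
#define KVER(a, b, c) (((a) << 16) + ((b) << 8) + (c))

struct kver {
	unsigned int major, minor, patch;
};

/* Split a packed version code back into its three fields. */
static struct kver kver_decode(unsigned int code)
{
	struct kver v = { code >> 16, (code >> 8) & 0xff, code & 0xff };
	return v;
}

/* Print a packed version code next to its decoded form. */
static void kver_print(unsigned int code)
{
	struct kver v = kver_decode(code);

	printf("%u -> %u.%u.%u\n", code, v.major, v.minor, v.patch);
}
```

Decoded this way, 395276 is 6.8.12 and 394752 is 6.6.0, so a guard of the form `LINUX_VERSION_CODE >= KERNEL_VERSION(6, 8, 0)` is true for the first and false for the second. The packing has only three fields, so a distribution suffix such as -41 or -44 is not represented in it.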