Skip to content

Commit

Permalink
drm/sched: Partial revert of 'drm/sched: Keep s_fence->parent pointer'
Browse files Browse the repository at this point in the history
Problem:
This patch caused negative refcount as described in [1] because
for that case parent fence did not signal by the time of drm_sched_stop and hence
kept in pending list the assumption was they will not signal and
so fence was put to account for the s_fence->parent refcount but for
amdgpu which has embedded HW fence (always same parent fence)
drm_sched_fence_release_scheduled was always called and would
still drop the count for parent fence once more. For jobs that
never signaled this imbalance was masked by refcount bug in
amdgpu_fence_driver_clear_job_fences that would not drop
refcount on the fences that were removed from fence drive
fences array (against prevois insertion into the array in
get in amdgpu_fence_emit).

Fix:
Revert this patch and by setting s_job->s_fence->parent to NULL
as before prevent the extra refcount drop in amdgpu when
drm_sched_fence_release_scheduled is called on job release.

Also - align behaviour in drm_sched_resubmit_jobs_ext with that of
drm_sched_main when submitting jobs - take a refcount for the
new parent fence pointer and drop refcount for original kref_init
for new HW fence creation (or fake new HW fence in amdgpu - see next patch).

[1] - https://lore.kernel.org/all/[email protected]/t/#r00c728fcc069b1276642c325bfa9d82bf8fa21a3

Signed-off-by: Andrey Grodzovsky <[email protected]>
Tested-by: Yiqing Yao <[email protected]>
Acked-by: Christian König <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
  • Loading branch information
Andrey Grodzovsky authored and alexdeucher committed Jun 28, 2022
1 parent 9e225fb commit 45ecaea
Showing 1 changed file with 10 additions and 3 deletions.
13 changes: 10 additions & 3 deletions drivers/gpu/drm/scheduler/sched_main.c
Original file line number Diff line number Diff line change
Expand Up @@ -419,6 +419,8 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
if (s_job->s_fence->parent &&
dma_fence_remove_callback(s_job->s_fence->parent,
&s_job->cb)) {
dma_fence_put(s_job->s_fence->parent);
s_job->s_fence->parent = NULL;
atomic_dec(&sched->hw_rq_count);
} else {
/*
Expand Down Expand Up @@ -548,7 +550,6 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)
if (found_guilty && s_job->s_fence->scheduled.context == guilty_context)
dma_fence_set_error(&s_fence->finished, -ECANCELED);

dma_fence_put(s_job->s_fence->parent);
fence = sched->ops->run_job(s_job);
i++;

Expand All @@ -558,7 +559,11 @@ void drm_sched_resubmit_jobs_ext(struct drm_gpu_scheduler *sched, int max)

s_job->s_fence->parent = NULL;
} else {
s_job->s_fence->parent = fence;

s_job->s_fence->parent = dma_fence_get(fence);

/* Drop for orignal kref_init */
dma_fence_put(fence);
}
}
}
Expand Down Expand Up @@ -955,14 +960,16 @@ static int drm_sched_main(void *param)

if (!IS_ERR_OR_NULL(fence)) {
s_fence->parent = dma_fence_get(fence);
/* Drop for original kref_init of the fence */
dma_fence_put(fence);

r = dma_fence_add_callback(fence, &sched_job->cb,
drm_sched_job_done_cb);
if (r == -ENOENT)
drm_sched_job_done(sched_job);
else if (r)
DRM_DEV_ERROR(sched->dev, "fence add callback failed (%d)\n",
r);
dma_fence_put(fence);
} else {
if (IS_ERR(fence))
dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
Expand Down

0 comments on commit 45ecaea

Please sign in to comment.