Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disttask: Refactor taskExecutor slotManager to handle task occupation #49713

Merged
merged 10 commits into from
Dec 28, 2023

Conversation

okJiang
Copy link
Member

@okJiang okJiang commented Dec 22, 2023

What problem does this PR solve?

Issue Number: ref #49008

Problem Summary: we need free slots(resource) when higher task comes.

What changed and how does it work?

  • cancel the lower priority task when higher task comes, higher task is waiting slots
  • alloc slot to waiting task after lower task cancelled
  • can't alloc lower task when a task is waiting
  • higher task will occupy slots that the waiting task reserved, and waiting task reset

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 22, 2023
@okJiang okJiang changed the title disttask: Refactor slotManager to handle task occupation disttask: Refactor taskExecutor slotManager to handle task occupation Dec 22, 2023
Copy link

codecov bot commented Dec 22, 2023

Codecov Report

Merging #49713 (47a7e05) into master (b1e5d61) will increase coverage by 0.2138%.
Report is 68 commits behind head on master.
The diff coverage is 5.1724%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #49713        +/-   ##
================================================
+ Coverage   70.9732%   71.1871%   +0.2138%     
================================================
  Files          1368       1430        +62     
  Lines        398095     427368     +29273     
================================================
+ Hits         282541     304231     +21690     
- Misses        95813     104068      +8255     
+ Partials      19741      19069       -672     
Flag Coverage Δ
integration 43.8731% <5.1724%> (?)
unit 70.9732% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 53.9663% <ø> (ø)
parser ∅ <ø> (∅)
br 46.1315% <ø> (-6.7724%) ⬇️

@okJiang
Copy link
Member Author

okJiang commented Dec 22, 2023

/cc @D3Hunter @ywqzzy @tangenta

@@ -218,7 +218,9 @@ func (m *Manager) onRunnableTasks(tasks []*proto.Task) {
}
logutil.Logger(m.logCtx).Info("detect new subtask", zap.Int64("task-id", task.ID))

if !m.slotManager.canAlloc(task) {
canAlloc, tasksNeedFree := m.slotManager.canAlloc(task)
m.onCanceledTasks(context.Background(), tasksNeedFree)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed in group, this method is used to make the task as cancelled, not cancel the task-executor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, we only need to cancel the tasks that has lower order and occupies the slots we need to run current task, not cancel all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cancel(nil) can cancel the running task without modifying the task state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 202ea21

// priority will be used in future
priority int
slotCount int
taskWaitAlloc *proto.Task
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slot manager shouldn't manage which task is waiting, and not needed

Copy link
Member Author

@okJiang okJiang Dec 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean manage it in tastexecutor.Manager?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean manage it in tastexecutor.Manager?

No need to manage task since we fetch runnableTask in loop

})

usedSlots := 0
for _, slotInfo := range allSlotInfos {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check in backward to free tasks with lowest rank

return false, nil
}

allSlotInfos := make([]*proto.Task, 0, len(sm.executorSlotInfos))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can maintains this slice during alloc/free

usedSlots += slotInfo.Concurrency
if sm.available+usedSlots >= task.Concurrency {
sm.taskWaitAlloc = task
return false, tasksNeedFree
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should return true, tasksNeedFree, to make caller know that it can alloc after freeing slots taken by those tasks

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 26, 2023
@purelind
Copy link
Contributor

/test all

1 similar comment
@okJiang
Copy link
Member Author

okJiang commented Dec 26, 2023

/test all

@okJiang
Copy link
Member Author

okJiang commented Dec 27, 2023

/retest


// 4. task1 is released, task3 alloc success, start to run
ch <- context.Canceled
time.Sleep(time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't rely on time.sleep to wait free()

Copy link
Member Author

@okJiang okJiang Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time.Sleep waits here

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time is unreliable, still have chance of flaky test.

Copy link
Member Author

@okJiang okJiang Dec 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Time is unreliable, still have chance of flaky test.

For what you've said, I think it needs to be analyzed on a case-by-case basis. There's obviously a waiting for checkTime, and I don't see any problem with waiting(sleep). I think what I can do is that refactor it to require.Eventually. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pkg/disttask/framework/taskexecutor/slot.go Outdated Show resolved Hide resolved
pkg/disttask/framework/taskexecutor/slot.go Outdated Show resolved Hide resolved
pkg/disttask/framework/taskexecutor/slot.go Outdated Show resolved Hide resolved
pkg/disttask/framework/taskexecutor/slot_test.go Outdated Show resolved Hide resolved
pkg/disttask/framework/taskexecutor/slot_test.go Outdated Show resolved Hide resolved
break
}

if !canAlloc {
logutil.Logger(m.logCtx).Warn("subtask has been rejected", zap.Int64("task-id", task.ID))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be debug level or remove it, to avoid print too much

pkg/disttask/framework/taskexecutor/manager.go Outdated Show resolved Hide resolved
m.logErr(err)
return
} else if !exist {
continue
}
switch task.State {
case proto.TaskStateRunning:
if taskCtx.Err() != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems no need to print this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer to keep it. We can know some subtasks have been canceled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's already log at line 386

we also log in cancelTaskExecutors

Copy link
Member Author

@okJiang okJiang Dec 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated log level to debug

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just remove it

@okJiang okJiang requested review from D3Hunter and ywqzzy December 27, 2023 08:12
for _, task := range tasks {
logutil.Logger(m.logCtx).Info("cancelTasks", zap.Any("task_id", task.ID))
if cancel, ok := m.mu.handlingTasks[task.ID]; ok && cancel != nil {
// Pause all running subtasks, don't mark subtasks as canceled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the comments

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Pause all running subtasks, don't mark subtasks as canceled.
// Pause given subtasks temporarily, don't mark subtasks as canceled.

like this?

// Pause all running subtasks, don't mark subtasks as canceled.
// Should not change the subtask's state.
cancel(nil)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not m.slotManager.free(t.ID) here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

@ti-chi-bot ti-chi-bot bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 28, 2023
Copy link
Contributor

@D3Hunter D3Hunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest lgtm

pkg/disttask/framework/taskexecutor/manager_test.go Outdated Show resolved Hide resolved
Copy link

ti-chi-bot bot commented Dec 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 28, 2023
Copy link

ti-chi-bot bot commented Dec 28, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-12-28 07:42:26.25448349 +0000 UTC m=+1724437.291710418: ☑️ agreed by ywqzzy.
  • 2023-12-28 08:38:55.116359409 +0000 UTC m=+1727826.153586337: ☑️ agreed by D3Hunter.

@okJiang
Copy link
Member Author

okJiang commented Dec 28, 2023

/retest

@ti-chi-bot ti-chi-bot bot merged commit a1d2b50 into pingcap:master Dec 28, 2023
18 checks passed
@okJiang okJiang deleted the disttask-rc/cancel-subtasks branch December 28, 2023 09:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants