Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TaskGroupQueue will never be wakeup due to wakeup failed at one time #15528

Conversation

ruanwenjun
Copy link
Member

@ruanwenjun ruanwenjun commented Jan 28, 2024

Purpose of the pull request

  • Fix the Task which using TaskGroup will never be wakeup due to wakeup failed at one time.

Brief change log

In DS, we use TaskGroup to restrict the Task parallelism.
e.g. We have two task TaskA, TaskB, and they belong to the same task group, if we execute these two task at the same time, only one will get the taskGroup slot and the other will wait until be wakeup.
image
image
image

Unfortunately, if the TaskA wakeup TaskB failed, then TaskB will never be wakeup, since we use put mode, the waiting task will only be wakeup by other task, it will not try to pull the slot.

This PR will refactor this.
image

This PR add a new component TaskGroupCoordinator which use to deal with acquire/release task group slot and notify task.

Once a task need to acquire TaskGroup slot, it should call the TaskGroupCoordinator::acquireTaskGroupSlot, in this method, will insert a in queue TaskGroupQueue which status is WAITING.
Once a task is finished it will call TaskGroupCoordinator::releaseTaskGroupSlot, in this method, will move the TaskGroupQueue out queue.
image

And TaskGroupCoordinator will loop the slot, if there exist available slot, it will try to wakeup a waiting task.

Verify this pull request

This pull request is code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(or)

If your pull request contain incompatible change, you should also add it to docs/docs/en/guide/upgrede/incompatible.md

@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2024

Codecov Report

Attention: 192 lines in your changes are missing coverage. Please review.

Comparison is base (2f66a66) 38.17% compared to head (1cb5335) 38.38%.

❗ Current head 1cb5335 differs from pull request most recent head 3681244. Consider uploading reports for the commit 3681244 to get more accurate results

Files Patch % Lines
.../master/runner/taskgroup/TaskGroupCoordinator.java 42.10% 114 Missing and 7 partials ⚠️
.../server/master/runner/WorkflowExecuteRunnable.java 4.00% 22 Missing and 2 partials ⚠️
...aster/rpc/TaskInstanceWakeupOperationFunction.java 0.00% 16 Missing ⚠️
...cheduler/api/service/impl/ExecutorServiceImpl.java 0.00% 11 Missing ⚠️
.../server/master/event/StateEventHandlerManager.java 0.00% 6 Missing ⚠️
...cheduler/dao/repository/impl/TaskGroupDaoImpl.java 61.53% 2 Missing and 3 partials ⚠️
...ler/dao/repository/impl/TaskGroupQueueDaoImpl.java 78.57% 1 Missing and 2 partials ⚠️
...er/api/service/impl/TaskGroupQueueServiceImpl.java 0.00% 1 Missing ⚠️
...heduler/api/service/impl/TaskGroupServiceImpl.java 83.33% 0 Missing and 1 partial ⚠️
...e/dolphinscheduler/server/master/MasterServer.java 0.00% 1 Missing ⚠️
... and 3 more
Additional details and impacted files
@@             Coverage Diff              @@
##                dev   #15528      +/-   ##
============================================
+ Coverage     38.17%   38.38%   +0.20%     
- Complexity     4709     4748      +39     
============================================
  Files          1304     1305       +1     
  Lines         44828    44841      +13     
  Branches       4807     4800       -7     
============================================
+ Hits          17115    17212      +97     
+ Misses        25860    25760     -100     
- Partials       1853     1869      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch from 38ff493 to 212ba8d Compare January 28, 2024 06:36
@github-actions github-actions bot added the CI&CD label Jan 28, 2024
Copy link

sonarcloud bot commented Jan 28, 2024

Please retry analysis of this Pull-Request directly on SonarCloud

@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch 5 times, most recently from 2d17d56 to 98a8d8f Compare January 28, 2024 14:59
@caishunfeng caishunfeng added this to the 3.2.1 milestone Jan 29, 2024
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch from 98a8d8f to 3cfd45c Compare January 29, 2024 03:50
Comment on lines +552 to +554
ProcessInstance processInstance = processInstanceDao.queryOptionalById(taskGroupQueue.getProcessId())
.orElseThrow(
() -> new ServiceException(Status.PROCESS_INSTANCE_NOT_EXIST, taskGroupQueue.getProcessId()));

Check notice

Code scanning / CodeQL

Unread local variable Note

Variable 'ProcessInstance processInstance' is never read.
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch 10 times, most recently from 59d46b2 to 82c1bce Compare January 30, 2024 09:01
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch 2 times, most recently from ec952bc to 628f690 Compare January 30, 2024 10:34
@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch 2 times, most recently from e36844f to b9e452e Compare January 31, 2024 14:37
Comment on lines +57 to +59
if (taskGroupId == null) {
throw new IllegalArgumentException("taskGroupId cannot be null");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there any reason we do not use @NonNull directly for parameter taskGroupId?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nonnull Cannot write error message. If we want to use a specific error message ,we cannot use @nonnull.

Comment on lines +85 to +98
DROP PROCEDURE IF EXISTS create_idx_t_ds_task_group_queue_in_queue;
delimiter d//
CREATE PROCEDURE create_idx_t_ds_task_group_queue_in_queue()
BEGIN
DECLARE index_exists INT DEFAULT 0;
SELECT COUNT(*) INTO index_exists FROM information_schema.statistics WHERE table_schema = (SELECT DATABASE()) AND table_name = 't_ds_task_group_queue' AND index_name = 'idx_t_ds_task_group_queue_in_queue';
IF index_exists = 0 THEN CREATE INDEX idx_t_ds_task_group_queue_in_queue ON t_ds_task_group_queue(in_queue);
END IF;
END;
d//
delimiter ;
CALL create_idx_t_ds_task_group_queue_in_queue;
DROP PROCEDURE create_idx_t_ds_task_group_queue_in_queue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could directly remove the related 3.3.0 upgrade. I would merge almost all upgrade sql from 3.3.0 to 3.2.1. and we will not release 3.3.0 until we close or nearly close 3.2.x

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove the upgrade 3.3.0. We can use another PR to do this.

Copy link
Contributor

@caishunfeng caishunfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ruanwenjun ruanwenjun force-pushed the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch from b9e452e to 3681244 Compare February 1, 2024 04:56
Copy link

sonarcloud bot commented Feb 1, 2024

Quality Gate Failed Quality Gate failed

Failed conditions

42.1% Coverage on New Code (required ≥ 60%)

See analysis details on SonarCloud

Copy link
Member

@Gallardot Gallardot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job! Rest LGTM.

log.warn("The TaskInstance: {} state: {} finished, will release the TaskGroupQueue: {}",
taskInstance.getName(), taskInstance.getState(), taskGroupQueue);
releaseTaskGroupQueueSlot(taskGroupQueue);
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useless code

taskGroupQueue.setInQueue(Flag.YES.getCode());
taskGroupQueue.setStatus(TaskGroupQueueStatus.ACQUIRE_SUCCESS);
taskGroupQueue.setUpdateTime(new Date());
taskGroupQueueDao.updateById(taskGroupQueue);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some repetitive code in the state change that might be suitable for turning into a function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion, I will use another PR to move these kind of code in the whole project into function.

@ruanwenjun ruanwenjun merged commit 8974233 into apache:dev Feb 1, 2024
62 of 63 checks passed
@ruanwenjun ruanwenjun deleted the dev_wenjun_fixTaskGroupQueueWillNeverWakeup branch February 1, 2024 09:14
zhongjiajie pushed a commit that referenced this pull request Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants