Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] [Dependent] The date rules of the dependent node are ambiguous. #15129

Closed
2 of 3 tasks
reele opened this issue Nov 7, 2023 · 3 comments · Fixed by #15289
Closed
2 of 3 tasks

[Bug] [Dependent] The date rules of the dependent node are ambiguous. #15129

reele opened this issue Nov 7, 2023 · 3 comments · Fixed by #15289
Assignees
Labels
3.2.1 backend bug Something isn't working

Comments

@reele
Copy link
Contributor

reele commented Nov 7, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

目前在执行过程中有两种日期时间:

  1. 调度时间
  2. 执行时间(或开始时间)

但目前存在一个逻辑问题,就是dependent节点使用processInstance.scheduleTime作为日期基准去匹配startTime

3.2.0版本中,'dependent'节点的匹配逻辑是:

  1. 通过findLastProcessInterval找到依赖任务的ProcessInstance

    • 搜索顺序是:
      • a. 通过 queryLastSchedulerProcessInterval 查询 ProcessInstance, 条件是 schedule_time >= #{startTime} and schedule_time <= #{endTime}
      • b. 通过 queryLastManualProcessInterval 查询 ProcessInstance, 条件是 start_time >= #{startTime} and start_time <= #{endTime}

    ProcessInstance processInstance = findLastProcessInterval(dependentItem.getDefinitionCode(),
    dateInterval, testFlag);

  2. 通过findValidTaskListByProcessId匹配配依赖的任务

    List<TaskInstance> taskInstanceList =
    taskInstanceDao.queryValidTaskListByWorkflowInstanceId(processInstance.getId(), testFlag);
    for (TaskInstance task : taskInstanceList) {
    if (task.getTaskCode() == taskCode) {
    taskInstance = task;
    break;
    }
    }

dev版本中,'dependent'节点的匹配逻辑是:

  1. 通过queryLastTaskInstanceIntervalByTaskCode查找 TaskInstance, 条件是 start_time >= #{startTime} and start_time <= #{endTime}
    TaskInstance taskInstance =
    taskInstanceDao.queryLastTaskInstanceIntervalByTaskCode(depTaskCode, dateInterval, testFlag);

在我所处的场景(银行, 数据仓库)中, '业务日期'或'数据日期'尤其重要, 因为对于数据加工的过程, 依赖的重点在于 '哪天的数据已经处理完成', 而不是 '数据的处理在哪天执行完成过', 数据的使用方涉及监管报送、报表还有诸多管理类系统, 对数据所处的业务时间的敏感度非常高, 所以在银行行业内的专业调度工具一般主要考虑数据的营业日期, 而不是任务的执行时间.

再者, 银行相关的业务系统非常多, 除了直接营业系统外, 还有管理类系统、第三方数据源、第三方托管系统, 截止上月末的数据很有可能会在下个月的2-3号才会产生, 比如财务系统出账调账就会延迟几天, 这种情况非常多且常规, 目前我负责的数据仓库已经有40+个上游系统或数据源, 可能延迟的至少有5个, 而因上游延迟会推后的数据加工任务就有上百个.

所以在对业务日期非常敏感且存在可能延迟的环境下, 对于3.2.0dev版本有以下问题:
1. 对于3.2.0版本, 假设'dependent'节点(today)所依赖的任务T1在当天还没有执行, 这时我对T1执行了日期为3天前的补数操作, 这就会触发上面3.2.0版本1.b的逻辑导致依赖意外检测成功。
2. 对于dev版本, 假设上游的数据下发推迟了1天, 被依赖的任务T1在第二天才完成, 'dependent'节点包括它的下游节点永远也不会成功了, 因为永远不会有新任务的startTime发生在上一天。

目前我一直是维护一个本地化的版本,移除dependent节点对startTime的操作,完全依赖scheduleTime


The dependency logic of the dependent node in the current implementation has two types of datetime:

  1. Schedule time
  2. Execution time (or start time)

However, there is a logical issue with the dependent node using processInstance.scheduleTime as the date base to match startTime.

In version 3.2.0, the matching logic of the 'dependent' node is:

  1. Find the dependent task's ProcessInstance by calling findLastProcessInterval.

    • The search order is:
      • a. Query ProcessInstance by calling queryLastSchedulerProcessInterval with the condition schedule_time >= #{startTime} and schedule_time <= #{endTime}.
      • b. Query ProcessInstance by calling queryLastManualProcessInterval with the condition start_time >= #{startTime} and start_time <= #{endTime}.

    Reference link:

    ProcessInstance processInstance = findLastProcessInterval(dependentItem.getDefinitionCode(),
    dateInterval, testFlag);

  2. Match the dependent tasks by calling findValidTaskListByProcessId.
    Reference link:

    List<TaskInstance> taskInstanceList =
    taskInstanceDao.queryValidTaskListByWorkflowInstanceId(processInstance.getId(), testFlag);
    for (TaskInstance task : taskInstanceList) {
    if (task.getTaskCode() == taskCode) {
    taskInstance = task;
    break;
    }
    }

In the dev version, the matching logic of the 'dependent' node is:

  1. Find TaskInstance by calling queryLastTaskInstanceIntervalByTaskCode with the condition start_time >= #{startTime} and start_time <= #{endTime}.
    Reference link:
    TaskInstance taskInstance =
    taskInstanceDao.queryLastTaskInstanceIntervalByTaskCode(depTaskCode, dateInterval, testFlag);

In my scenario (banking industry, data warehouse), the 'business date' or 'data date' is particularly important because for data processing, the focus of dependency lies in 'which day's data has been processed', rather than 'when was the data processed'. The users of data include regulatory reporting, reports, and many management systems, which are highly sensitive to the business time of the data. Therefore, professional scheduling tools in the banking industry mainly consider the business date of the data, rather than the execution time of tasks.

Furthermore, there are many banking-related business systems, including direct business systems, management systems, third-party data sources, and third-party hosting systems. The data at the end of last month may not be available until the 2nd or 3rd day of next month, such as financial system reconciliation that delays for several days. This situation is very common. Currently, I am responsible for more than 40 upstream systems or data sources, and there may be delays in at least 5 of them. As a result, there are hundreds of data processing tasks that may be delayed due to upstream delays.

Therefore, in an environment where the business date is highly sensitive and there may be delays, there are problems with versions 3.2.0 and dev:
1. For version 3.2.0, assuming that the dependent task T1 that the 'dependent' node (today) depends on has not been executed on that day, if I perform a 'supplement' operation on it with a date three days ago, this will trigger the above logic in version 3.2.0, causing an unexpected detection of dependency success.
2. For the dev version, assuming that the upstream data is delayed by one day, and the dependent task T1 is completed on the second day, the 'dependent' node and its downstream nodes will never succeed because there will never be a new task's startTime that occurs on the previous day.

Currently, I have been maintaining a localized version that removes the operation of dependent node on startTime and relies solely on scheduleTime.

What you expected to happen

首先,用工作流的scheduleTime匹配任务的startTime是否合理?

其次,如果确实有不同的需求,是不是可以增加一个选项,指定依赖节点是检测调度时间还是实际开始时间


Firstly, is it reasonable to match the startTime of a task with the scheduleTime?

Secondly, if there are indeed different requirements, can an option be added to specify whether the dependent node detects the scheduleTime or the actual startTime?

How to reproduce

请参考上述内容


Please refer to the above content.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@reele reele added bug Something isn't working Waiting for reply Waiting for reply labels Nov 7, 2023
Copy link

github-actions bot commented Nov 7, 2023

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

Currently there are two types of datetimes in execution:

  1. Scheduling time
  2. Execution time (or start time)

But there is currently a logical problem, that is, the dependent node uses processInstance.scheduleTime as the date base to match startTime.

In version 3.2.0, the matching logic of 'dependent' nodes is:

  1. Find the ProcessInstance of the dependent task through findLastProcessInterval
    • The search order is:
  • a. Query ProcessInstance through queryLastSchedulerProcessInterval, the condition is schedule_time >= #{startTime} and schedule_time <= #{endTime}
  • b. Query ProcessInstance through queryLastManualProcessInterval, the condition is start_time >= #{startTime} and start_time <= #{endTime}

ProcessInstance processInstance = findLastProcessInterval(dependentItem.getDefinitionCode(),
dateInterval, testFlag);

  1. Match dependent tasks through findValidTaskListByProcessId
    List<TaskInstance> taskInstanceList =
    taskInstanceDao.queryValidTaskListByWorkflowInstanceId(processInstance.getId(), testFlag);
    for (TaskInstance task : taskInstanceList) {
    if (task.getTaskCode() == taskCode) {
    taskInstance = task;
    break;
    }
    }

In the dev version, the matching logic of the 'dependent' node is:

  1. Find TaskInstance through queryLastTaskInstanceIntervalByTaskCode, the condition is start_time >= #{startTime} and start_time <= #{endTime}
    TaskInstance taskInstance =
    taskInstanceDao.queryLastTaskInstanceIntervalByTaskCode(depTaskCode, dateInterval, testFlag);

In the scenario I am in (bank, data warehouse), 'business date' or 'data date' is particularly important, because for the data processing process, the key point of dependence is 'which day the data has been processed', not ' On which day the data processing was completed? The users of the data involve regulatory submissions, reports, and many management systems. They are very sensitive to the business time where the data is located, so it is a professional scheduling tool in the banking industry. Generally, the business date of the data is mainly considered, rather than the execution time of the task.

Furthermore, there are many bank-related business systems. In addition to direct business systems, there are also management systems, third-party data sources, and third-party custody systems. The data as of the end of last month will most likely be available on the 2-3rd of next month. For example, the financial system will delay the payment and adjustment for several days. This situation is very common and routine. Currently, the data warehouse I am responsible for has more than 40 upstream systems or data sources, and at least 5 of them may be delayed. There are hundreds of data processing tasks that will be postponed due to upstream delays.

Therefore, in an environment that is very sensitive to business dates and has possible delays, there are the following problems for the 3.2.0 and dev versions:

  1. For the 3.2.0 version, assuming that the task T1 that the 'dependent' node (today) depends on has not been executed on the same day, then I executed the complement of the date 3 days ago on T1 operation, this will trigger the logic of the above 3.2.0 version 1.b, causing the dependency to be unexpectedly detected successfully.
  2. For the dev version, it is assumed that the upstream data delivery is delayed by 1 day, and the dependent task T1 is not completed until the next day. The 'dependent' node including its downstream nodes will never succeed because it will never succeed. There will be no new tasks whose startTime occurred on the previous day.

The dependency logic of the dependent node in the current implementation has two types of datetime:

  1. Schedule time
  2. Execution time (or start time)

However, there is a logical issue with the dependent node using processInstance.scheduleTime as the date base to match startTime.

In version 3.2.0, the matching logic of the 'dependent' node is:

  1. Find the dependent task's ProcessInstance by calling findLastProcessInterval.
    -The search order is:

    • a. Query ProcessInstance by calling queryLastSchedulerProcessInterval with the condition schedule_time >= #{startTime} and schedule_time <= #{endTime}.
    • b. Query ProcessInstance by calling queryLastManualProcessInterval with the condition start_time >= #{startTime} and start_time <= #{endTime}.

    Reference link:

    ProcessInstance processInstance = findLastProcessInterval(dependentItem.getDefinitionCode(),
    dateInterval, testFlag);

  2. Match the dependent tasks by calling findValidTaskListByProcessId.
    Reference link:

    List<TaskInstance> taskInstanceList =
    taskInstanceDao.queryValidTaskListByWorkflowInstanceId(processInstance.getId(), testFlag);
    for (TaskInstance task : taskInstanceList) {
    if (task.getTaskCode() == taskCode) {
    taskInstance = task;
    break;
    }
    }

In the dev version, the matching logic of the 'dependent' node is:

  1. Find TaskInstance by calling queryLastTaskInstanceIntervalByTaskCode with the condition start_time >= #{startTime} and start_time <= #{endTime}.
    Reference link:
    TaskInstance taskInstance =
    taskInstanceDao.queryLastTaskInstanceIntervalByTaskCode(depTaskCode, dateInterval, testFlag);

In my scenario (banking industry, data warehouse), the 'business date' or 'data date' is particularly important because for data processing, the focus of dependency lies in 'which day's data has been processed', rather than 'when was the data processed'. The users of data include regulatory reporting, reports, and many management systems, which are highly sensitive to the business time of the data. Therefore, professional scheduling tools in the banking industry mainly consider the business date of the data, rather than the execution time of tasks.

Furthermore, there are many banking-related business systems, including direct business systems, management systems, third-party data sources, and third-party hosting systems. The data at the end of last month may not be available until the 2nd or 3rd day of next month, such as financial system reconciliation that delays for several days. This situation is very common. Currently, I am responsible for more than 40 upstream systems or data sources, and there may be delays in at least 5 of them. As a As a result, there are hundreds of data processing tasks that may be delayed due to upstream delays.

Therefore, in an environment where the business date is highly sensitive and there may be delays, there are problems with versions 3.2.0 and dev:

  1. For version 3.2.0, assuming that the dependent task T1 that the 'dependent' node (today) depends on has not been executed on that day, if I perform a 'supplement' operation on it with a date three days ago, this will trigger the above logic in version 3.2.0, causing an unexpected detection of dependency success.
  2. For the dev version, assuming that the upstream data is delayed by one day, and the dependent task T1 is completed on the second day, the 'dependent' node and its downstream nodes will never succeed because there will never be a new task's startTime that occurs on the previous day.

What you expected to happen

First, is it reasonable to match the scheduleTime of the workflow to the startTime of the task?

Secondly, if there are indeed different needs, is it possible to add an option to specify whether the dependent node detects the scheduling time or the actual start time?


Firstly, is it reasonable to match the startTime of a task with the scheduleTime?

Secondly, if there are indeed different requirements, can an option be added to specify whether the dependent node detects the scheduleTime or the actual startTime?

How to reproduce

Please refer to the above


Please refer to the above content.

Anything else

No response

Version

dev

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@SbloodyS
Copy link
Member

SbloodyS commented Nov 8, 2023

Good catch! Would you like to submit a PR to fix it? @reele

@SbloodyS SbloodyS added backend 3.2.1 and removed Waiting for reply Waiting for reply labels Nov 8, 2023
@reele
Copy link
Contributor Author

reele commented Nov 8, 2023

Good catch! Would you like to submit a PR to fix it? @reele

Yeah, but I believe that the solution to this problem still needs further discussion. This issue in version 3.2.0 has been carried over from version 1.x.x and may have already been exploited by many people using the logic of 1.b, which is why someone submitted a PR to directly bypass the ProcessInstance.scheduleTime check for TaskInstance.startTime.

Similar issues have occurred before between versions 2.0.5 and 2.0.6, and after 2.0.6, it was also a direct bypass of the scheduleTime check for startTime. I have also submitted a corresponding PR, but it was directly rejected. Reference: #11725 and #11677. It is clear that there is a direct conflict of requirements regarding this issue among people.
@SbloodyS

reele pushed a commit to reele/dolphinscheduler that referenced this issue Dec 6, 2023
SbloodyS added a commit that referenced this issue Dec 8, 2023
…iguous. (#15289)

* [Fix-15129][Dependent] Fix the ambiguity in date rules for dependent node.

* [fix #15129] Revert ddl

* restore findLastProcessInterval

* update: mvn spotless:apply

---------

Co-authored-by: 李乐 <[email protected]>
Co-authored-by: xiangzihao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.2.1 backend bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants