
Use Task Output as Extra for Dataset Trigger and DAG Run Conf #38432

Conversation


@jscheffl jscheffl commented Mar 23, 2024

This PR proposes a thin approach to resolving the feature request described in #37810.

Idea for this solution:

  • Allow using the task result as a dynamic extra
    • If the dataset defines extra_from_return=True, use the task result as the extra
  • Pass the dict of extras to the dataset-triggered DAG as its configuration/parameters
    • If multiple events trigger a DAG, the extra dictionaries are merged
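The proposed behaviour can be sketched in plain Python, independent of Airflow internals (the function names here are illustrative only, not actual Airflow API):

```python
def resolve_event_extra(dataset_extra, extra_from_return, result, task_id):
    """Proposed resolution order: a statically defined extra wins,
    otherwise fall back to the task's return value."""
    if dataset_extra:
        return dataset_extra
    if extra_from_return:
        # Non-dict results are wrapped under the task id so the
        # merged run conf stays a dict.
        return result if isinstance(result, dict) else {str(task_id): result}
    return {}


def build_run_conf(event_extras):
    """Merge the extras of all triggering events into one DAG run conf."""
    run_conf = {}
    for extra in event_extras:
        if extra:
            run_conf.update(extra)
    return run_conf
```

This is only a model of the PR's intent; the actual change lives in `TaskInstance` and the scheduler, as the diff hunks below show.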

This PR supersedes PR #37888.

closes: #37810

@jscheffl jscheffl requested a review from uranusjr March 23, 2024 22:31
@boring-cyborg boring-cyborg bot added area:Scheduler including HA (high availability) scheduler kind:documentation labels Mar 23, 2024
if TYPE_CHECKING:
    assert self.task

for obj in self.task.outlets or []:
    self.log.debug("outlet obj %s", obj)
    # Lineage can have other types of objects besides datasets
    if isinstance(obj, Dataset):
        if obj.extra:
Contributor:

Can we merge the event static information with dynamic information?

Contributor Author:

No, we should not :-(
I (previously) interpreted the extra field as dynamic data. But since it is actually intended to be metadata for the dataset itself, I misinterpreted it as an option to pass extra data along with this "reference" to a Dataset (identified by the URI).

Comment on lines +2565 to +2566
if obj.extra:
    extra = obj.extra
Member:

I do not think this is right. It’s made quite clear in previous issues that Dataset.extra and DatasetEvent.extra are different things and should be kept separate.

Contributor Author:

Sorry, then it seems I misunderstood your comments in the previous PR discussions about this. Thanks for clearly documenting the differences between the Dataset and DatasetEvent extra fields in #38481. That opened my eyes, and now I understand your push-back.
Now it is "clear" to me. The intent was not to overwrite or mangle "static" information for events.

if obj.extra:
    extra = obj.extra
elif obj.extra_from_return:
    extra = result if isinstance(result, dict) else {str(self.task_id): result}
Member:

Automatically putting the value under a key is too magical to me. I would prefer that this either just forward the value (the extra field can store any JSON-able value, after all) or skip the value entirely if it has an unexpected type. The task ID is not a particularly obvious key either.

Comment on lines +1285 to +1288
run_conf = {}
for event in dataset_events:
    if event.extra:
        run_conf.update(event.extra)
Member:

I would prefer we put this in a separate PR to discuss. It’s not entirely obvious how extras should be merged from different events that trigger the run.
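The ambiguity raised here is easy to demonstrate: with a plain `dict.update` merge, the last event processed silently wins on key collisions (the paths below are made up):

```python
# Two triggering events write the same key; dict.update keeps the
# value from whichever event is processed last, so the resulting
# run conf depends on event ordering.
event_extras = [{"path": "s3://bucket/a"}, {"path": "s3://bucket/b"}]

run_conf = {}
for extra in event_extras:
    run_conf.update(extra)

print(run_conf)  # {'path': 's3://bucket/b'} -- last writer wins
```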

@uranusjr (Member):

I also don’t quite feel comfortable that the return value is still pushed to XCom if it is used as the event extra. IMO it should go to one or the other, not both.

@@ -2455,7 +2456,7 @@ def _run_raw_task(

 try:
     if not mark_success:
-        self._execute_task_with_callbacks(context, test_mode, session=session)
+        result = self._execute_task_with_callbacks(context, test_mode, session=session)
Reviewer:

  1. It seems that not all operators return a result on execution (example: DatabricksSubmitRunOperator).
  2. Some operators do return a result, like GlueOperator and EmrServerlessStartJobOperator, but in those cases the return value is a job_id.

Questions:

  1. If any of the above operators publishes an Airflow dataset, how do you specify the extra dictionary in the corresponding dataset?

Contributor Author:

Regarding 1+2: Yes, if no result is generated, the content will just be None. If it is a scalar like a job_id, it is a string. In such cases the output is not usable for passing along to the dataset event.

Regarding your question 1: The event is actually published after this line of code. The change was attempting to capture the result so it can be put into the extra a few lines further down.

@uranusjr (Member):

I gave this a longer thought and I think I’m not particularly fond of the approach mainly because it mixes Dataset.extra and DatasetEvent.extra. As described in #37810, the current design is very explicit on the two being separate—Dataset.extra describes the thing that’s pointed to by the URI, while DatasetEvent.extra describes the data written into the thing represented by the URI. I quoted a few paragraphs from @blag in the issue. The design here directly passes through Dataset.extra to DatasetEvent.extra, which I believe would not make sense to most people. (For the record, I actually proposed copying Dataset.extra to DatasetEvent.extra initially, but @jedcunningham convinced me to not do it before I published #37810.)

The extra_from_return argument also has the same problem fundamentally—the flag controlling DatasetEvent behaviour should not be set on Dataset. It should be on the task instead (or on DatasetEvent itself, but that’s awkward because the event does not exist conceptually when we want to emit the extra).
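The separation described above can be modeled in a few lines (the class names mirror the Airflow concepts, but this is a simplified sketch, not the real implementation, and the example values are made up):

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    # Describes the thing behind the URI; extra is static metadata
    # attached at definition time.
    uri: str
    extra: dict = field(default_factory=dict)


@dataclass
class DatasetEvent:
    # Describes one write to that thing; extra is per-event, emitted
    # by the producing task, and never copied from Dataset.extra.
    dataset: Dataset
    extra: dict = field(default_factory=dict)


ds = Dataset("s3://bucket/orders", extra={"owner": "team-data"})
event = DatasetEvent(ds, extra={"rows_written": 1234})

# The two extras describe different things and stay separate.
assert ds.extra != event.extra
```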

@jscheffl (Contributor Author):

> I gave this a longer thought and I think I’m not particularly fond of the approach mainly because it mixes Dataset.extra and DatasetEvent.extra. As described in #37810, the current design is very explicit on the two being separate—Dataset.extra describes the thing that’s pointed to by the URI, while DatasetEvent.extra describes the data written into the thing represented by the URI. I quoted a few paragraphs from @blag in the issue. The design here directly passes through Dataset.extra to DatasetEvent.extra, which I believe would not make sense to most people. (For the record, I actually proposed copying Dataset.extra to DatasetEvent.extra initially, but @jedcunningham convinced me to not do it before I published #37810.)
>
> The extra_from_return argument also has the same problem fundamentally—the flag controlling DatasetEvent behaviour should not be set on Dataset. It should be on the task instead (or on DatasetEvent itself, but that’s awkward because the event does not exist conceptually when we want to emit the extra).

Thanks for sharing your concerns. After reading #38481 I better understand what you meant. I did not understand the difference between Dataset.extra (I always saw "Dataset" as the holder of the URI pointing to a dataset) and DatasetEvent.extra (which I had not noticed before; I expected it to be the result of the outlet from execution, not a definition made by the user). Your enhancement of the docs in PR #38481 makes it clear now.

Therefore I'm now in favor of PR #38481, where you propose a simple but explicit way to attach an extra per outlet without mixing it with the return value or XCom.

Yes, and indeed, merging multiple extras into the DAG run conf is a separate topic; I saw it as a two-for-one in this PR, publishing the extra dynamically and also showing how to easily consume it.

So I will close this "test balloon" PR and "maybe" open a separate PR on using the extra as DAG run conf... if time permits.

@jscheffl jscheffl closed this Mar 27, 2024

Successfully merging this pull request may close these issues.

Annotate a Dataset Event in the Source Task