Use Task Output as Extra for Dataset Trigger and DAG Run Conf #38432
Conversation
```python
if TYPE_CHECKING:
    assert self.task

for obj in self.task.outlets or []:
    self.log.debug("outlet obj %s", obj)
    # Lineage can have other types of objects besides datasets
    if isinstance(obj, Dataset):
        if obj.extra:
```
Can we merge the event's static information with the dynamic information?
No, we should not :-(

I previously interpreted the `extra` field as dynamic data. But as it is actually intended to be metadata for the dataset itself, I misinterpreted it as an option to pass extra data for this "reference" to a Dataset (identified by its URI).
```python
if obj.extra:
    extra = obj.extra
```
I do not think this is right. It’s made quite clear in previous issues that `Dataset.extras` and `DatasetEvent.extras` are different things and should be kept separate.
Sorry, then it seems I misunderstood your comments in the previous PR discussions about this. Thanks for clearly documenting the differences between the `Dataset` and `DatasetEvent` `extra` fields in #38481. This opened my eyes and now I understand your push-back.

Now it is "clear" to me. The intent was not to overwrite or mangle "static" information for events.
```python
if obj.extra:
    extra = obj.extra
elif obj.extra_from_return:
    extra = result if isinstance(result, dict) else {str(self.task_id): result}
```
Automatically putting the value under a key is too magical to me. I would prefer this to either just forward the value (the extra field is capable of storing any JSON-able value, after all), or skip the value entirely if it has an unexpected type. The task ID is not a particularly obvious key either.
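A small sketch of the two behaviors being weighed here (the task ID and values are illustrative):

```python
result = 42  # a non-dict task return value

# The PR's current behavior: wrap non-dict results under the task ID key.
extra = result if isinstance(result, dict) else {"my_task": result}
assert extra == {"my_task": 42}

# The alternative suggested above: forward any JSON-able value unchanged.
extra = result
assert extra == 42
```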
```python
run_conf = {}
for event in dataset_events:
    if event.extra:
        run_conf.update(event.extra)
```
I would prefer we put this in a separate PR to discuss. It’s not entirely obvious how extras should be merged from different events that trigger the run.

I also don’t quite feel comfortable that the return value is still pushed to XCom if it is used as event extra. IMO it should go to either one or the other, not both.
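To illustrate the merge ambiguity, a sketch with two triggering events whose extras share a key (values made up):

```python
# Extras from two dataset events that both triggered the same run.
event_extras = [
    {"rows": 100, "source": "upstream_a"},
    {"rows": 250},
]

run_conf = {}
for extra in event_extras:
    run_conf.update(extra)

# The later event silently wins for the overlapping key:
assert run_conf == {"rows": 250, "source": "upstream_a"}
```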
```diff
@@ -2455,7 +2456,7 @@ def _run_raw_task(
         try:
             if not mark_success:
-                self._execute_task_with_callbacks(context, test_mode, session=session)
+                result = self._execute_task_with_callbacks(context, test_mode, session=session)
```
1. It seems that not all operators return a `result` on execution (example: DatabricksSubmitRunOperator).
2. Some operators do return a `result`, like GlueOperator and EmrServerlessStartJobOperator, but the return value is a `job_id` in these scenarios.

Questions:

1. If any of the above operators publishes an Airflow dataset, how do we specify the `extra` dictionary for the corresponding dataset?
Regarding 1+2: Yes, if no result is generated, the content will be just `None`. If it is a scalar like a job_id, it is rather a string. So in such cases the output is not usable for passing along to the dataset event.

Regarding your question 1: The publishing of the event actually happens after this line of code. The change was attempting to capture the result here in order to put it as extra a few lines of code later.
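A sketch of the result shapes discussed above (operator names from the thread, values made up); only the dict case maps cleanly onto an event extra:

```python
# Possible task results, by operator, as observed in this discussion.
possible_results = {
    "DatabricksSubmitRunOperator": None,          # no usable return value
    "EmrServerlessStartJobOperator": "00abc123",  # a job_id string
    "plain_taskflow_task": {"rows": 128},         # a dict fits the extra field
}

usable = {k: v for k, v in possible_results.items() if isinstance(v, dict)}
assert usable == {"plain_taskflow_task": {"rows": 128}}
```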
I gave this a longer thought and I think I’m not particularly fond of the approach, mainly because it mixes the …
Thanks for sharing your concerns. After reading #38481 I better understand what you meant. I did not understand the difference between the `Dataset` extra and the `DatasetEvent` extra. Therefore I'm now in favor of PR #38481, where you propose a simple but explicit possibility to attach an extra per outlet without mixing it with the return value or XCom.

Yes, and indeed, the merging of multiple extras as DAG run conf is something separate, though this PR was a 2-for-1 that also published the extra as DAG run conf.

So I will close this "test balloon" PR here and "maybe" will open a separate PR for the topic of using extra as DAG run conf... if time permits.
This PR is a proposal to fix/resolve the feature request described in #37810 with a thin approach.

Idea for this solution:

- If a Dataset outlet has no static `extra` and sets `extra_from_return=True`, use the task result as the `extra`.
- Pass the `extra` to the dataset-triggered DAG as its configuration/parameters.
- If multiple dataset events carry extras, the `extra` dictionaries are merged.

This PR supersedes PR #37888.
closes: #37810
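For reference, a minimal sketch of how the proposed API might be used in a DAG (`extra_from_return` is the parameter this PR introduces, not an existing Airflow feature; URI and values are illustrative):

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# extra_from_return=True is the new flag proposed by this PR.
example = Dataset("s3://bucket/data.json", extra_from_return=True)

@dag(start_date=pendulum.datetime(2024, 1, 1), schedule=None, catchup=False)
def producer():
    @task(outlets=[example])
    def produce():
        # With this PR, the return value would become the DatasetEvent
        # extra and, merged across events, the triggered run's conf.
        return {"rows": 128}

    produce()

producer()
```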