Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DatasetManager.register_dataset_change(): Fall back to datasets' "extra" if arg extra is empty #37006

Closed
2 tasks done
Blizzke opened this issue Jan 25, 2024 · 3 comments
Closed
2 tasks done
Labels

Comments

@Blizzke
Copy link

Blizzke commented Jan 25, 2024

Description

I would like to know if it is possible to change the behavior of the register_dataset_change function to copy the extra from the specified dataset if the actual extra argument is empty?

Use case/motivation

We have a lot of datasets where we, within the dataset, have versions of the data.
Because - for lineage etc - we want to keep the URI for all of those the same (it is the same dataset after all), we use the extra argument to pass along the version number etc. Unfortunately, after a change is registered, we lose that information.

It seems logical to us that when no extra is specified (which is actually impossible as far as we can see, since the current calling locations don't even allow it), that you fall back to the extra of the dataset. Our override is as simple as

class DatasetManager(OriginalDatasetManager):
    def register_dataset_change(self, *, task_instance: TaskInstance, dataset: Dataset, extra=None, session: Session,
                                **kwargs) -> None:
        if extra is None and dataset.extra:
            extra = dataset.extra

        super().register_dataset_change(task_instance=task_instance, dataset=dataset, extra=extra, session=session,
                                        **kwargs)

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@Blizzke Blizzke added kind:feature Feature Requests needs-triage label for new issues that we didn't triage yet labels Jan 25, 2024
@RNHTTR RNHTTR added good first issue and removed needs-triage label for new issues that we didn't triage yet labels Jan 27, 2024
@uranusjr
Copy link
Member

uranusjr commented Mar 5, 2024

See #37810 and issues and PRs linked in it (including in comments). Doing this fallback would go against the original design vision.

@uranusjr uranusjr closed this as not planned Won't fix, can't repro, duplicate, stale Mar 5, 2024
@Blizzke
Copy link
Author

Blizzke commented Mar 8, 2024

I’m sorry but if I understand the original discussion that is completely different from what we try to accomplish.

I view the extra as a way to quantify the uri more. The uri identifies the dataset, the extra contains specific information on how/what was used in/from that dara when the task was triggered. In our case this is usually the data version.

I obviously can’t force you to implement it like this, but for us the uri identifies which dataset and triggers any scheduling, the extra identifies exactly which version of the data was used and is unique to every task that uses it as in/outlet. So I would argue that data is part of the dataset and not xcom.

@potiuk
Copy link
Member

potiuk commented Mar 8, 2024

I don't think it's the matter of forcing anyone or convincing single person.

Looks like you just want to propose a different way of treating extras - or maybe a completely new feature of dataset. And the right way of doing it @Blizzke is as usual - open a devlist discussion (not a github issue) where you explain your rationale, present your proposal, convince people at the devlist to your idea and in case of a bigger change write an Airflow Improvement Proposal describing your - apparently - design change (or addition) to the DataSet feature. Then you either are able to reach consensus that your idea is good. or when you can't reach consensus - you call for a vote.

This is how it works when you propose a design change to an important feature of Airflow that impact everyone using it - there is no other way. You will find all the links in https://airflow.apache.org/community/

It's entirely up to the arguments you present and how convincing you are and how complete your proposal will be to get others to agree to it. It's entirely in your hands, you just need to put an effort to convince those from the community who will take part in the discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants