Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite how dag to dataset / dataset alias are stored #41987

Merged
merged 3 commits into from
Sep 6, 2024

Conversation

Lee-W
Copy link
Member

@Lee-W Lee-W commented Sep 4, 2024

Why

If someone subclasses a dataset or a dataset alias and makes it unhashable. The dag ref will not be able to be handled.

What

Save the data needed as (type, URI or name) format in dag ref


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@Lee-W Lee-W force-pushed the fix-missing-key-in-stored-datasets branch from 7f1c45e to 40f2e0a Compare September 4, 2024 08:42
@Lee-W Lee-W marked this pull request as draft September 4, 2024 08:51
Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cautious approve :)

@uranusjr
Copy link
Member

uranusjr commented Sep 4, 2024

Doesn’t Dataset do this automatically? We shouldn’t let an unsanitised URI get into the system in the first place.

uri: str = attr.field(
converter=_sanitize_uri,
validator=[attr.validators.min_len(1), attr.validators.max_len(3000)],
)

@Lee-W
Copy link
Member Author

Lee-W commented Sep 4, 2024

Doesn’t Dataset do this automatically? We shouldn’t let an unsanitised URI get into the system in the first place.

uri: str = attr.field(
converter=_sanitize_uri,
validator=[attr.validators.min_len(1), attr.validators.max_len(3000)],
)

Yep, but new_datasets are DatasetModels instead of Dataset

@uranusjr
Copy link
Member

uranusjr commented Sep 4, 2024

I was thinking it’s weird we’re using DatasetModel here… (DatasetManager actually converts them back to Dataset for a user hook too)

I was having problems with this function when adding name too. Let me rewrite this entire function a bit.

@Lee-W
Copy link
Member Author

Lee-W commented Sep 4, 2024

I was thinking it’s weird we’re using DatasetModel here… (DatasetManager actually converts them back to Dataset for a user hook too)

I was having problems with this function when adding name too. Let me rewrite this entire function a bit.

I think the main reason is that we need to check the stored dataset models. Do you want me to continue working on this one? Or wait for your PR? Without it, some edges cases might fail

@uranusjr
Copy link
Member

uranusjr commented Sep 4, 2024

Hmm, wait, this still does not make sense even considering we’re using DatasetModel. d here is a Dataset, so d.uri should have already gone through _sanitize_uri when d is created. Is converter somehow not functioning the way it’s assumed to? Is there a bug somewhere else?

It’s probably best to come up with a test case here first to identify the actual issue. I don’t think calling _sanitize_uri is the correct thing to do, but it’s difficult to tell what the right thing is without an actual bug.

@Lee-W
Copy link
Member Author

Lee-W commented Sep 4, 2024

Hmm, wait, this still does not make sense even considering we’re using DatasetModel. d here is a Dataset, so d.uri should have already gone through _sanitize_uri when d is created. Is converter somehow not functioning the way it’s assumed to? Is there a bug somewhere else?

It’s probably best to come up with a test case here first to identify the actual issue. I don’t think calling _sanitize_uri is the correct thing to do, but it’s difficult to tell what the right thing is without an actual bug.

https://github.com/astronomer/astro-sdk/blob/main/python-sdk/example_dags/example_datasets.py

This is the DAG that fails and yep, I think the solution is incorrect and that's why I convert it to draft again

@Lee-W
Copy link
Member Author

Lee-W commented Sep 4, 2024

I think I found the root cause. will push in short

@Lee-W Lee-W force-pushed the fix-missing-key-in-stored-datasets branch from 40f2e0a to 2c7c51f Compare September 4, 2024 10:59
@Lee-W Lee-W changed the title fix(dag): fix missing key within stored_datasets and outlet_references, we should use sanitized uri instead Rewrite how dag to dataset / dataset alias are stored Sep 4, 2024
@Lee-W Lee-W force-pushed the fix-missing-key-in-stored-datasets branch from 2c7c51f to 8adc2c3 Compare September 4, 2024 11:02
@Lee-W Lee-W marked this pull request as ready for review September 4, 2024 11:03
@Lee-W
Copy link
Member Author

Lee-W commented Sep 4, 2024

Hmm, wait, this still does not make sense even considering we’re using DatasetModel. d here is a Dataset, so d.uri should have already gone through _sanitize_uri when d is created. Is converter somehow not functioning the way it’s assumed to? Is there a bug somewhere else?

It’s probably best to come up with a test case here first to identify the actual issue. I don’t think calling _sanitize_uri is the correct thing to do, but it’s difficult to tell what the right thing is without an actual bug.

This was due to some dataset/dataset alias subclass might not be hashable. I rewrote how the dag ref was handled.

@Lee-W Lee-W closed this Sep 5, 2024
@uranusjr
Copy link
Member

uranusjr commented Sep 5, 2024

If this is already passing, maybe we can also try to merge it first and do the related changes in a different PR. Do you think this is ready to be merged?

@uranusjr uranusjr reopened this Sep 5, 2024
@Lee-W
Copy link
Member Author

Lee-W commented Sep 5, 2024

If this is already passing, maybe we can also try to merge it first and do the related changes in a different PR. Do you think this is ready to be merged?

Yep, I think this one is ready to be merged

…es in bulk_write_to_db

some dataset model variables were named as datasets which could be confusing
it used to save the whole object in a set, but in some edge cases, a customized dataset might not be hashasble and thus will fail
@Lee-W Lee-W force-pushed the fix-missing-key-in-stored-datasets branch from 8adc2c3 to 87b5ad1 Compare September 6, 2024 00:46
Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit (actually two but on the same topic), looks good to me.

@uranusjr
Copy link
Member

uranusjr commented Sep 6, 2024

Meh, I’m going to just merge this since I have plans to refactor the entire function anyway. I can change the thing I want when that happens.

@uranusjr uranusjr merged commit daef5c5 into apache:main Sep 6, 2024
51 checks passed
@uranusjr uranusjr deleted the fix-missing-key-in-stored-datasets branch September 6, 2024 04:06
@Lee-W
Copy link
Member Author

Lee-W commented Sep 6, 2024

Thanks! @uranusjr I think this fix is needed for 2.10 so I'm going to backport now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants