Add lightweight serialization for deltalake tables #35462
Conversation
def serialize(o: object) -> tuple[U, str, int, bool]:
    from deltalake.table import DeltaTable
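The signature above follows Airflow's serde serializer protocol: return a `(data, classname, version, handled)` tuple, where the final flag tells the machinery whether this serializer handled the object. A minimal sketch of that shape, using a stand-in `FakeTable` class instead of the real `deltalake.table.DeltaTable` so it runs without the dependency (the payload keys and the `qualname` helper are illustrative assumptions, not the PR's exact implementation):

```python
# Sketch of the serde-style serializer protocol discussed in this PR.
# `FakeTable` stands in for deltalake.table.DeltaTable; the payload keys
# are illustrative assumptions, not the PR's actual fields.

class FakeTable:
    def __init__(self, table_uri: str, version: int = 0):
        self.table_uri = table_uri
        self._version = version

    def version(self) -> int:
        return self._version


def qualname(o: object) -> str:
    # Fully qualified class name, e.g. "__main__.FakeTable".
    return f"{type(o).__module__}.{type(o).__qualname__}"


def serialize(o: object) -> tuple[object, str, int, bool]:
    # Return (data, classname, version, handled); the last flag tells the
    # serde machinery whether this serializer actually handled the object.
    if not isinstance(o, FakeTable):
        return "", "", 0, False
    data = {"table_uri": o.table_uri, "version": o.version()}
    return data, qualname(o), 1, True


def deserialize(classname: str, version: int, data: dict) -> FakeTable:
    # Re-instantiate by passing the stored data back to __init__, which is
    # the "lightweight" approach discussed later in this thread.
    return FakeTable(table_uri=data["table_uri"], version=data["version"])


table = FakeTable("s3://bucket/my-table", version=3)
payload, name, ver, handled = serialize(table)
restored = deserialize(name, ver, payload)
```

The sketch keeps only what is needed to re-instantiate the object, rather than snapshotting its full state.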
How do we handle versioning of the underlying third-party libraries like deltalake here and iceberg somewhere else?
We should add or update some docs too on what serializers are available natively
You mean if the API changes? We mostly assume the API remains relatively stable for this kind of work (as we do with pandas and others), and we do not tie users to a particular version of Iceberg or DeltaLake at the moment, as neither is core to Airflow nor part of one of the providers. If we do move some serializers to providers, which is probably the right thing to do, then it makes more sense to improve this.
Docs, yes, but I think having that with the rework into providers is more logical.
> If we do move some serializers to providers, which is probably the right thing to do, then it makes more sense to improve this.

Yeah, we will need to handle that sooner. deltalake, for example, is still on 0.x (https://pypi.org/project/deltalake/#history), so in theory any release can be breaking if they follow SemVer.

> Docs, yes, but I think having that with the rework into providers is more logical.

Are you planning to do that soon, before we release the next minor Airflow version?
I agree. On the other hand, we rely only very lightly on the API, just enough to be able to re-instantiate the object, which typically means getting the right data to pass to __init__.
Not sure if I will get it in for the next minor release. Provider re-work isn't much fun :-). And this serialization, while it works, is not the final design. I was thinking about doing something like airflow.catalog, but I am not sure about that yet. In addition, I am thinking about Arrow; maybe that should be our default format for this. Again, I am not sure and need to play with it a bit more. Hence, for now, the lightweight part.
👍
This adds support for deltalake table serialization.
It allows you to do the following:
cc: @hussein-awala
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes, please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.