Unable to save and load from versioned ManagedTableDatasets #920
Comments
Hey @jstammers, thanks for raising this issue! The
@MinuraPunchihewa Do you have an idea? I am actually confused that We can probably implement this for a specific dataset, but before that, can I understand the use case a bit more? Do you plan to use this as in the example you stated, or are you just checking whether it's versioned?
@noklam I am happy to take a look at this. About your comment on
@noklam yes, the use case I have in mind is to load the previous version of a delta table, so that I can validate the changes to the table after updating it. As a pipeline, it would look something like:
pipeline = Pipeline([
    node(update_table, inputs=["table", "staging_table"], outputs="updated_table"),
    node(validate_changes, inputs=["table", "updated_table"], outputs="changes"),
])
where
As for inferring the version number, I think the simplest way to do that is to use the following Spark SQL statement:
current_version = spark.sql("DESCRIBE HISTORY <catalog>.<database>.<table>").select("version").first()[0]
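A minimal sketch of how the previous version could be derived once the history is known, assuming Delta's `VERSION AS OF` time-travel syntax. The helper names `previous_version` and `time_travel_query` are hypothetical and not part of Kedro or kedro-datasets, and the Spark calls are left as comments because no session is created here.

```python
from typing import Optional, Sequence


def previous_version(versions: Sequence[int]) -> Optional[int]:
    """Return the version just before the latest one, or None if there is none."""
    ordered = sorted(set(versions), reverse=True)
    return ordered[1] if len(ordered) > 1 else None


def time_travel_query(table: str, version: int) -> str:
    """Build a Delta time-travel query for a fully qualified table name."""
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    return f"SELECT * FROM {table} VERSION AS OF {version}"


# With a live Spark session (an assumption, not created here), the versions
# could be collected from the table history:
#   rows = spark.sql(f"DESCRIBE HISTORY {table}").select("version").collect()
#   versions = [row[0] for row in rows]
versions = [3, 2, 1, 0]  # illustrative history, newest first
print(time_travel_query("catalog.database.table", previous_version(versions)))
```

The resulting query string could then be passed to `spark.sql` inside the validation node to read the table as it was before the update.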
I'll be looking at a PoC to play with Iceberg and versioning, and may come back to this a little later. @jstammers The other option is to do this validation with a hook instead of a node (there is nothing wrong with the current approach either). How does the node generate the delta change? I see that the node has two inputs and produces "changes" as its output. Is this some kind of incremental pipeline?
From my understanding, it was designed for file-based data. Version is a class that takes There are a couple of requirements here:
Take this example:
my_data:
  type: some.FileDataset
  path: my_folder/abc.file
  versioned: true
This is expected to save the file as
Note that
Cc @merelcht
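For readers unfamiliar with file-based versioning, here is a rough sketch of the layout as I understand it: each save goes into a folder named after the version, underneath the original path. The `Version` tuple below is a stand-in written for this example rather than an import from Kedro, `versioned_path` is a hypothetical helper, and the timestamp is purely illustrative.

```python
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class Version(NamedTuple):
    """Stand-in for a (load, save) version pair; not imported from Kedro."""

    load: Optional[str]
    save: Optional[str]


def versioned_path(path: str, version: str) -> PurePosixPath:
    """Place the file in a per-version folder under the original path."""
    base = PurePosixPath(path)
    return base / version / base.name


# Illustrative save version; real values are generated at run time.
version = Version(load=None, save="2024-10-01T12.00.00.000Z")
print(versioned_path("my_folder/abc.file", version.save))
```

A Delta table has no such folder layout to resolve, which is why the same mechanism does not map directly onto integer table versions.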
Description
I am trying to make use of a versioned ManagedTableDataset so that I can correctly load and save using different versions of a delta table. I'm encountering an error when trying to load from a catalog, because the version is incorrectly configured for a delta table.
Context
How has this bug affected you? What were you trying to accomplish?
Steps to Reproduce
Expected Result
When creating these datasets, I expect that the load version numbers should be resolved from the current version, the specified version number or 0 if the table doesn't exist.
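The resolution rule above could be sketched as follows. The precedence is an assumption on my part (an explicitly specified version wins, then the table's current version, then 0 for a table that does not exist), and `resolve_load_version` is a hypothetical helper, not an existing kedro-datasets function.

```python
from typing import Optional


def resolve_load_version(requested: Optional[int], current: Optional[int]) -> int:
    """Pick the Delta table version to load.

    requested: version set explicitly on the dataset, if any.
    current: latest version of the table, or None if the table does not exist.
    """
    if requested is not None:
        return requested  # assumed: an explicit version takes precedence
    if current is not None:
        return current  # default to the table's latest version
    return 0  # the table does not exist yet


print(resolve_load_version(None, 5))
print(resolve_load_version(2, 5))
print(resolve_load_version(None, None))
```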
When calling ManagedTableDataset.save, the load and save version numbers should be incremented accordingly.
Actual Result
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
Kedro version (pip show kedro or kedro -V): 0.19.9
Kedro plugin and plugin version (pip show kedro-airflow): kedro-datasets - 5.1.0
Python version (python -V): 3.10.12