Data versioning YAML API via catalog.yml #2662

Closed
Christophe-pere opened this issue Jun 8, 2023 · 4 comments

Comments

@Christophe-pere

Description

Hi,
I tried creating versions for the dataset I used in an ML experiment, but I got an error following the tutorial documentation.

Context

I created a pipeline to benchmark different models in the same flow structure. Everything works, and I can track the metrics in the kedro-viz environment. But when I tried to version the dataset as described in the documentation:

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html

I got a VersionNotFoundError, which stopped the pipeline.

I also tried the kedro-mlflow YAML API. With it the pipeline runs, but nothing is stored or versioned.

Steps to Reproduce

I added these lines to the catalog.yml file:

dataset:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv
  versioned: true

I then ran the pipeline with the command:

kedro run

I also tried with kedro-mlflow plugin and wrote:

dataset:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet 
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: data/01_raw/dataset.csv # must be a local file, wherever you want to log the data in the end

And ran the same command line. The pipeline runs, but nothing is stored.

Expected Result

Dataset versioning.

Actual Result

I got the below error with the versioned: true parameter.

VersionNotFoundError: Did not find any versions for CSVDataSet(filepath=/path/to/data/01_raw/dataset.csv, load_args={}, 
protocol=file, save_args={'index': False}, version=Version(load=None, save='2023-06-08T18.43.35.291Z'))
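
For readers hitting the same error, here is a minimal sketch of where a versioned dataset is expected to live, assuming the layout described in the Kedro documentation (`<filepath>/<version>/<filename>`). The helper name `versioned_path` is hypothetical and is not part of Kedro. A plain file sitting directly at data/01_raw/dataset.csv has no version folder, which is likely why loading it with versioned: true finds no versions.

```python
from pathlib import PurePosixPath


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Build <filepath>/<version>/<filename>.

    Hypothetical helper: it only illustrates the assumed on-disk layout,
    it does not call Kedro.
    """
    base = PurePosixPath(filepath)
    # The catalog filepath becomes a directory; the file name is repeated
    # inside each version folder.
    return base / version / base.name


print(versioned_path("data/01_raw/dataset.csv", "2023-06-08T18.43.35.291Z"))
# data/01_raw/dataset.csv/2023-06-08T18.43.35.291Z/dataset.csv
```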

I'm fairly sure the error comes from my side and that I missed something.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • MacBook Pro M1
  • Python: 3.9.6
  • Kedro Version: 0.18.8
  • MacOS: Ventura 13.5
@astrojuanlu
Member

Hello @Christophe-pere, sorry for the delay! Could you please raise this issue over at https://github.com/Galileo-Galilei/kedro-mlflow? cc @Galileo-Galilei

@Christophe-pere
Author

Hi,

In fact, I just fixed this issue.

To version a dataset, the catalog.yml on its own wasn't enough; a pipeline-specific catalog was required.

Based on the blog https://waylonwalker.com/kedro-incremental-versioned-datasets/
I found a very useful command:

kedro catalog create --pipeline <name of your pipeline>

This generates a new folder containing a new YAML file, name_of_your_pipeline.yml. In it, you can replace the MemoryDataset (the default value) with whatever you want:

encoded_my_dataset:
  type: pandas.CSVDataSet 
  filepath: data/01_raw/versions/encoded_my_dataset
  versioned: true

Now the preprocessed data is versioned. Either I didn't find this command in the documentation or it was unclear to me, but it's working now.
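
A minimal sketch of why this works, under the assumption that each save of a versioned dataset writes a new timestamped folder and each load picks the most recent one. The functions `save_version` and `latest_version` are hypothetical stand-ins written with the standard library only; they are not Kedro's implementation.

```python
from pathlib import Path
from typing import Optional


def save_version(dataset_dir: Path, version: str, content: str) -> Path:
    """Write content to <dataset_dir>/<version>/<dataset_dir name> (assumed layout)."""
    target = dataset_dir / version / dataset_dir.name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target


def latest_version(dataset_dir: Path) -> Optional[str]:
    """Return the newest version folder name, or None when nothing was saved yet."""
    if not dataset_dir.is_dir():
        return None
    # Timestamps such as 2023-06-08T18.43.35.291Z sort chronologically as strings.
    versions = sorted(p.name for p in dataset_dir.iterdir() if p.is_dir())
    return versions[-1] if versions else None
```

With this model, a dataset that a node writes gets a version folder on every run, so a later load has something to resolve. A dataset that is only ever read, such as a raw input file, never gets one, and `latest_version` returns None.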

@astrojuanlu
Member

xref #2604

@Galileo-Galilei
Member

Galileo-Galilei commented Jun 21, 2023

Regarding kedro-mlflow, I am very confused by the following passage:

I also tried with kedro-mlflow plugin and wrote:

dataset:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet 
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: data/01_raw/dataset.csv # must be a local file, wherever you want to log the data in the end

And ran the same command line. The pipeline runs, but nothing is stored.

Most likely, it either stores the data somewhere other than where you expect (and certainly not in the data/ folder), or it does not store your dataset at all because the dataset is never saved inside a pipeline (i.e. it is not the output of a node). Since the issue seems solved, I won't dig further, but do not hesitate to ask if needed.
