Data versioning YAML API via catalog.yml #2662

Christophe-pere · 2023-06-08T19:15:20Z

Description

Hi,
I tried creating versions for the dataset I used in an ML experiment, but I got an error following the tutorial documentation.

Context

I created a pipeline to benchmark different models in the same flow structure. Everything works, and I can track the metrics in the kedro-viz environment. But when I tried to version the dataset as described in the documentation:

https://kedro-mlflow.readthedocs.io/en/stable/source/07_python_objects/01_DataSets.html

I got a VersionNotFoundError, which stopped the pipeline.

I also tried with kedro-mlflow through the YAML API. The pipeline runs, but nothing is stored or versioned.

Steps to Reproduce

Lines in the YAML catalog.yml file.

dataset:
  type: pandas.CSVDataSet
  filepath: data/01_raw/dataset.csv
  versioned: true

I then ran the pipeline with the command:

kedro run

I also tried with kedro-mlflow plugin and wrote:

dataset:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet 
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: data/01_raw/dataset.csv # must be a local file, wherever you want to log the data in the end

And ran the same command line. The pipeline runs, but nothing is stored.

Expected Result

Dataset versioning.

Actual Result

I got the error below with the versioned: true parameter.

VersionNotFoundError: Did not find any versions for CSVDataSet(filepath=/path/to/data/01_raw/dataset.csv, load_args={}, 
protocol=file, save_args={'index': False}, version=Version(load=None, save='2023-06-08T18.43.35.291Z'))

I'm pretty sure the error comes from my side and that I missed something.
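
For anyone landing here with the same error: as far as I understand Kedro's versioning, a dataset with versioned: true is not read from filepath itself but from <filepath>/<version>/<filename>, so a plain file at data/01_raw/dataset.csv that no run has saved yet has no version to load. The sketch below imitates that lookup with the standard library only. It is a simplified illustration under that assumption, not Kedro's implementation, and NoVersionFound, save_version and latest_version are hypothetical names.

```python
from datetime import datetime, timezone
from pathlib import Path
import tempfile


class NoVersionFound(Exception):
    """Hypothetical stand-in for Kedro's VersionNotFoundError."""


def generate_timestamp() -> str:
    # UTC timestamp with millisecond precision, e.g. 2023-06-08T18.43.35.291Z
    ts = datetime.now(tz=timezone.utc).strftime("%Y-%m-%dT%H.%M.%S.%fZ")
    return ts[:-4] + ts[-1:]  # drop the last three microsecond digits, keep "Z"


def versioned_path(filepath: Path, version: str) -> Path:
    # Assumed layout: <filepath>/<version>/<filename>
    return filepath / version / filepath.name


def save_version(filepath: Path, text: str) -> str:
    # Saving creates a new version directory (filepath must not be a plain file).
    version = generate_timestamp()
    target = versioned_path(filepath, version)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return version


def latest_version(filepath: Path) -> str:
    # Loading looks for version directories below filepath; a plain file has none.
    candidates = []
    if filepath.is_dir():
        candidates = sorted(p.parent.name for p in filepath.glob(f"*/{filepath.name}"))
    if not candidates:
        raise NoVersionFound(f"Did not find any versions for {filepath}")
    return candidates[-1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "dataset.csv"
        raw.write_text("a,b\n1,2\n")  # hand-placed raw file, never saved as a version
        try:
            latest_version(raw)
        except NoVersionFound as err:
            print("load fails:", err)

        encoded = Path(tmp) / "encoded_my_dataset.csv"
        version = save_version(encoded, "a,b\n1,2\n")
        print("load succeeds:", latest_version(encoded) == version)
```

Under this reading, versioning works out of the box for datasets that a node writes, while a raw input that was placed by hand would need either versioned: false or a file that already sits inside a version directory.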

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

MacBook Pro M1
Python: 3.9.6
Kedro Version: 0.18.8
MacOS: Ventura 13.5


astrojuanlu · 2023-06-15T13:36:36Z

Hello @Christophe-pere , sorry for the delay! Could you please raise this issue over https://github.com/Galileo-Galilei/kedro-mlflow? cc @Galileo-Galilei

Christophe-pere · 2023-06-16T01:58:07Z

Hi,

In fact, I just fixed this issue.

Versioning a dataset requires a catalog for the pipeline; the catalog.yml alone wasn't enough.

Thanks to the blog post https://waylonwalker.com/kedro-incremental-versioned-datasets/
I found a very useful command:

kedro catalog create --pipeline <name of your pipeline>

This generates a new folder containing a new YAML file, name_of_your_pipeline.yml. In it, you can replace MemoryDataSet (the default type) with whatever you want:

encoded_my_dataset:
  type: pandas.CSVDataSet 
  filepath: data/01_raw/versions/encoded_my_dataset
  versioned: true

Now the preprocessed data is versioned. Either this command isn't in the documentation or I didn't find it clear, but it's working now.

astrojuanlu · 2023-06-20T12:58:51Z

xref #2604

Galileo-Galilei · 2023-06-21T20:33:26Z

Regarding kedro-mlflow, I am very confused by the following sentence:

I also tried with kedro-mlflow plugin and wrote:

dataset:
    type: kedro_mlflow.io.artifacts.MlflowArtifactDataSet 
    data_set:
        type: pandas.CSVDataSet  # or any valid kedro DataSet
        filepath: data/01_raw/dataset.csv # must be a local file, wherever you want to log the data in the end

And ran the same command line. The pipeline runs, but nothing is stored.

Most likely, it either stores the data somewhere other than where you expect (certainly not in the data/ folder), or it does not store your dataset at all because the dataset is never saved inside a pipeline (i.e. it is not an output of a node). Since the issue seems solved, I won't dig further, but do not hesitate to ask if needed.
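
The second possibility can be pictured with a toy model: a run saves only the datasets that some node returns, and merely loads the ones that are pure inputs. The sketch below is plain Python, not Kedro or kedro-mlflow code. Node, saved_by_run and free_inputs are hypothetical helpers, and the encode node is an assumed example that reuses the dataset names from this thread.

```python
from typing import NamedTuple, Set, Tuple


class Node(NamedTuple):
    """Toy description of a pipeline node: a name plus dataset names."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def saved_by_run(nodes) -> Set[str]:
    # A run writes every dataset that some node returns.
    return {out for n in nodes for out in n.outputs}


def free_inputs(nodes) -> Set[str]:
    # Datasets that are consumed but never produced are only loaded.
    produced = saved_by_run(nodes)
    return {inp for n in nodes for inp in n.inputs} - produced


# Hypothetical pipeline: one node reads "dataset" and returns "encoded_my_dataset".
nodes = [Node("encode", ("dataset",), ("encoded_my_dataset",))]

if __name__ == "__main__":
    print("saved:", saved_by_run(nodes))
    print("only loaded:", free_inputs(nodes))
```

In this model, wrapping the raw "dataset" entry in an artifact dataset has no visible effect because nothing ever saves it, whereas the same wrapper on "encoded_my_dataset" would be triggered on every run.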

Christophe-pere closed this as completed Jun 16, 2023
