
Add MNIST dataset to the dataset-registry for the new example-get-started #27

Closed
iesahin opened this issue Apr 12, 2021 · 4 comments
@iesahin
Contributor

iesahin commented Apr 12, 2021

I started writing the experiments tutorial using the original MNIST dataset from deepai.org.

Currently I'm using a Google Cloud Storage bucket as the remote:

https://console.cloud.google.com/storage/browser/dvc-example-data

After the prepare and preprocess stages, the data directory looks like this:

[I] ➜ tree data
data
├── prepared
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── preprocessed
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── raw
│   ├── t10k-images-idx3-ubyte.gz
│   ├── t10k-labels-idx1-ubyte.gz
│   ├── train-images-idx3-ubyte.gz
│   └── train-labels-idx1-ubyte.gz
└── raw.dvc
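
The four raw/ files are in the classic MNIST idx-ubyte format. As a sketch of what the prepare stage could do (the helper names are hypothetical, and numpy is assumed), they can be parsed and bundled into the .npz files shown above:

```python
import gzip
import struct

import numpy as np

def read_idx(path):
    """Read an MNIST idx-ubyte file (optionally gzipped) into a numpy array.

    The idx format: a 4-byte big-endian magic number whose last byte is
    the number of dimensions, then each dimension size as a big-endian
    uint32, then the raw unsigned-byte data.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        ndim = magic & 0xFF
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

def prepare(images_path, labels_path, out_path):
    """Hypothetical prepare step: bundle images and labels into one .npz,
    mirroring the files under data/prepared/ above."""
    np.savez(out_path, images=read_idx(images_path), labels=read_idx(labels_path))
```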
  • There are 4 raw data files. I think these are better served by dvc import-url or dvc get than by a common remote. Also, instead of adding them one by one, I decided to add the whole raw/ directory to DVC.

  • The prepared/ directory contains shuffled and remixed data; preprocessed/ contains normalized and (optionally) noise-added data. These are all stages in the pipeline. I plan to track directories instead of individual files here as well.

  • There may be multiple models (one MLP, one CNN, another deep CNN ...) that can be selected in params.yaml by train.py. I think this overcomplicates the pipeline. (Which files would train.py depend on?) However, it can demonstrate the experimentation features more clearly: "We select this model with this amount of salt-and-pepper noise, and it gives us this result."

  • I plan to write individual model files for the different models. Some of them may not run on Katacoda, but overall there will be more parameters for users to try.
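
The preprocess stage described above (normalization plus optional salt-and-pepper noise) could be sketched like this; the function name, parameter names, and defaults are assumptions, not the final design:

```python
import numpy as np

def preprocess(images, noise_ratio=0.0, seed=0):
    """Normalize uint8 MNIST images to [0, 1] and optionally add
    salt-and-pepper noise: roughly `noise_ratio` of the pixels are
    forced to 0 (pepper) or 1 (salt) at random."""
    x = images.astype(np.float32) / 255.0  # astype copies, so the input is untouched
    if noise_ratio > 0:
        rng = np.random.default_rng(seed)
        mask = rng.random(x.shape) < noise_ratio
        x[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(np.float32)
    return x
```

Exposing `noise_ratio` in params.yaml is what would let the tutorial say "this much noise gives this result."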
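
One way train.py could dispatch on a model key from params.yaml is a simple registry; everything here (the keys, model names, and builder functions) is hypothetical, just to show the shape of the idea:

```python
# Stand-ins for real model builders (e.g. Keras or PyTorch models),
# one per model file as described above.
def build_mlp():
    return "mlp-model"

def build_cnn():
    return "cnn-model"

MODELS = {"mlp": build_mlp, "cnn": build_cnn}

def select_model(params):
    """Pick a model from an already-parsed params.yaml dict,
    e.g. {"model": {"name": "cnn"}}."""
    name = params["model"]["name"]
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODELS)}")
    return MODELS[name]()
```

With this layout, train.py depends only on params.yaml and the model files, which partly answers the dependency question above.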

In summary, I need a way to put the data files in a public place. I can use the bucket above, or you may prefer to keep the data where it currently is. Which way would you recommend?

@shcheklein @dberenbaum

This is related to iterative/dvc.org#1400, iterative/katacoda-scenarios#60, and probably many others :)

@iesahin iesahin changed the title Include MNIST dataset in the dataset-registry for new example-get-started Add MNIST dataset in the dataset-registry for the new example-get-started Apr 12, 2021
@iesahin iesahin changed the title Add MNIST dataset in the dataset-registry for the new example-get-started Add MNIST dataset to the dataset-registry for the new example-get-started Apr 12, 2021
@shcheklein
Member

@iesahin let's add raw data to the existing data registry.

let's push processed data, models, etc. to the public (read) remote on S3, similar to the one we use for example-get-started. I can give you access to S3 to push data there.

@shcheklein
Member

> A preliminary params.yaml looks like this:

is it related to this ticket?

@iesahin
Contributor Author

iesahin commented Apr 13, 2021

> is it related to this ticket?

Ah, right, moved to the other. Thanks. 😄

@iesahin
Contributor Author

iesahin commented Apr 21, 2021

This is closed by iterative/dataset-registry#7.

@iesahin iesahin closed this as completed Apr 21, 2021