
Add MNIST dataset to the dataset-registry for the new example-get-started #27

Closed
iesahin opened this issue Apr 12, 2021 · 4 comments
@iesahin
Contributor

iesahin commented Apr 12, 2021

I started writing the experiments tutorial using the original MNIST dataset from deepai.org.

Currently I'm using a Google Cloud Storage bucket as the remote:

https://console.cloud.google.com/storage/browser/dvc-example-data

After the prepare and preprocess stages, the data directory looks like this:

[I] ➜ tree data
data
├── prepared
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── preprocessed
│   ├── mnist-test.npz
│   └── mnist-train.npz
├── raw
│   ├── t10k-images-idx3-ubyte.gz
│   ├── t10k-labels-idx1-ubyte.gz
│   ├── train-images-idx3-ubyte.gz
│   └── train-labels-idx1-ubyte.gz
└── raw.dvc
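
The four raw/ files are in the classic MNIST idx-ubyte format. As a sketch of what the prepare stage could do (the helper names are hypothetical, and numpy is assumed), they can be parsed and bundled into the .npz files shown above:

```python
import gzip
import struct

import numpy as np

def read_idx(path):
    """Read an MNIST idx-ubyte file (optionally gzipped) into a numpy array.

    The idx format: a 4-byte big-endian magic number whose last byte is
    the number of dimensions, then each dimension size as a big-endian
    uint32, then the raw unsigned-byte data.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        magic = struct.unpack(">I", f.read(4))[0]
        ndim = magic & 0xFF
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape)

def prepare(images_path, labels_path, out_path):
    """Hypothetical prepare step: bundle images and labels into one .npz,
    mirroring the files under data/prepared/ above."""
    np.savez(out_path, images=read_idx(images_path), labels=read_idx(labels_path))
```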
  • There are 4 raw data files. I think these are better served by dvc import-url or dvc get than by a common remote. Also, instead of adding them one by one, I decided to add the whole raw/ directory to DVC.

  • The prepared/ directory contains shuffled and remixed data; preprocessed/ contains normalized and (optionally) noise-added data. These are all stages in the pipeline. I plan to track directories instead of individual files here as well.

  • There may be multiple models (one MLP, one CNN, another deep CNN ...) that can be selected in params.yaml by train.py. I think this overcomplicates the pipeline. (Which files would train.py depend on?) However, it can demonstrate the experimentation features more clearly: "We select this model with this amount of salt-and-pepper noise, and it gives us this result."

  • I plan to write individual model files for the different models. Some of them may not run on Katacoda, but overall there will be more parameters for users to try.
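
The preprocess stage described above (normalization plus optional salt-and-pepper noise) could be sketched like this; the function name, parameter names, and defaults are assumptions, not the final design:

```python
import numpy as np

def preprocess(images, noise_ratio=0.0, seed=0):
    """Normalize uint8 MNIST images to [0, 1] and optionally add
    salt-and-pepper noise: roughly `noise_ratio` of the pixels are
    forced to 0 (pepper) or 1 (salt) at random."""
    x = images.astype(np.float32) / 255.0  # astype copies, so the input is untouched
    if noise_ratio > 0:
        rng = np.random.default_rng(seed)
        mask = rng.random(x.shape) < noise_ratio
        x[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(np.float32)
    return x
```

Exposing `noise_ratio` in params.yaml is what would let the tutorial say "this much noise gives this result."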
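
One way train.py could dispatch on a model key from params.yaml is a simple registry; everything here (the keys, model names, and builder functions) is hypothetical, just to show the shape of the idea:

```python
# Stand-ins for real model builders (e.g. Keras or PyTorch models),
# one per model file as described above.
def build_mlp():
    return "mlp-model"

def build_cnn():
    return "cnn-model"

MODELS = {"mlp": build_mlp, "cnn": build_cnn}

def select_model(params):
    """Pick a model from an already-parsed params.yaml dict,
    e.g. {"model": {"name": "cnn"}}."""
    name = params["model"]["name"]
    if name not in MODELS:
        raise ValueError(f"unknown model {name!r}; choose from {sorted(MODELS)}")
    return MODELS[name]()
```

With this layout, train.py depends only on params.yaml and the model files, which partly answers the dependency question above.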

In summary, I need a way to put the data files in a public place. I can use the bucket above, or you may prefer to keep the data where it currently is. Which way would you recommend?

@shcheklein @dberenbaum

This is related to iterative/dvc.org#1400, iterative/katacoda-scenarios#60, and probably many others :)

@iesahin iesahin changed the title Include MNIST dataset in the dataset-registry for new example-get-started Add MNIST dataset in the dataset-registry for the new example-get-started Apr 12, 2021
@iesahin iesahin changed the title Add MNIST dataset in the dataset-registry for the new example-get-started Add MNIST dataset to the dataset-registry for the new example-get-started Apr 12, 2021
@shcheklein
Member

@iesahin let's add raw data to the existing data registry.

let's push processed data, models, etc. to the public (read) remote on S3, similar to the one we use for example-get-started. I can give you access to S3 to push data there.

@shcheklein
Member

> A preliminary params.yaml looks like this:

is it related to this ticket?

@iesahin
Contributor Author

iesahin commented Apr 13, 2021

> is it related to this ticket?

Ah, right, moved to the other. Thanks. 😄

@iesahin
Contributor Author

iesahin commented Apr 21, 2021

This is closed by iterative/dataset-registry#7.

@iesahin iesahin closed this as completed Apr 21, 2021