This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

Maintain code / data files in scenarios from a central repository and bucket #44

Closed
iesahin opened this issue Mar 11, 2021 · 5 comments

Comments

@iesahin
Contributor

iesahin commented Mar 11, 2021

Currently, scenario initialization scripts replay the previous steps in order to obtain a project: they initialize Git and DVC, download the data, split it, prepare it, etc. Instead, we can package the end-of-scenario result in a single zip file, then download and extract it in a single step. Because of the data, we cannot put all to the assets, which have a 1MB size limit.
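A minimal sketch of the "download and extract in a single step" idea, assuming the archive is reachable over a plain URL. The function name `fetch_and_extract` and the bucket URL in the usage comment are hypothetical; nothing here is taken from the actual scenario scripts.

```python
import io
import urllib.request
import zipfile


def fetch_and_extract(url, dest):
    """Download a zip archive from `url` and unpack it into `dest`.

    Returns the list of member names found in the archive.
    """
    # Read the whole archive into memory; fine for small scenario snapshots.
    with urllib.request.urlopen(url) as response:
        payload = response.read()

    # Unpack everything under `dest` (created if it does not exist).
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Hypothetical usage; the bucket URL is a placeholder, not a real location:
# fetch_and_extract("https://<bucket-url>/project-experiments.zip", "project")
```

The archive is held in memory rather than written to a temporary file, which keeps the sketch short but would not suit large datasets.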

Is there a central location (in S3, perhaps), that I can use to upload / download these zip files? I'll put a zipped DVC repository for each of the scenarios.

@jorgeorpinel @shcheklein

@jorgeorpinel
Contributor

Yes, we have an AWS S3 bucket (used as the public remote for https://github.com/iterative/example-get-started, for example). But what's the benefit of the proposed approach? Is it to make scenario preparation faster?

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 11, 2021

we cannot put all to the assets

What other assets do we need to upload other than raw data (already on S3), prepared data, and model? Everything else should already be in Git I think. Thanks

@iesahin iesahin changed the title from "Download data and assets as a single zip file from S3" to "Maintain code / data files in scenarios from a central repository and bucket" on Mar 12, 2021
@iesahin
Contributor Author

iesahin commented Mar 12, 2021

The prepare.sh script replays the previous scenarios to reach the starting point of a scenario. This is not a big deal in the first two scenarios, but in the later scenarios it takes time. A ready-made DVC repository is quicker. The user is already waiting for the DVC and requirements installations to complete, and I would like them to wait less before they can start.

You can see the difference in https://katacoda.com/dvc/courses/get-started/experiments and https://katacoda.com/dvc/courses/get-started/params-metrics-plots.

If you can put

https://one.emresult.com/~iex/project-experiments.zip

in a bucket and let me know the URL, I'll switch to it.

There is a 1MB limit for asset files in Katacoda. The code can be put into the assets in the form of a DVC repository, which can then pull the data files. In that case, we only need to upload the data files.

@shcheklein said that maintaining these code/project files may become an issue. It's possible to use DVC itself to manage the data files, but this must be fast. DVC installation takes some time, and the time spent on preparation should be minimal.

If we prefer to manage the data with DVC (I do), some structural changes are needed in the code files. At the least, they should use dvc import instead of dvc add to track data from another repository. Data in Katacoda is shrunk because of the limitations. Maybe a branch in https://github.com/iterative/example-get-started is enough to keep track of the code, with an S3 bucket as the remote cache for the data files.
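To illustrate the difference: dvc add tracks a file that is already in the workspace, while dvc import fetches it from another DVC repository and records that source in the resulting .dvc file. The wrapper below is only a sketch; the function name, the repository URL, and the paths in the usage comment are hypothetical placeholders. The `run` parameter exists solely so the command can be checked without DVC installed.

```python
import subprocess


def import_dataset(data_repo_url, path, out, project_dir, run=subprocess.run):
    """Track `path` from another DVC repository inside `project_dir`.

    Runs `dvc import <data_repo_url> <path> -o <out>`, which downloads the
    data and writes a .dvc file that remembers where it came from.
    """
    run(
        ["dvc", "import", data_repo_url, path, "-o", out],
        cwd=project_dir,
        check=True,
    )


# Hypothetical usage; the data repository and paths are placeholders:
# import_dataset("https://github.com/<org>/<data-repo>", "data/data.xml",
#                "data/data.xml", "project")
```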

@jorgeorpinel
Contributor

jorgeorpinel commented Mar 14, 2021

All the pipeline data should already be on the example-get-started repo's remote (see all the dvc push calls in generate.sh).

If we prefer to manage the data with DVC (I do)

You can check out the corresponding tag and use dvc pull (instead of replaying the whole process). Have you tried that?
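A rough sketch of that suggestion, assuming git and dvc are already installed in the scenario environment. The function name and the tag in the usage comment are hypothetical; the actual tag names would come from the example-get-started repository. The `run` parameter is only there so the command sequence can be inspected without touching the network.

```python
import subprocess

REPO_URL = "https://github.com/iterative/example-get-started"


def restore_scenario(tag, workdir, run=subprocess.run):
    """Recreate a scenario's starting state from a Git tag plus `dvc pull`."""
    # Fetch the code and the DVC metafiles.
    run(["git", "clone", REPO_URL, workdir], check=True)
    # Move the workspace to the state recorded for this scenario.
    run(["git", "checkout", tag], cwd=workdir, check=True)
    # Download the data and model files referenced at that tag.
    run(["dvc", "pull"], cwd=workdir, check=True)


# Hypothetical usage; "<scenario-tag>" is a placeholder:
# restore_scenario("<scenario-tag>", "example-get-started")
```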

DVC installation takes some time

But you have to install dvc for the scenario(s) in question anyway, right?

@iesahin
Contributor Author

iesahin commented Apr 6, 2021

Now that the scenarios are dockerized, such a facility is no longer required. The containers already include all the needed assets and code.

Related: #49

@jorgeorpinel @shcheklein

@iesahin iesahin closed this as completed Apr 6, 2021