Maintain code / data files in scenarios from a central repository and bucket #44

iesahin · 2021-03-11T10:08:40Z

Currently, scenario initialization scripts replay the previous steps to obtain a project: they initialize Git and DVC, download the data, split it, prepare it, etc. Instead, we can package the end-of-scenario result in a single zip file, then download and extract it in one step. Because of the data, we cannot put all to the assets, which have a 1MB size limit.

Is there a central location (in S3, perhaps) that I can use to upload and download these zip files? I'll put a zipped DVC repository there for each of the scenarios.
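The single-step download and extract described above could look roughly like the following. This is only a sketch under assumptions: the function name `fetch_and_extract` and the bucket URL in the usage comment are hypothetical, and the real scenario scripts may well do this in shell instead.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path
from typing import List


def fetch_and_extract(url: str, dest: Path) -> List[str]:
    """Download the zip archive at `url` and unpack it into `dest`.

    Returns the names of the archive members that were extracted.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "scenario.zip"
        # Stream the archive to disk instead of holding it all in memory.
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            shutil.copyfileobj(response, out)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
            return zf.namelist()


# Hypothetical usage (the bucket URL is a placeholder, not a real location):
# fetch_and_extract("https://<bucket>/project-experiments.zip", Path("project"))
```

The trade-off is that each scenario needs its own archive, and every archive has to be regenerated whenever the project changes.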

@jorgeorpinel @shcheklein

jorgeorpinel · 2021-03-11T19:00:55Z

Yes, we have an AWS S3 bucket (used as the public remote for https://github.com/iterative/example-get-started, for example). But what's the benefit of the proposed approach? Is it to make scenario preparation faster?

jorgeorpinel · 2021-03-11T19:02:07Z

> we cannot put all to the assets

What assets do we need to upload other than raw data (already on S3), prepared data, and the model? Everything else should already be in Git, I think. Thanks.

iesahin · 2021-03-12T14:19:48Z

The prepare.sh script replays the previous scenarios to reach the starting point of a scenario. This is not a big deal in the first two scenarios, but in the later ones it takes time. A ready-made DVC repository is quicker. The user is already waiting for the DVC and requirements installations to complete, and I would like them to wait less before they can start.

You can see the difference in https://katacoda.com/dvc/courses/get-started/experiments and https://katacoda.com/dvc/courses/get-started/params-metrics-plots.

If you can put

https://one.emresult.com/~iex/project-experiments.zip

in a bucket and let me know the URL, I'll change it.

There is a 1MB limit for asset files in Katacoda. The code can be put into the assets in the form of a DVC repository, which can then pull the data files. In that case, we only need to upload the data files.

@shcheklein said the maintenance of these code/project files may become an issue. It's possible to use DVC itself to manage the data files, but this must be fast. DVC installation takes some time, and the time spent on preparation should be minimal.

If we prefer to manage the data with DVC (I do), some structural changes are needed in the code files. At a minimum, they should use dvc import instead of dvc add, so that the data is tracked from another data repository. The data in Katacoda is shrunk because of the limitations. Maybe a branch in https://github.com/iterative/example-get-started is enough to keep track of the code, with an S3 bucket as the remote cache for the data files.
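To make the difference concrete: `dvc add` tracks a file that is already in the workspace, while `dvc import` records the source repository and path so the data can be fetched from there. The helper below only builds the `dvc import` command line and is a sketch. The function name, the data path, and the revision are hypothetical placeholders; `-o` and `--rev` are existing `dvc import` options.

```python
from typing import List, Optional


def dvc_import_command(
    repo_url: str,
    path: str,
    out: Optional[str] = None,
    rev: Optional[str] = None,
) -> List[str]:
    """Build the argv for `dvc import <repo_url> <path> [-o out] [--rev rev]`."""
    cmd = ["dvc", "import", repo_url, path]
    if out is not None:
        # Where to place the imported data in the current project.
        cmd += ["-o", out]
    if rev is not None:
        # Branch, tag, or commit of the source repository to import from.
        cmd += ["--rev", rev]
    return cmd


# Hypothetical usage: the path and branch name are placeholders.
# import subprocess
# subprocess.run(
#     dvc_import_command(
#         "https://github.com/iterative/example-get-started",
#         "data/data.xml",
#         rev="<katacoda-branch>",
#     ),
#     check=True,
# )
```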

jorgeorpinel · 2021-03-14T00:51:43Z

All the pipeline data should be up on the example-get-started repo remote (see all the dvc pushes in generate.sh).

> If we prefer to manage the data with DVC (I do)

You can check out the corresponding tag and use dvc pull (instead of replaying the whole process). Have you tried that?

> DVC installation takes some time

But you have to install DVC for the scenario(s) in question anyway, right?
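The tag-and-pull suggestion above might be sketched as follows. The function names and the tag value are hypothetical. The sketch assumes Git and DVC are already installed and that the tag's data has been pushed to the repository's default remote.

```python
import subprocess
from typing import List, Optional, Tuple

Step = Tuple[List[str], Optional[str]]  # (argv, working directory)


def prepare_steps(repo_url: str, tag: str, workdir: str) -> List[Step]:
    """List the commands that bring `workdir` to the state recorded at `tag`."""
    return [
        (["git", "clone", repo_url, workdir], None),
        # Check out the tag that marks the start of the scenario.
        (["git", "checkout", tag], workdir),
        # Fetch the data tracked at that tag from the DVC remote.
        (["dvc", "pull"], workdir),
    ]


def run_steps(steps: List[Step]) -> None:
    """Run each step in order, stopping at the first failure."""
    for argv, cwd in steps:
        subprocess.run(argv, cwd=cwd, check=True)


# Hypothetical usage: "<scenario-tag>" is a placeholder, not a real tag name.
# run_steps(prepare_steps(
#     "https://github.com/iterative/example-get-started",
#     "<scenario-tag>",
#     "project",
# ))
```

Compared with replaying earlier scenarios, this only downloads the outputs, so the preparation time depends mostly on the data size rather than on how many steps precede the scenario.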

iesahin · 2021-04-06T12:10:54Z

Now that the scenarios are dockerized, such a facility is no longer required. The containers already include all the needed assets and code.

Related: #49

@jorgeorpinel @shcheklein

iesahin changed the title ~~Download data and assets as a single zip file from S3~~ Maintain code / data files in scenarios from a central repository and bucket Mar 12, 2021

iesahin closed this as completed Apr 6, 2021
