list: document usage for data export/archive #1521

Closed
jorgeorpinel opened this issue Oct 22, 2019 · 19 comments
Labels: A: docs (Area: user documentation, gatsby-theme-iterative); type: enhancement (Something is not clear, small updates, improvement suggestions)

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 22, 2019

For projects created with dvc init --no-scm, since there's no Git repo to version all the files NOT tracked by DVC (code, DVC-files), it could be useful to have a dvc export <external-location> command to easily create a lightweight copy of the project (for backup). It's "lightweight" because it wouldn't include any of the data tracked by DVC.

Similar to git archive. Could even include an --archive flag to make a tar/zip bundle of the export.

In the future, similar bundling/compressing functionality for actual data sets could be reused.

Just a random idea! (It came from reading some conversations about non-Git projects on Discord.)

UPDATE: To see the latest discussion go to #1521 (comment), but in summary:

Don't need a new command for now, just document this to archive a snapshot:

git archive -o code.zip HEAD
dvc list . -R --dvc-only | zip -@ data.zip  # if `zip` available
dvc list . -R --dvc-only | xargs python -m zipfile -c data.zip  # alternative for windows (assuming `xargs` available)
@efiop
Contributor

efiop commented Oct 24, 2019

@jorgeorpinel User can do that with tar or zip no problem. Wouldn't bother with this until someone asks for this functionality and has good reasons why he can't use tar or zip 🙂 There is a special reason for git archive -- it excludes .git directory, which is somewhat useful. For us there is nothing we should exclude, so it would be the same as running tar or zip on the whole directory. Unless I'm missing something here.

@jorgeorpinel
Contributor Author

For us there is nothing we should exclude, so it would be the same as running tar or zip...

What about huge data files? I'm talking about a lightweight copy of the repo as if you just cloned it with Git (but for --no-scm projects). A backup of the DVC project without data files.

@efiop
Contributor

efiop commented Oct 24, 2019

@jorgeorpinel Ah, got it. Yes, I can now see the value there. Thanks for clarifying! 🙂 Please feel free to raise the priority if you need this feature, otherwise I would probably wait until there is a clear useful scenario in which someone will actually use this.

@jorgeorpinel
Contributor Author

No problem. Yes, I agree maybe no one really needs this haha. p3 seems correct, maybe no p at all for now. Let's wait and see. Thanks!

@jorgeorpinel
Contributor Author

jorgeorpinel commented Oct 25, 2019

p.s. another variant of this feature could be something like dvc clear to delete all linked data files from the workspace, thus producing a lightweight project (except for the cache dir.) This could easily be reverted again by dvc fetch. Let's see if anyone ever wants this in the future. 😋

@dashohoxha
Contributor

another variant of this feature could be something like dvc clear to delete all linked data files from the workspace

Another alternative solution might also be provided by dvc list (when it is implemented).
If we can list all the files that are managed by DVC, then it is possible to exclude them while making the archive.
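
A rough sketch of that exclusion approach, assuming `dvc list . -R --dvc-only` prints one tracked path per line, relative to the project root. The helper name `lightweight_archive` is hypothetical (it is not a DVC command), and the sketch also assumes the cache lives in the default `.dvc/cache` location:

```shell
#!/bin/sh
# Sketch: archive a project directory while leaving out a list of paths.
# lightweight_archive is a hypothetical helper, not part of DVC.
#
# $1 = project directory
# $2 = file with one path per line to leave out (relative to $1)
# $3 = output .tar.gz path (should live outside $1)
lightweight_archive() {
    project_dir=$1
    exclude_list=$2
    out=$3
    # -X FILE         read exclude patterns from FILE
    # --exclude=...   also skip the default DVC cache directory (assumption)
    # -C DIR          change into DIR so archived paths are relative to it
    tar -czf "$out" -X "$exclude_list" --exclude=.dvc/cache -C "$project_dir" .
}
```

It could be used as `dvc list . -R --dvc-only > /tmp/tracked.txt && lightweight_archive . /tmp/tracked.txt /tmp/project-light.tar.gz`. Note that tar treats each line as a glob pattern, so paths containing characters such as `*` or `[` would need extra care.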

shcheklein changed the title from "dvc: new export or archive command?" to "new export or archive command?" on Nov 6, 2019
@efiop
Contributor

efiop commented Nov 26, 2019

We've received a very similar question from a user https://opendatascience.slack.com/archives/CGGLZJ119/p1574762045023000?thread_ts=1574761369.020000&cid=CGGLZJ119 (Russian-only, sorry 🙁 ). Long story short, the guy is creating an archive with code and data to send to a customer, which is very similar to the idea from @jorgeorpinel described above. I've asked him to leave a comment here too.

@RomanSteinberg

Hi,
I'm the guy from the previous comment. I think dvc archive is a bad idea. I have a lot of ignored files in my repo, and I don't think it is DVC's responsibility to clear them. For example, I have .git, .dvc, and .idea in my local repo folder. So, if dvc exports all artifacts from my local repo, it will not remove all those ignored files. The second disadvantage of this approach is that I would have to sort out which artifacts I need in the archive and which I don't.

So, I would like to have something like dvc clear, which replaces all links with the original binaries stored in DVC and removes all DVC-files. I mean that DVC clears only what it is responsible for. It would be good to have it this way.

@efiop
Contributor

efiop commented Nov 26, 2019

@RomanSteinberg Thanks for your comment!

I'm pretty sure git archive will remove .git and gitignored files, leaving only the ones that are actually tracked by it. We were thinking we could work the same as that, plus also remove .dvc/, *.dvc files and replace them with actual data.

Speaking about dvc clear, I think that we should make dvc destroy do those actions (we actually have a ticket for it with that precise proposed functionality 🙂 )

jorgeorpinel changed the title from "new export or archive command?" to "New export or archive command?" on Nov 26, 2019
@jorgeorpinel
Contributor Author

jorgeorpinel commented Nov 26, 2019

Thanks guys! There's some confusion though, my original idea here is for DVC projects that DO NOT use a Git repository as base. There would be no .git/ dir or .gitignore file. dvc export would create a copy without the cache directory or any of the outputs linked in DVC-files, leaving everything else as is including .dvc/ dir and DVC-files. (Also keeping any hidden stuff like .idea/, etc. which is not DVC's responsibility indeed.)

git archive will remove .git and gitignore files, leaving only the ones that are actually tracked by it. We were thinking we could work the same as that, plus also remove .dvc/, *.dvc files and replace them with actual data.

@efiop actually I was not thinking to include the tracked data in the export. So maybe dvc export is not the best name as the Git analogy breaks... In fact I don't mean to remove DVC from the export, just to make a lightweight copy! So perhaps closer to a dvc clone?

Maybe the idea of dvc clear can replace all this though, and work both for Git and non-Git DVC projects. Let's see: basically it would move all cached objects to the workspace and delete .dvc/ and the cache dir, as well as all DVC-files, @RomanSteinberg? Kind of like DVC removing itself from the project, the opposite of dvc init?

Seems a bit risky to me, TBH. You would lose any outputs from other Git versions (not linked in checked out DVC-files) and if there's no other copy of the project, it's gone forever.

@jorgeorpinel
Contributor Author

So to summarize, we're basically talking about 2 different things:

  1. Create a lightweight copy of the project (without cache or tracked data). This is only useful for projects that don't use Git, because when you do use Git, .gitignore basically takes care of this.
  2. Remove DVC from a project (opposite of dvc init): I like this also but it should probably be a separate issue and I suggest it's done in an exported copy by default, not to risk losing possibly the only DVC project copy.

@RomanSteinberg

@jorgeorpinel I couldn't imagine that someone would use DVC without Git. I don't understand this case at all. How can one version data and not version code? So, I can't give any feedback about your idea.

@jorgeorpinel
Contributor Author

In fact without Git you could not version the data either. But still we offer dvc init --no-scm to provide the pipeline management functionality (without versioning). I think. I didn't decide this, but the fact is we have the option.

@casperdcl
Contributor

casperdcl commented Jul 2, 2020

Just to clarify, we could have dvc list which could either list:

  • data files tracked by DVC, or
  • the DVC-files

Then dvc export/archive would actually bundle either one of those into a folder/zip.

Use case: deploy releases without SCM/DVC: git archive -o code.zip HEAD && dvc archive -o data.zip (the latter being the proposed command) for upload to a customer, Zenodo, a publication, etc.

I'd agree that dvc list would be a first step. Writing an archive script would be easy after that.

@efiop
Contributor

efiop commented Jul 2, 2020

For the record: dvc list is already implemented.

@casperdcl
Contributor

iterative/dvc#4108 I see :)

@casperdcl
Contributor

Right so to archive a snapshot:

git archive -o code.zip HEAD
dvc list . -R --dvc-only | zip -@ data.zip
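
Where `zip` is not installed, the same idea can be sketched with `tar`, which can also read the names to archive from stdin. This is only a sketch: `archive_listed` is a hypothetical helper name, and it assumes the DVC-tracked files are actually present in the workspace (for example after `dvc pull`), since `dvc list` only prints their paths.

```shell
#!/bin/sh
# Sketch: bundle the paths listed on stdin (one per line) into a .tar.gz.
# archive_listed is a hypothetical helper name, not part of DVC or Git.
#
# $1 = output archive path
archive_listed() {
    out=$1
    # -T - : read the names to archive from stdin, one name per line.
    # Line-based input keeps paths that contain spaces intact, unlike
    # a plain `xargs`, which splits its input on any whitespace.
    tar -czf "$out" -T -
}
```

Run from the project root, it would be used as `dvc list . -R --dvc-only | archive_listed data.tar.gz`, alongside `git archive -o code.zip HEAD` for the code.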

@jorgeorpinel
Contributor Author

Perfect! Maybe we just need to put a note about this in the dvc list cmd ref. and link from a few more places? If so please move this issue to the docs repo. Thanks

casperdcl transferred this issue from iterative/dvc on Jul 2, 2020
casperdcl changed the title from "New export or archive command?" to "list: document usage for data export/archive" on Jul 2, 2020
shcheklein added the "A: docs" and "type: enhancement" labels on Jul 5, 2020
jorgeorpinel added a commit that referenced this issue on Jan 6, 2021
@jorgeorpinel
Contributor Author

Closed by #2075.


6 participants