
import-url: allow downloading directories #1861

Closed
tivvit opened this issue Apr 9, 2019 · 17 comments
Labels
feature request Requesting a new feature p2-medium Medium priority, should be done, but less important

Comments


tivvit commented Apr 9, 2019

It is possible to add a directory (dvc add ...), but it does not work for dvc import.

I tried the import with an s3 remote: it works for a single file but not for a directory. I think it should be possible to list the files and download them (it may be slow), or perhaps download the directory as a zip and unzip it (Minio supports that). Are you willing to support that feature? I can prepare the support for the s3 remote (local and ssh remotes should be quite simple too). It could be forced with some option like -r, which may not be supported for some remotes (http).
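
For illustration, here is roughly the behaviour being asked for, sketched with the aws CLI (bucket and prefix names are placeholders, not anything dvc does today):

# list every object under the prefix ...
aws s3 ls --recursive s3://mybucket/data/

# ... and download all of them, preserving the directory layout
aws s3 cp --recursive s3://mybucket/data/ data/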

@efiop efiop transferred this issue from iterative/dvc.org Apr 10, 2019

efiop commented Apr 10, 2019

Hi @tivvit !

For the record: transferred this issue from the dvc.org (our website) repo to dvc (the main DVC project repo).

Great suggestion! Could you please elaborate on what you expect from that dvc import -r? Should it add the downloaded dir as a single directory, or should it add every file inside it separately? This is basically how dvc add dir and dvc add -R dir differ:

#!/bin/bash
set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo
git init
dvc init

mkdir -p dir1/subdir
echo data > dir1/data
echo subdata > dir1/subdir/subdata
dvc add dir1
tree
# .
# ├── dir1
# │   ├── data
# │   └── subdir
# │       └── subdata
# └── dir1.dvc


mkdir -p dir2/subdir
echo data > dir2/data
echo subdata > dir2/subdir/subdata
dvc add -R dir2
tree
# .
# ├── dir1
# │   ├── data
# │   └── subdir
# │       └── subdata
# ├── dir1.dvc
# └── dir2
#     ├── data
#     ├── data.dvc
#     └── subdir
#         ├── subdata
#         └── subdata.dvc

Thanks,
Ruslan


tivvit commented Apr 10, 2019

Sorry for the misplacement.

I was thinking about the two possibilities before you asked. For my use cases, dvc add -R makes more sense, but since add supports both, import should probably support both as well.

Maybe you have some statistics on which is used more? That could suggest which option we should implement first.


efiop commented Apr 10, 2019

@tivvit No worries 🙂

We don't have exact numbers, but my personal feeling is that dvc add dir is used far more often than dvc add dir -R :) Still, implementing dvc import -R would require implementing about 90% of dvc import for directories anyway, so it can either be added right away or added later without any problems :) It is fine to only support s3 for now, if you wish to start with it. Let us know if you need any help 🙂 Thank you so much for looking into that!

@efiop efiop added the feature request Requesting a new feature label Apr 10, 2019

tivvit commented Apr 10, 2019

I will take a look at it and submit a PR.


efiop commented Apr 10, 2019

@tivvit Thanks! ❤️ Btw, we have our chat at https://dvc.org/chat, feel free to join :) Also, check out our contribution guide at https://dvc.org/doc/user-guide/contributing.

@shcheklein
Member

@tivvit could you elaborate a little bit on your dvc add -R use case, please? It's a pretty rarely used feature, and we would love to learn more about what kind of scenario it fits. Thanks for helping us with this, btw 👍.

@ghost ghost changed the title dvc import directory import: allow downloading directories Apr 10, 2019

tivvit commented Apr 23, 2019

I started implementing it last week and have also studied the architecture and the overall design. The outcome is here: tivvit@4dd1d9a, but it is not finished.

I have encountered some problems and I have questions:

  • Prefix search is possible - should I do it every time (like I do now)? I am asking because it is a big change to the API: so far the user ran dvc import s3://bucket/file and dvc imported that single file, but now it would import {"file", "file.tgz", "file.bak"}.
    • It could be marked explicitly with a / or * at the end, or dvc could autodetect whether it is a directory (and otherwise throw a "file does not exist" error).
  • I need to edit the files that are tracked by dvc; so far it was the single file added during the import. Is there a clean way to do this (I need to change the state from the storage class)? For now I have disabled the tracking, so the import is not reproducible, which is definitely not correct.

My use case (@shcheklein)

  • I want to use dvc for reproducibility, which it does great, but it forces me to change my workflow.
  • I have files on shared storage organized in folders and used by several projects.
  • I know that dvc ensures the files are obtainable with dvc pull, but the files are then usable only in one project (so they have to stay on the storage in their original form as well = duplicates). I would like to do dvc import -r, which would download and track the files. When someone else wants the files, they would run dvc pull / dvc repro, which would download the missing files from the storage (from their original location). When files are added to the storage, nothing happens; only when dvc import is called again are the new files added to the project. I know this approach is not as safe as the dvc (git-like) approach where dvc keeps the remote storage up-to-date with push and gc, but it makes more sense when the data is reused by several pipelines (a rough sketch follows this list).
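
A hypothetical sketch of that workflow, using the -r flag proposed in this thread (bucket and paths are made up, and the commands reflect the proposal rather than current dvc behaviour):

# download the shared directory from its original location and start tracking it
dvc import -r s3://shared-bucket/datasets/images data/images
git add data/images.dvc .gitignore
git commit -m "track shared images dataset"

# a collaborator later materializes the same files straight from the source bucket
dvc repro   # or dvc pull, per the proposal above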

Off-topic notes / questions

  • In connection to my workflow, I would love to have the possibility to upload the final products of the pipeline to the remote storage. I saw that it is supported, but only with third-party tools, and I would like to have it embedded in dvc since the upload functionality is already present. Why is it not there? Is it because you prefer to track the files with the cache mechanism and upload them like that? The disadvantage is that the files on the storage are named by their hashes.
  • I would like to define the pipeline by hand (a YAML with named stages and dependencies - human readable). I would just run a stage by name and dvc would compute the DAG and run everything. Basically, this only means having all (or some) of the stages that are now defined in multiple *.dvc files grouped into one file.
  • I would love to separate the file hashes into some "run report" file. This file would be used for repro but kept separately from the pipeline definition, which would then stay human-readable (editable). That would make human interaction with the pipeline possible. The "run reports" could be shared (even outside the repository) because they would contain all the information needed for reproduction.

I am saying all this because I wanted to build a tool like this myself, but then I found dvc, and it is really similar to the approach I had in mind, though some of the workflows are different. That is why I am asking why certain decisions were made (I suspect many answers will be that dvc is focused entirely on reproducibility and simple use) and whether you are willing to support such use cases in the project.


efiop commented Apr 23, 2019

Prefix search is possible - should I do it every time (like I do now)? I am asking because it is a big change to the API: so far the user ran dvc import s3://bucket/file and dvc imported that single file, but now it would import {"file", "file.tgz", "file.bak"}
It could be marked explicitly with a / or * at the end, or dvc could autodetect whether it is a directory (and otherwise throw a "file does not exist" error)

It should behave that way by default. You should use the -R option to enable such behaviour, as we've discussed above.

I need to edit the files that are tracked by dvc; so far it was the single file added during the import. Is there a clean way to do this (I need to change the state from the storage class)? For now I have disabled the tracking, so the import is not reproducible, which is definitely not correct.

Sorry, I'm not sure I follow. Could you elaborate please?

In connection to my workflow, I would love to have the possibility to upload the final products of the pipeline to the remote storage. I saw that it is supported, but only with third-party tools, and I would like to have it embedded in dvc since the upload functionality is already present. Why is it not there? Is it because you prefer to track the files with the cache mechanism and upload them like that? The disadvantage is that the files on the storage are named by their hashes.

It is simply not implemented yet. From import's perspective dvc import local s3://bucket/file is a valid command, it is just that our underlying logic doesn't yet support that.

I would like to define the pipeline by hand (a YAML with named stages and dependencies - human readable). I would just run a stage by name and dvc would compute the DAG and run everything. Basically, this only means having all (or some) of the stages that are now defined in multiple *.dvc files grouped into one file.

Yes, writing dvc files by hand is a totally valid usage.

I would love to separate the file hashes into some "run report" file. This file would be used for repro but kept separately from the pipeline definition, which would then stay human-readable (editable). That would make human interaction with the pipeline possible. The "run reports" could be shared (even outside the repository) because they would contain all the information needed for reproduction.

Dvcfiles are simple YAML and are intended to be human-readable and editable. We've discussed creating a separate place for checksums, but decided that it would only pollute the workspace and make everything much more obscure for the user, especially when merging is involved.
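
For reference, a minimal hand-written stage file is just a few YAML fields; a rough sketch (file and path names are made up, checksums are placeholders, and the exact set of fields varies between dvc versions):

cat prepare.dvc
# cmd: python prepare.py
# deps:
# - md5: 1a2b3c...            # checksum of the input, filled in by dvc
#   path: data/raw.csv
# outs:
# - md5: 4d5e6f...            # checksum of the output
#   path: data/prepared.csv
#   cache: true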

@efiop efiop self-assigned this May 16, 2019
@efiop efiop added c5-half-a-day p1-important Important, aka current backlog of things to do labels May 16, 2019

efiop commented May 16, 2019

Also related to #2012 to support importing directories from packages. Working on it...

@efiop efiop mentioned this issue Jun 3, 2019

efiop commented Jun 4, 2019

s3 support depends on #1654


efiop commented Jun 12, 2019

Also related to #2012 to support importing directories from packages. Working on it...

Not required by packages: packages support directories through a checkout trick, so no physical importing happens.

@efiop efiop removed their assignment Jun 18, 2019
@efiop efiop added p2-medium Medium priority, should be done, but less important p3-nice-to-have It should be done this or next sprint and removed p1-important Important, aka current backlog of things to do p2-medium Medium priority, should be done, but less important labels Jun 18, 2019
@shcheklein shcheklein changed the title import: allow downloading directories import-url: allow downloading directories Jul 16, 2019
@nbest937
TL;DR: my use case is that upstream Spark / AWS Glue jobs write CSV output as "directories" to S3 with one or more files within. I have no control over the file names that appear there. I would like to treat the S3 prefix as a file-tree external dependency. My workaround for now is to simply omit the -d from the dvc run ... aws s3 cp ... in my pipeline. This means that DVC will always assume that something has changed, I suppose, which is fine for now. Thanks for working on this enhancement.
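
For concreteness, that workaround might look roughly like this (bucket and paths are made up):

# no -d on the S3 prefix, so dvc cannot tell whether the upstream Glue output
# changed and treats this stage as always changed on repro
dvc run -o data/glue_output \
    "aws s3 cp --recursive s3://my-bucket/glue/job-output/ data/glue_output"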

@shcheklein shcheklein added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Jul 17, 2019
@shcheklein
Member

@efiop since a few people have asked about this, I'm raising the priority. How big is it to implement?

@nbest937 there are a few simple ad-hoc hacks possible to work around this.

You can add a stage that performs some check on the remote directory and prints the result into a file. This stage should be the first one, before the aws s3 cp one. For example, it can output the number of files in the directory (aws s3 ls ... | wc -l), or you can keep a special file inside that directory that you update every time you put more data files into it (updated.at.txt or something) and dump this file in the first stage.

This way the second stage will be executed only if the remote directory has changed.
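
A minimal sketch of that two-stage hack (bucket, file, and directory names are made up):

# stage 1: fingerprint the remote directory, e.g. by object count; having no
# dependencies, this stage is re-executed on every repro
dvc run -o s3_state.txt \
    "aws s3 ls --recursive s3://my-bucket/glue/job-output/ | wc -l > s3_state.txt"

# stage 2: depends on the fingerprint, so it only re-runs when s3_state.txt changes
dvc run -d s3_state.txt -o data/glue_output \
    "aws s3 cp --recursive s3://my-bucket/glue/job-output/ data/glue_output"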


efiop commented Jul 17, 2019

@shcheklein Not that big: all the underlying logic was already generalized for ssh support, so we just need to implement a bunch of methods like walk() and so on in RemoteS3.

@mazzma12
I have files on shared storage organized in folders and used by several projects.
I know that dvc ensures that the files are obtainable by dvc pull but the files are then usable only in one project (therefore they have to be on the storage in their original form = duplicates). I would like to do dvc import -r which will download and track the files. When someone else wants to download the files he will do dvc pull / dvc repro which will download missing files from the storage (from the original position). When files are added to the storage nothing happens. Only when dvc import is called again, new files are added to the project.

Hi, jumping on the train: I would +1 this as I think it is a common workflow. Most of the time the data is already available in an s3-like bucket. The volume is huge and new data is pushed in continuously; it is also shared between several projects. One wants to be able to import the data from the s3 source bucket into version control, but without duplicating it on the s3 file system.


verasativa commented Dec 5, 2019

I'm working on this now and hope to open a PR today (I'm online in the Discord channel).
WIP: #2894

@efiop
Copy link
Contributor

efiop commented Jan 7, 2020

Fixed by #2894

@efiop efiop closed this as completed Jan 7, 2020