import-url: allow downloading directories #1861
Comments
Hi @tivvit! For the record: Transferred this issue from Great suggestion! Could you please elaborate on what you expect from that
Thanks, |
Sorry for the misplacement. I was thinking about the two possibilities before you asked. For my use cases the Maybe you have some statistics on which is used more? That could suggest which option we should implement first. |
@tivvit No worries 🙂 We don't have exact numbers, but from my personal feel |
I will take a look at it and submit a PR |
@tivvit Thanks! ❤️ Btw, we have our chat at https://dvc.org/chat , feel free to join :) Also, check out our contribution guide at https://dvc.org/doc/user-guide/contributing . |
@tivvit could you elaborate a little bit on your |
I started implementing it last week and have also studied the architecture and the ideology. The outcome is here: tivvit@4dd1d9a, but it is not finished. I have encountered some problems and have some questions:
My use case (@shcheklein)
Off-topic notes / questions
I am saying all this because I wanted to build a tool like this, but then I found dvc, and it is really similar to the approach I had in my head, although some of the workflows are different. Therefore I am asking why some decisions were made (I think many answers will be that dvc is totally focused on reproducibility and simple use) and whether you are willing to support such use cases in the project. |
It should behave that way by default. You should use
Sorry, I'm not sure I follow. Could you elaborate please?
It is simply not implemented yet. From import's perspective
Yes, writing dvc files by hand is a totally valid usage.
Dvcfiles are in simple yaml format and are intended to be human-readable and editable. We've discussed creating a separate place for checksums, but decided that it would only pollute the workspace and make everything much more obscure for the user, especially when merging is involved. |
Also related to #2012 to support importing directories from packages. Working on it... |
s3 support depends on #1654 |
Not required by packages. Packages support directories through |
TL;DR. My use case is that upstream Spark / AWS Glue jobs write CSV output as "directories" to S3 with one or more files within. I have no control over the file names that appear there. I would like to treat the S3 prefix as a file tree external dependency. My workaround for now is to simply omit the |
@efiop since a few people asked about this, raising the priority up. How big is it to implement? @nbest937 a few simple ad-hoc hacks are possible to work around this. You can come up with a stage that performs a certain check on the remote directory and prints the results into a file. This stage should be the first one. Before the This way the second stage will be executed only if the remote directory changed. |
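The workaround described above could be sketched roughly as follows. This is only an illustration, not DVC code: `render_manifest`, `write_manifest`, the `(key, size, etag)` entry layout and the `listing.txt` file name are all assumptions made for the example, and fetching the listing from the remote is left out. The idea is that the first stage writes a deterministic summary of the remote directory to a file, so the file content changes only when the remote directory does.

```python
from pathlib import Path
from typing import Iterable, Tuple

# Hypothetical entry layout: (key, size_in_bytes, etag) for each remote object.
Entry = Tuple[str, int, str]


def render_manifest(entries: Iterable[Entry]) -> str:
    """Return one line per remote object, sorted so listing order is irrelevant."""
    lines = [f"{key}\t{size}\t{etag}" for key, size, etag in sorted(entries)]
    return "\n".join(lines) + "\n"


def write_manifest(entries: Iterable[Entry], path: str) -> bool:
    """Write the manifest to `path`; return True if its content changed."""
    target = Path(path)
    new_text = render_manifest(entries)
    old_text = target.read_text() if target.exists() else None
    if new_text == old_text:
        return False  # leave the file untouched so its checksum stays the same
    target.write_text(new_text)
    return True


if __name__ == "__main__":
    # Assumed sample listing; a real stage would obtain it from the remote.
    listing = [("out/part-0001.csv", 2048, "etag-b"), ("out/part-0000.csv", 1024, "etag-a")]
    print(write_manifest(listing, "listing.txt"))
```

The second stage would then declare `listing.txt` as a dependency, so it reruns only when the manifest differs from the previous run.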
@shcheklein Not that big, all the underlying logic was already generalized for ssh support, so just need to implement a bunch of methods like |
Hi, jumping on the train, I would +1 this as I think it is a common workflow. Most of the time the data is already available in an s3-like bucket. The volume is huge, new data is streamed in, and the data is also shared between several projects. One wants to be able to import the data from the s3 source bucket into version control, but without duplicating it on the s3 file system. |
I'm working on this now, hope to PR today (I'm online on the Discord channel) |
Fixed by #2894 |
It is possible to add a directory (`dvc add ...`), but it does not work for `dvc import`. I tried the import for an s3 remote, and it is possible for a single file but not for a directory. I think it should be possible to list the files and download them (it may be slow), or maybe download the directory as a zip and unzip it (this is supported in Minio). Are you willing to support that feature? I could prepare the support for the s3 remote (local and ssh remotes should be quite simple too). It could be enabled with an option like `-r`, which may not be supported for some remotes (http).
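The "list the files and download them" approach suggested here might look roughly like the sketch below. It is written under assumptions and is not DVC's implementation: `download_prefix`, `list_keys` and `fetch` are hypothetical names, and the two callables stand in for whatever a given remote (s3, ssh, local) offers for listing keys under a prefix and reading a single object.

```python
import os
from typing import Callable, Iterable, List


def download_prefix(
    prefix: str,
    dest: str,
    list_keys: Callable[[str], Iterable[str]],  # hypothetical: yields keys under a prefix
    fetch: Callable[[str], bytes],              # hypothetical: returns one object's bytes
) -> List[str]:
    """Mirror every object under `prefix` into the local directory `dest`."""
    base = prefix.rstrip("/") + "/"
    written = []
    for key in list_keys(base):
        # Ignore keys outside the prefix and "directory marker" placeholders.
        if not key.startswith(base) or key.endswith("/"):
            continue
        rel = key[len(base):]
        local_path = os.path.join(dest, *rel.split("/"))
        os.makedirs(os.path.dirname(local_path) or ".", exist_ok=True)
        with open(local_path, "wb") as fobj:
            fobj.write(fetch(key))
        written.append(local_path)
    return written
```

Downloading one object at a time is the slow path mentioned above; for S3, the listing callable could wrap a paginated `list_objects_v2` call, while a remote without any listing operation (plain http) could not supply `list_keys` at all.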