hdfs: Specify a directory as output/input of a pipeline stage #1083

Closed
helderm opened this issue Sep 3, 2018 · 2 comments
Labels
enhancement Enhances DVC
Milestone

Comments


helderm commented Sep 3, 2018

Allowing a directory instead of a single file as the input/output of a stage would make DVC much easier to use with large datasets. A common use case is an HDFS remote: datasets are usually partitioned into many files so they can be read and written in parallel, but AFAIK with the current implementation of DVC I would have to specify every single file with -d or -o, and there can be hundreds of them.
If the remotes were made directory-aware, the .dvc file could store a list of checksums, one for each file in the directory.
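
A minimal sketch of that idea, assuming MD5 as the checksum and a hypothetical `{relpath, md5}` entry layout. It is not DVC's actual .dvc format or API, and it walks a local directory rather than HDFS:

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of one file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_checksums(root):
    """Build one entry per file under root, sorted by path.

    The entry layout (relpath + md5) is an assumption made for
    illustration, not the format DVC writes.
    """
    root = Path(root)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            "relpath": path.relative_to(root).as_posix(),
            "md5": file_md5(path),
        })
    return entries
```

With a list like this, a stage would only need to name the directory once, and a change to any partition file would show up as a changed entry.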

Contributor

efiop commented Sep 3, 2018

Hi @helderm !

DVC currently supports only local directories as dependencies (-d dirname) and outputs (-o dirname) for dvc run, and they are also supported by dvc add dirname. Unfortunately, as of right now, directories are not supported for the other remote types (s3, gs, hdfs, ssh, azure) in external output and external dependency scenarios; only local works there. We will raise the priority of this feature and try to squeeze it into the next release or the one after that. ETA is the end of this week.

Thank you for your feedback!

-Ruslan

efiop self-assigned this Sep 3, 2018
efiop added the enhancement (Enhances DVC) label Sep 3, 2018
efiop added this to the Queue milestone Sep 3, 2018
efiop changed the title from "Specify a directory as output/input of a pipeline stage" to "hdfs: Specify a directory as output/input of a pipeline stage" Nov 24, 2018
efiop removed their assignment Nov 24, 2018
Contributor

efiop commented Feb 4, 2019

Closing in favor of #1275

efiop closed this as completed Feb 4, 2019