output/dependency/remote: support directories for remotes other than local #1654
Comments
My team and I really need support for defining directories as outputs in our setup, where we use an SSH remote. It would simplify several of our DVC stages. I hope this feature comes to SSH soon😁 |
@efiop, Is this feature suited for a pull request? Because this is the one feature I need the most. Any recommendations on how to begin on the feature implementation for SSH remote, such as code files, class, function, etc? |
@PeterFogh I have a patch for supporting it for all remotes lying around in a half-completed state. Let me see if I can speed it up. |
@efiop, that sounds great. I’m just in the mood for contributing to an open source project this weekend😁 Maybe I can find another (and maybe simpler) issue. |
@PeterFogh Sorry for stealing it 😉 Please take a look at |
@efiop, no pressure, but do you think this feature will be released within the next week? I can use your estimate for the planning for my team :) |
@PeterFogh, it is on the "Weekly tasks", so I'm sure @efiop is considering working on this. Although I can't guarantee it will be released within the next week, it will at least be in WIP status :) |
@PeterFogh I am working on this. The patch as a whole is pretty big, I'm in the middle of fixing tests for it (have some other tasks too). Need to fix the rest of the tests, rebase on top of master and submit it. I expect it to be ready during the next week max. Sorry for the delay. |
@PeterFogh Really pushing for it to be ready by Thursday. Thank you for your patience. |
@PeterFogh Actually #1614 has just been released in 0.35.5, please feel free to upgrade. #1654 is coming soon 🙂 |
I'd also appreciate S3 directory caching support. My use case is an EMR cluster using Spark to write many GBs of partitioned CSVs to S3 so the data can be copied more easily to AWS Redshift. |
As pointed out by @Suor, we need to check whether s3://bucket/path and s3://bucket/path/ can exist at the same time. |
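To illustrate the concern above: S3 keys are flat strings, so an object named `path` and objects under the prefix `path/` can coexist in the same bucket. The snippet below is a minimal sketch that models a bucket as a plain set of key strings; `classify` and the sample keys are hypothetical names for illustration, not DVC or boto3 API.

```python
def classify(keys, path):
    """Return (is_object, is_prefix) for `path` in a flat key namespace.

    `keys` is a set of object keys, standing in for a bucket listing.
    """
    path = path.rstrip("/")
    # An exact key match means `path` is a plain object ("file").
    is_object = path in keys
    # Any key starting with "path/" means `path` also acts as a "directory".
    is_prefix = any(key.startswith(path + "/") for key in keys)
    return is_object, is_prefix


# Hypothetical bucket contents: "path" exists both as an object and as a prefix.
keys = {"path", "path/part-0000.csv", "path/part-0001.csv", "other/file"}

print(classify(keys, "path"))
print(classify(keys, "other"))
```

When both flags are true, a tool that treats `path` as a directory output has to decide what to do with the same-named object, which is the case worth checking.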
What about Azure Blob Storage? E.g. files in ADLSGen2. |
Oh. And should some docs be updated already to let people know SSH directories are supported as external dependencies and outputs? P.S. I noticed Azure is not mentioned in our External Dependencies guide, but it's a supported remote. Is the doc incomplete? Thanks! |
We only support it through blob API, there is no native support for it at all right now. Plus, as described in that conversation, ETags on Azure don't seem to be consistent enough to be used as a hash for external outputs.
Good point. I guess we can include it in the iterative/dvc.org#411 ticket?
Azure is not supported for external dependencies and outputs. So the doc is fine. |
OK, thanks. Added note to update docs to iterative/dvc.org#411 (comment). |
Hi, does dvc currently support external storage for s3? It returns "failed to add file - output 's3://dvcbucket/mydata' does not exist" after I run the command "dvc add s3://dvcbucket/mydata". However, I did create the bucket "dvcbucket" and the folder "mydata" beforehand. |
Hello, @loche415, I'm working on supporting directories for S3 outputs and dependencies. There's already a work in progress; it should be ready by tomorrow :) |
Thanks for your hard work @MrOutis and looking forward to this new feature. |
@matt-miller-virginia @loche415 Hi guys! Support for s3 directories is already released in 0.66.0, please upgrade and give it a try :) Big thanks to @MrOutis for implementing it! 🙏 And thank you guys for the feedback! 🙂 |
For the record, s3 was added to the doc ticket too iterative/dvc.org#411 |
Closing for now, as no one is asking for gs support and hdfs is not possible to support right now (see the comment in the issue description). |
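The hdfs limitation mentioned here (CRC-based checksums colliding once the cache holds many files) can be illustrated with the standard birthday-bound approximation. This is a rough sketch, assuming a 32-bit checksum space with uniformly distributed values; the thread only says the checksum is CRC-based, so the bit width and the function name `collision_probability` are assumptions made for illustration.

```python
import math


def collision_probability(n_files, bits=32):
    """Approximate chance that at least two of n_files share a checksum.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming checksums are uniformly distributed (an idealization).
    """
    space = 2.0 ** bits
    # -expm1(-x) computes 1 - exp(-x) accurately even for tiny x.
    return -math.expm1(-n_files * (n_files - 1) / (2.0 * space))


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(collision_probability(n), 4))
```

Under these assumptions a collision becomes more likely than not somewhere below 100,000 files, whereas a 128-bit hash such as MD5 keeps the probability negligible at the same scale.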
Currently we only support directories for local remotes. Need to add support for:
hdfs: Can't support because the checksum is based on CRCs, which have a significant chance of overlapping when you have many files in cache, which makes it unusable for dirs.
gs: No one asked for this. Now someone did ask: gs: support directories as external dependencies/outputs #2814
Most likely, this could be achieved in a single pass by unifying things like RemoteLOCAL.load_dir_cache into RemoteBase.
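The unification idea in the last sentence could look roughly like the sketch below: the directory-listing logic lives once in a base class, and each remote only supplies a way to read a cache entry. All class and method names here (`RemoteBaseSketch`, `InMemoryRemoteSketch`, `read_text`) are hypothetical, and the JSON layout with `md5` and `relpath` fields is an assumption for illustration. This is not DVC's actual implementation.

```python
import json
from abc import ABC, abstractmethod


class RemoteBaseSketch(ABC):
    """Hypothetical base class holding the shared directory-cache logic."""

    DIR_SUFFIX = ".dir"  # assumed marker for directory checksums

    @abstractmethod
    def read_text(self, cache_key):
        """Return the text stored under cache_key; remote-specific."""

    def load_dir_cache(self, checksum):
        """Load a directory listing once, for every remote type."""
        if not checksum.endswith(self.DIR_SUFFIX):
            raise ValueError(f"not a directory checksum: {checksum}")
        entries = json.loads(self.read_text(checksum))
        # Sort by relative path so the listing is deterministic.
        return sorted(entries, key=lambda entry: entry["relpath"])


class InMemoryRemoteSketch(RemoteBaseSketch):
    """Stand-in remote backed by a dict, used only to exercise the sketch."""

    def __init__(self, blobs):
        self.blobs = dict(blobs)

    def read_text(self, cache_key):
        return self.blobs[cache_key]


listing = json.dumps([
    {"md5": "bbb", "relpath": "b.csv"},
    {"md5": "aaa", "relpath": "a.csv"},
])
remote = InMemoryRemoteSketch({"abc123.dir": listing})
print(remote.load_dir_cache("abc123.dir"))
```

With this shape, an SSH, S3, or GS remote would only need its own `read_text`, which is what would make a single-pass change plausible.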