output/dependency/remote: support directories for remotes other than local #1654

Closed
2 of 4 tasks
efiop opened this issue Feb 22, 2019 · 30 comments
Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@efiop
Contributor

efiop commented Feb 22, 2019

Currently we only support directories for local remotes. Need to add support for:

Most likely, this could be achieved in a single pass by unifying things like RemoteLOCAL.load_dir_cache into RemoteBase.
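For illustration only, here is a minimal sketch of what moving directory handling into a shared base class could look like. It is not DVC's actual code: `RemoteBaseSketch`, `InMemoryRemoteSketch`, `walk_files` and `dir_checksum` are hypothetical names, and the hashing scheme is an assumption. The idea is that each remote only has to list its files, while the directory-level logic lives in one place.

```python
import hashlib
import json
from abc import ABC, abstractmethod


class RemoteBaseSketch(ABC):
    """Hypothetical base class; not DVC's actual RemoteBase."""

    @abstractmethod
    def walk_files(self, path):
        """Yield (relative_path, file_checksum) pairs for files under `path`."""

    def dir_checksum(self, path):
        # Shared logic: sort the entries so the result does not depend on
        # listing order, then hash the serialized listing.
        entries = sorted(self.walk_files(path))
        payload = json.dumps(entries).encode("utf-8")
        return hashlib.md5(payload).hexdigest()


class InMemoryRemoteSketch(RemoteBaseSketch):
    """Hypothetical stand-in for a concrete remote (local, SSH, S3, ...)."""

    def __init__(self, files):
        self.files = files  # maps full path to file contents (bytes)

    def walk_files(self, path):
        prefix = path.rstrip("/") + "/"
        for full_path, data in self.files.items():
            if full_path.startswith(prefix):
                yield full_path[len(prefix):], hashlib.md5(data).hexdigest()
```

With this split, a new remote would only need to implement `walk_files`; the directory checksum itself would not be reimplemented per remote.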

@PeterFogh
Contributor

My team and I really need support for defining directories as outputs in our setup, where we use an SSH remote. It would simplify several of our DVC stages.

I hope this feature comes to SSH soon 😁

@PeterFogh
Contributor

@efiop, is this feature suitable for a pull request? It is the one feature I need the most. Any recommendations on where to begin implementing it for the SSH remote, such as code files, classes, functions, etc.?

@efiop
Contributor Author

efiop commented Mar 15, 2019

@PeterFogh I have a patch for supporting it for all remotes lying around in a half-completed state. Let me see if I can speed it up.

@efiop efiop self-assigned this Mar 15, 2019
@PeterFogh
Contributor

@efiop, that sounds great. I’m just in the mood for contributing to an open source project this weekend 😁 Maybe I can find another (and maybe simpler) issue.
I’m looking forward to seeing your solution for this one.

@efiop
Contributor Author

efiop commented Mar 15, 2019

@PeterFogh Sorry for stealing it 😉 Please take a look at the help wanted and good first issue labels on our issues list if you still feel like contributing 🙂 Let me know if you need any suggestions.

@shcheklein shcheklein added the p1-important Important, aka current backlog of things to do label Mar 19, 2019
@PeterFogh
Contributor

@efiop, no pressure, but do you think this feature will be released within the next week? I can use your estimate in planning for my team :)

@ghost

ghost commented Mar 26, 2019

@PeterFogh, it is on the "Weekly tasks" list, so I'm sure @efiop is planning to work on this. I can't guarantee it will be released next week, but it will at least be in WIP status :)

@efiop
Contributor Author

efiop commented Mar 26, 2019

@PeterFogh I am working on this. The patch as a whole is pretty big, and I'm in the middle of fixing tests for it (I have some other tasks too). I need to fix the rest of the tests, rebase on top of master and submit it. I expect it to be ready within the next week at most. Sorry for the delay.

@PeterFogh
Contributor

Hi, @MrOutis, @efiop. Thanks for your replies. It sounds awesome and I'm glad to hear that it may be released next week because it fits well with our planning at my end.

@efiop
Contributor Author

efiop commented Apr 8, 2019

@PeterFogh Really pushing for it to be ready by Thursday. Thank you for your patience.

@PeterFogh
Contributor

@efiop sounds awesome. I'm really looking forward to the next release, especially if it includes both #1654 and #1614. Patience is always rewarding! 😄

@efiop
Contributor Author

efiop commented Apr 9, 2019

@PeterFogh Actually #1614 has just been released in 0.35.5, please feel free to upgrade. #1654 is coming soon 🙂

@matt-miller-virginia

I'd also appreciate S3 directory caching support. My use case is an EMR cluster using Spark to write many GBs of partitioned CSVs to S3 so the data can be copied more easily to AWS Redshift.

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important c8-full-day labels Sep 11, 2019
@efiop
Contributor Author

efiop commented Sep 17, 2019

As pointed out by @Suor, we need to check whether you could have s3://bucket/path and s3://bucket/path/ at the same time.
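Since S3 is a flat key namespace, an object named `path` and objects under the `path/` prefix can coexist in the same bucket. A minimal sketch of such a check follows; it runs against a plain set of key names rather than a real bucket, and `classify_s3_path` is a hypothetical helper, not part of DVC or any S3 client library.

```python
def classify_s3_path(keys, path):
    """Return (is_object, is_prefix) for `path` given a bucket's key names.

    `keys` is any collection of object keys; in a real setting they would
    come from a bucket listing.
    """
    path = path.rstrip("/")
    is_object = path in keys
    is_prefix = any(key.startswith(path + "/") for key in keys)
    return is_object, is_prefix


# A key "path" and a key under "path/" can exist side by side.
keys = {"path", "path/part-0.csv", "other/data.csv"}
print(classify_s3_path(keys, "path"))  # (True, True)
```

If both flags are true, treating `path` as either a file or a directory is ambiguous, which is the case this check is meant to surface.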

@efiop efiop removed their assignment Sep 23, 2019
@ghost ghost self-assigned this Oct 3, 2019
@jorgeorpinel
Contributor

jorgeorpinel commented Oct 8, 2019

What about Azure Blob Storage? E.g. files in ADLSGen2.

Found this issue due to a recent Discord conversation.

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 8, 2019

Oh. And should some docs be updated already to let people know SSH directories are supported as external dependencies and outputs?

P.S. I noticed Azure is not mentioned in our External Dependencies guide, but it's a supported remote. Is the doc incomplete?

Thanks!

@efiop
Contributor Author

efiop commented Oct 9, 2019

@jorgeorpinel

What about Azure Blob Storage? E.g. files in ADLSGen2.

We only support it through the blob API; there is no native support for it at all right now. Plus, as described in that conversation, ETags on Azure don't seem to be consistent enough to be used as a hash for external outputs.

Oh. And should some docs be updated already to let people know SSH directories are supported as external dependencies and outputs?

Good point. I guess we can include it in the iterative/dvc.org#411 ticket?

P.S. I noticed Azure is not mentioned in our External Dependencies guide, but it's a supported remote. Is the doc incomplete?

Azure is not supported for external dependencies and outputs. So the doc is fine.

@jorgeorpinel
Contributor

jorgeorpinel commented Oct 9, 2019

OK, thanks. Added a note about updating the docs to iterative/dvc.org#411 (comment).

@loche415

Hi, does dvc currently support external storage for s3? It returns "failed to add file - output 's3://dvcbucket/mydata' does not exist" after I run the command "dvc add s3://dvcbucket/mydata". However, I did create the bucket "dvcbucket" and the folder "mydata" beforehand.
Thanks.

@ghost

ghost commented Oct 22, 2019

Hello, @loche415, I'm working on supporting directories for S3 outputs and dependencies. There's already a work in progress; it should be ready by tomorrow :)

@loche415

Thanks for your hard work, @MrOutis. I'm looking forward to this new feature.

@efiop
Contributor Author

efiop commented Oct 25, 2019

@matt-miller-virginia @loche415 Hi guys! Support for s3 directories is already released in 0.66.0, please upgrade and give it a try :) Big thanks to @MrOutis for implementing it! 🙏 And thank you guys for the feedback! 🙂

@efiop
Contributor Author

efiop commented Oct 25, 2019

For the record, s3 was added to the doc ticket too: iterative/dvc.org#411

@efiop
Contributor Author

efiop commented Nov 12, 2019

Closing for now, as no one is asking for gs support and hdfs is not possible to support right now (see the comment in the issue description).

6 participants