Define new DVC remote using pre-defined DVC remote via remote://
#1614
@PeterFogh hi, Peter! I'm just a little curious about your use case (sorry if you already answered the same questions somewhere on Discord, it's just hard to find and track everything that happened there). So, as far as I understand, you basically run everything over SSH, but from your local machine, is that correct? I'm just trying to understand what the benefits are of running everything from the local machine instead of just SSH'ing in and running the pipeline on the remote machine itself.
Hi, @shcheklein. I'm glad to elaborate on our use case of DVC. We are a small data science team in a Danish company. Each team member has a local machine (laptop/desktop -> thus low compute power) and we have an on-premise compute server with large memory capacity, large data storage, and a dask scheduler and workers. With this compute server we can launch distributed computing tasks, from our local machines via dask from python scripts and notebooks. The benefit of our setup is:
I hope this description answers your questions.
@PeterFogh thanks! It makes a lot of sense. How do you plan to use DVC in this workflow? What problem are you trying to solve with DVC in the configuration you described?
@shcheklein. The problem I want to solve is to use a global remote to specify a local remote. The setup would then let each team member define their global remote with their own SSH user credentials, while each project (i.e. git repository) defines a project-specific remote SSH cache. I expect the DVC command to look like the following:
I'm sorry if this is just a repetition of the first post, but this is the problem the new feature would solve. The key is that my team members need to use the global config, as their SSH credentials are not the same, and that I wish to have one cache folder per project whose configuration is stored in Git.
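To make the idea concrete, here is a minimal Python sketch of how a `remote://<name>/<path>` URL could be expanded against a set of named remotes. It is only an illustration of the concept, not DVC's actual implementation: the function `resolve_remote_url`, the remote names `compute_server` and `project_cache`, and the address `alice@example.com` are all hypothetical.

```python
from urllib.parse import urlparse


def resolve_remote_url(url, remotes, _seen=frozenset()):
    """Expand remote://<name>/<path> using the base URL stored under <name>.

    `remotes` maps remote names to URLs. This is an illustrative sketch,
    not how DVC resolves remotes internally.
    """
    parsed = urlparse(url)
    if parsed.scheme != "remote":
        return url  # already a concrete URL such as ssh://...

    name = parsed.netloc
    if name in _seen:
        raise ValueError(f"circular remote reference: {name}")
    if name not in remotes:
        raise KeyError(f"unknown remote: {name}")

    # The referenced remote may itself be defined via remote://, so recurse.
    base = resolve_remote_url(remotes[name], remotes, _seen | {name})
    return base.rstrip("/") + "/" + parsed.path.lstrip("/")


# Hypothetical per-user global config: holds the SSH credentials.
global_remotes = {"compute_server": "ssh://alice@example.com"}
# Hypothetical per-project config stored in Git: no credentials inside.
project_remotes = {"project_cache": "remote://compute_server/dvc_project_cache"}

merged = {**global_remotes, **project_remotes}
resolved = resolve_remote_url(merged["project_cache"], merged)
print(resolved)
```

In this sketch the project config never mentions a user name, so each team member can keep a different `compute_server` entry in their own global config while sharing the same project-level definition.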
Thanks @PeterFogh! I think I understand the idea with multiple independent remote SSH workspaces. What I'm trying to understand better is how you see the value of DVC in your workflow in general. This question is not even directly related to this feature request, but it might help me better understand how this should be implemented (and again, sorry for this question if you answered it before on Discord).
Hi @shcheklein. I'm not sure what you want answers to. How about we have a voice chat on Discord? I am available Monday, 11 March, central time, between 04:00 and 11:00 - see https://www.worldtimebuddy.com/?qm=1&lid=5391959,6,2624652&h=2624652&date=2019-3-11&sln=10-18.
Hi @shcheklein and @efiop, it was nice to see you all during our video call last Monday. I understood that my team's use of DVC with Dask was not intuitive to you, so I have created this repository, https://github.com/PeterFogh/dvc_dask_use_case, which explains the setup and contains an example of a DVC pipeline using remote computation with Dask. I'm not sure if you can use the GitHub repo for anything, but at least I can use it for testing newer/development versions of DVC against our DVC/Dask setup :)
@PeterFogh 0.35.5 with this feature is released. Feel free to give it a try 🙂
Hi @efiop and @shcheklein. Sorry for the delay on my part, but I have now implemented the use of this new feature, "DVC remote via global remote", in my use case repository at https://github.com/PeterFogh/dvc_dask_use_case and it works like a charm 👍 🎉 Good work. As you can see there, my team and I can now have our individual global remotes define our SSH credentials once, and then specify the project-specific remote cache and the project/user-specific remote data location (i.e. what I call the "DVC workspace") in each git repository. Again, thanks a lot for your interest in discussing and implementing this feature :)
Hi, I want to perform the following configuration, as discussed in this Discord message and the following 10-20 messages:
This would enable each team member to have one global remote to connect to the compute server and one DVC-repository-specific (i.e. project-specific) remote SSH cache. It also enables the compute server to have one cache folder per project, which simplifies dvc gc, i.e. cache garbage collection.