Define new DVC remote using pre-define DVC remote via remote:// #1614

Closed
PeterFogh opened this issue Feb 15, 2019 · 9 comments
Labels
enhancement (Enhances DVC), feature request (Requesting a new feature), p1-important (Important, aka current backlog of things to do)

Comments

@PeterFogh
Contributor

Hi, I want to perform the following configuration, as discussed at this discord message and the following 10-20 messages:

dvc remote add compute_server ssh://[compute_server_IP]/ --global
dvc remote modify compute_server user [OUR_USERNAME] --global
dvc remote modify compute_server port 22 --global
dvc remote modify compute_server keyfile [PATH_TO_OUR_SSH_KEY] --global
dvc remote add compute_server_project_cache remote://compute_server/dvc_project_cache
dvc config cache.ssh compute_server_project_cache

This would enable each team member to have one global remote for connecting to the compute server and one repository-specific (i.e. project-specific) remote SSH cache. It also enables the compute server to have one cache folder per project, which simplifies dvc gc, i.e. cache garbage collection.
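The requested `remote://` scheme can be read as "take the URL of the named remote and append this path". The following is a minimal Python sketch of that resolution, not DVC's actual implementation: the function name `resolve_remote_url`, the dictionary layout, the IP address, and the user name are all hypothetical placeholders.

```python
from urllib.parse import urlparse


def resolve_remote_url(url, remotes):
    """Expand remote://<name>/<path> using the URL of the remote <name>.

    `remotes` maps remote names to settings dicts that contain a "url" key.
    Illustrative sketch only; this is not how DVC is known to implement it.
    """
    parsed = urlparse(url)
    if parsed.scheme != "remote":
        return url  # already a concrete URL, nothing to expand
    base = remotes[parsed.netloc]["url"]
    # Resolve again in case the base is itself a remote:// reference.
    base = resolve_remote_url(base, remotes)
    return base.rstrip("/") + "/" + parsed.path.lstrip("/")


remotes = {
    # Placeholder values standing in for each user's global settings.
    "compute_server": {"url": "ssh://192.0.2.10/", "user": "alice", "port": 22},
    # Project-level remote defined relative to the global one.
    "compute_server_project_cache": {
        "url": "remote://compute_server/dvc_project_cache"
    },
}

resolved = resolve_remote_url(
    remotes["compute_server_project_cache"]["url"], remotes
)
print(resolved)
```

In this sketch the project remote carries only a path, so the host and credentials stay with the per-user remote it refers to.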

@efiop added the enhancement and feature request labels on Feb 15, 2019
@shcheklein
Member

@PeterFogh hi, Peter! I'm just a little curious about your use case (sorry if you already answered the same questions somewhere on Discord, it's just hard to find and track everything that happened there). So, as far as I understand, you basically run everything over SSH, but from your local machine, is that correct? I'm just trying to understand what the benefits are of running everything from the local machine instead of just SSH'ing in and running the pipeline on the remote machine itself.

@PeterFogh
Contributor Author

PeterFogh commented Mar 1, 2019

Hi, @shcheklein. I'm glad to elaborate on our use case of DVC.

We are a small data science team in a Danish company. Each team member has a local machine (laptop/desktop, thus low compute power) and we have an on-premise compute server with large memory capacity, large data storage, and a Dask scheduler and workers. With this compute server we can launch distributed computing tasks from our local machines via Dask, from Python scripts and notebooks.

The benefits of our setup are:

  • Large data files are only stored on the compute server.
  • Dask simply distributes or parallelizes a heavy computation load.
  • The memory of the local machine is not overloaded.

Hope this description answers your needs.

@shcheklein
Member

@PeterFogh thanks! It makes a lot of sense. How do you plan to use DVC in this workflow? What problem are you trying to solve with DVC in the configuration you described?

@PeterFogh
Contributor Author

PeterFogh commented Mar 5, 2019

@shcheklein. The problem I want to solve is to use a global remote to specify a local remote. With that setup, each team member defines their global remote with their own SSH user credentials, and each project (i.e. git repository) defines a project-specific remote SSH cache.
Thereby, we can simply separate the cache of each project, while each user can still use the shared cache.

I expect the DVC commands to look like the following:

cd project_folder
dvc remote add compute_server ssh://[compute_server_IP]/ --global
dvc remote modify compute_server user [OUR_USERNAME] --global
dvc remote modify compute_server port 22 --global
dvc remote modify compute_server keyfile [PATH_TO_OUR_SSH_KEY] --global
dvc remote add compute_server_project_cache remote://compute_server/dvc_project_cache
dvc config cache.ssh compute_server_project_cache

I'm sorry if this is just a repetition of the first post, but this is the problem the new feature would solve.

The key is that my team members need to use the global config, as their SSH credentials are not the same, and that I wish to have a cache folder per project whose configuration is stored in Git.
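The split described here, with per-user credentials in the global config and the project cache in the repository config, can be pictured as layering two sets of settings. Below is a minimal sketch under the assumption that repository-level settings are applied on top of global ones; `merge_config`, the dictionary layout, and all values are hypothetical and only illustrate the idea.

```python
def merge_config(*levels):
    """Merge config levels in order; later levels override earlier ones per key."""
    merged = {}
    for level in levels:
        for section, options in level.items():
            merged.setdefault(section, {}).update(options)
    return merged


# Per-user settings, kept outside Git (placeholder credentials).
global_config = {
    'remote "compute_server"': {
        "url": "ssh://192.0.2.10/",
        "user": "alice",
        "port": "22",
        "keyfile": "~/.ssh/id_rsa",
    },
}

# Project settings, committed to Git and shared by the whole team.
repo_config = {
    'remote "compute_server_project_cache"': {
        "url": "remote://compute_server/dvc_project_cache",
    },
    "cache": {"ssh": "compute_server_project_cache"},
}

effective = merge_config(global_config, repo_config)
for section, options in effective.items():
    print(section, options)
```

Only the repository part contains nothing user-specific, so it is safe to commit, while each team member supplies a `compute_server` entry of their own.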

@shcheklein
Member

Thanks @PeterFogh ! I think I understand the idea with multiple independent remote SSH workspaces. What I'm trying to understand better is how do you see the value of DVC in your workflow in general. This question is not even directly related to this feature request, but might help me better understand how this should be implemented (and again, sorry for this question if you answered it before on Discord).

@PeterFogh
Contributor Author

Hi @shcheklein. I'm not sure what you want answers to. How about we have a voice chat on Discord? I am available on Monday, 11 March, central time, between 04.00 and 11.00 - see https://www.worldtimebuddy.com/?qm=1&lid=5391959,6,2624652&h=2624652&date=2019-3-11&sln=10-18.
Does that suit you?

@PeterFogh
Contributor Author

Hi @shcheklein and @efiop, it was nice to see you all during our video call last Monday. I understood that my team's use of DVC with Dask was not intuitive to you, so I have created this repo, https://github.com/PeterFogh/dvc_dask_use_case, which explains the setup and contains an example of a DVC pipeline using remote computation with Dask.

I'm not sure if you can use the GitHub repo for anything, but at least I can use it for testing newer/development versions of DVC against our DVC/Dask setup :)

@efiop
Contributor

efiop commented Apr 9, 2019

@PeterFogh 0.35.5 with this feature is released. Feel free to give it a try 🙂

@PeterFogh
Contributor Author

Hi @efiop and @shcheklein. Sorry for the delay on my part, but I have now implemented the use of this new feature, "DVC remote via global remote", in my use case repository at https://github.com/PeterFogh/dvc_dask_use_case and it works like a charm 👍 🎉 Good work.

As you can see there, my team and I can now each have an individual global remote that defines our SSH credentials once, and then specify the project-specific remote cache and the project/user-specific remote data location (i.e. what I call the "DVC workspace") in each git repository.
Additionally, as seen in the conf.py file, we can get the DVC workspace path in the Python code via your Python API. This simplifies the code, because the path is defined only in the DVC config rather than in both the DVC config and a hard-coded Python variable.
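The idea of defining the workspace path only in the DVC config can be sketched as follows. This does not reproduce conf.py and does not use DVC's Python API: it is a stand-in that parses config text with the standard library's `configparser`, and it assumes the config uses INI-style sections such as `['remote "<name>"']`. The remote name `compute_server_project_workspace` and the helper `get_remote_url` are hypothetical.

```python
import configparser

# Assumed shape of a repository-level DVC config; the names are placeholders.
SAMPLE_CONFIG = """
['remote "compute_server_project_workspace"']
url = remote://compute_server/dvc_project_workspace
[cache]
ssh = compute_server_project_cache
"""


def get_remote_url(config_text, remote_name):
    """Return the url option of the given remote from INI-style config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The section header keeps its quoting: 'remote "<name>"'
    section = "'remote \"{}\"'".format(remote_name)
    return parser[section]["url"]


workspace_url = get_remote_url(SAMPLE_CONFIG, "compute_server_project_workspace")
print(workspace_url)
```

Reading the value from the config at import time keeps a single source of truth for the path, which is the simplification described above.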

Again, thanks a lot for your interest in discussing and implementing this feature :)
