Onboarding to DVC Remotes #3867

Closed
5 of 7 tasks
mattseddon opened this issue May 11, 2023 · 10 comments
Labels
A: onboarding Improving and simplifying users happy path. How do we get them have value asap? blocked Issue or pull request blocked due to other dependencies or issues discussion priority-p1 Regular product backlog 📦 product Needs product input or is being actively worked on story Product feature aka epic. Discussion, progress, checkboxes for implementation, etc

Comments

@mattseddon
Member

mattseddon commented May 11, 2023

The main goal of this story is to get new users onboarded to using remotes. We should also provide a happy path for experienced users who are setting up new projects but that is a secondary concern.

Steps

Helping with credentials

Just from looking at the docs, setting up a set of quick inputs/picks to assist with this feels like it would be a story by itself. For example, AWS can take a configpath, credentialpath, profile... and more; Azure Blob Storage can take a connection_string or... you get the point. This task wouldn't be impossible, but it would be time-consuming.
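To make the scale of the problem concrete, here is a minimal sketch of what a set of quick inputs would have to produce for just two remote types. It only builds `dvc remote modify` argument lists and does not run them. The helper name `remote_modify_commands`, the remote name `myremote` and the example values are assumptions for illustration; the option names (`profile`, `credentialpath`, `connection_string`) are the ones mentioned above.

```python
from typing import Dict, List


def remote_modify_commands(
    remote: str, options: Dict[str, str], local: bool = True
) -> List[List[str]]:
    """Build one `dvc remote modify` argument list per option.

    Credentials are written with --local by default so that they stay
    out of the config file that is committed to Git.
    """
    base = ["dvc", "remote", "modify"]
    if local:
        base.append("--local")
    return [base + [remote, key, value] for key, value in options.items()]


# Hypothetical values a quick-pick flow might collect from the user.
aws_commands = remote_modify_commands(
    "myremote",
    {"profile": "myprofile", "credentialpath": "~/.aws/credentials"},
)
azure_commands = remote_modify_commands(
    "myremote",
    {"connection_string": "<connection string>"},
)

for command in aws_commands + azure_commands:
    print(" ".join(command))
```

Every remote type needs its own set of prompts and validation on top of this, which is where most of the effort would go.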

Quickstart

tl;dr - I think we should provide a GET and POST endpoint from Studio that hides the implementation completely from the user.

There are a lot of questions surrounding this part of the story. If the idea is to drive Studio usage then we should tie Studio in. This then limits the implementation options. To me, the simplest thing to do would be along the lines of wrapping S3 with the Python AWS SDK in the Studio backend and providing a set of endpoints that we can hit with a specific token/repo URL/ref and associated data (for POST only). To begin with, we can write a standalone client for the extension but (IMO) this should be migrated into DVC when we have the capacity to do so. By this, I mean that "Studio" should be a valid option for the remote entry in the DVC config. The user would be able to put this in their project config along with the token in their local one. I wouldn't go as far as setting Studio as the default if a token is provided and a remote is not but that could also be an option.
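As a rough illustration of the standalone client idea, the sketch below builds (but does not send) the GET and POST requests described above. Everything about the endpoint is an assumption: the base URL `https://studio.example.com/api/storage`, the `repo_url`/`ref` query parameters and the `token` authorization scheme are placeholders, since no Studio API has been defined yet.

```python
import urllib.parse
import urllib.request
from typing import Optional

# Hypothetical base URL: the real endpoint is not defined in this thread.
STUDIO_STORAGE_URL = "https://studio.example.com/api/storage"


def build_storage_request(
    token: str, repo_url: str, ref: str, data: Optional[bytes] = None
) -> urllib.request.Request:
    """Build a GET request (no data) or a POST request (with data)."""
    query = urllib.parse.urlencode({"repo_url": repo_url, "ref": ref})
    headers = {"Authorization": f"token {token}"}  # assumed auth scheme
    method = "GET" if data is None else "POST"
    return urllib.request.Request(
        f"{STUDIO_STORAGE_URL}?{query}",
        data=data,
        headers=headers,
        method=method,
    )


# Nothing is sent over the network; the requests are only constructed.
download = build_storage_request("my-token", "https://github.com/org/repo", "main")
upload = build_storage_request(
    "my-token", "https://github.com/org/repo", "main", data=b"file contents"
)

print(download.get_method(), download.full_url)
print(upload.get_method(), len(upload.data or b""))
```

The user only ever supplies a token; the repo URL and ref can be read from the workspace, which is what keeps the storage implementation hidden.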

cc @shcheklein @dberenbaum. Please LMK what you think. Happy to raise sibling issues in Studio/DVC if required 🙏🏻. Please LMK if I have missed anything important from the initial steps.

@mattseddon mattseddon added story Product feature aka epic. Discussion, progress, checkboxes for implementation, etc priority-p1 Regular product backlog discussion 📦 product Needs product input or is being actively worked on A: onboarding Improving and simplifying users happy path. How do we get them have value asap? labels May 11, 2023
@mattseddon mattseddon self-assigned this May 11, 2023
@dberenbaum
Contributor

  • Enable add, default, remove, rename & modify commands to be accessed via the setup page and command palette
  • default could be shoved into add and modify, where we have an option to make it the default.
  • rename could also be part of modify.

Helping with credentials

I think we need a way to at least show what remote types are supported and pick from among those (we could have one type like custom/misc to group less popular remotes). Beyond that, it definitely gets complicated 😅.

Some ideas:

  • For AWS, for example, we could install/ask to install the AWS CLI and/or the AWS VS Code extension and take them through aws configure or direct them to the AWS VS Code extension to add a connection. Alternatively, we could just look at the aws configure workflow and ask for those same fields, saving them in either the DVC config or the AWS config.
  • Try to create a schema for the config file similar to the one for dvc.yaml so that we can suggest what fields are valid for that remote type (since we don't have any remote type field, this is probably hard to do, but maybe we can parse it from the URL).
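The second idea could be sketched roughly as follows: infer a remote type from the URL scheme, then look up which fields make sense for it. The mapping tables here are a small illustrative subset written for this example, not DVC's actual schema, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set
from urllib.parse import urlparse

# Illustrative subset only: not an exhaustive list of schemes or options.
SCHEME_TO_TYPE: Dict[str, str] = {
    "s3": "s3",
    "gs": "gs",
    "azure": "azure",
    "gdrive": "gdrive",
    "http": "http",
    "https": "http",
}

FIELDS_BY_TYPE: Dict[str, Set[str]] = {
    "s3": {"url", "profile", "credentialpath", "configpath", "region"},
    "gs": {"url", "projectname", "credentialpath"},
    "azure": {"url", "connection_string", "account_name", "account_key"},
}


def infer_remote_type(url: str) -> str:
    """Guess the remote type from the URL scheme; fall back to 'custom'."""
    scheme = urlparse(url).scheme.lower()
    return SCHEME_TO_TYPE.get(scheme, "custom")


def unknown_fields(url: str, fields: Iterable[str]) -> Set[str]:
    """Return the fields that are not known for the inferred remote type."""
    allowed = FIELDS_BY_TYPE.get(infer_remote_type(url))
    if allowed is None:
        return set()  # no schema for this type, so nothing can be flagged
    return set(fields) - allowed


print(infer_remote_type("s3://mybucket/dvcstore"))
print(unknown_fields("azure://mycontainer/path", ["url", "profile"]))
```

Anything without a recognised scheme lands in the custom group, which lines up with the idea of a single custom/misc type for less popular remotes.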

Quickstart

For a studio remote, can it be an http remote with a header/auth option we use for the token? Once you have the token, it should hopefully be enough to fill out the rest of the remote info using what's already supported in dvc.
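A sketch of what that could look like, assuming the HTTP remote's custom auth mode (`auth custom` with `custom_auth_header` and `password`) is used to carry the token. The Studio URL, the remote name `studio` and the `token <value>` header format are assumptions for illustration; only the token is written to the local config so it is not committed.

```python
from typing import List

# Hypothetical storage URL: Studio does not expose this endpoint today.
STUDIO_REMOTE_URL = "https://studio.example.com/api/storage"


def studio_http_remote_commands(token: str, name: str = "studio") -> List[List[str]]:
    """Build the DVC commands that configure an HTTP remote from a token."""
    return [
        # Project-level settings, safe to commit.
        ["dvc", "remote", "add", name, STUDIO_REMOTE_URL],
        ["dvc", "remote", "modify", name, "auth", "custom"],
        ["dvc", "remote", "modify", name, "custom_auth_header", "Authorization"],
        # The secret goes into the local (uncommitted) config.
        ["dvc", "remote", "modify", "--local", name, "password", f"token {token}"],
    ]


commands = studio_http_remote_commands("my-token")
for command in commands:
    print(" ".join(command))
```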

I also think it might be good to explore gdrive remotes as another quickstart option since most people will have it already. For example, we could provide 2 gdrive paths: 1 for quickstart/personal use and another for shared/custom use. For the quickstart path, we could default to dvc remote add myremote gdrive://appDataFolder, which I think should just work for most users.

@mattseddon
Member Author

I think we need a way to at least show what remote types are supported and pick from among those (we could have one type like custom/misc to group less popular remotes). Beyond that, it definitely gets complicated 😅.

Can you give me the list of remotes by popularity? Also which remotes do we get the most questions on?

Some ideas:

  • For AWS, for example, we could install/ask to install aws cli and/or the aws vs code extension and take them through aws configure or direct them to the aws vs code extension to add a connection. Alternatively, we could just look at the aws configure workflow and ask for those same fields, saving them in either the dvc config or aws config.
  • Try to create a schema for the config file similar to the one for dvc.yaml so that we can suggest what fields are valid for that remote type (since we don't have any remote type field, this is probably hard to do, but maybe we can parse it from the URL).

Once you give me the above list I'll see which extensions are available for us to tie in with.

For a studio remote, can it be an http remote with a header/auth option we use for the token? Once you have the token, it should hopefully be enough to fill out the rest of the remote info using what's already supported in dvc.

I also think it might be good to explore gdrive remotes as another quickstart option since most people will have it already. For example, we could provide 2 gdrive paths: 1 for quickstart/personal use and another for shared custom/use. For the quickstart path, we could default to dvc remote add myremote gdrive://appDataFolder, which I think should just work for most users.

I'd recommend that we pick one as both will take effort to implement but more effort to support. Again, if we are trying to drive people towards trying Studio then I think we should relegate gdrive to a placeholder in the helping with credentials/setting up remote section.

@dberenbaum
Contributor

Can you give me the list of remotes by popularity?

  1. AWS
  2. GCS
  3. Azure

For "real" remotes, I would start there.

@dberenbaum
Contributor

@shcheklein @omesser Do we have a Studio ticket to track this Studio storage story? It seems like we are ready to at least add it to Studio's upcoming priorities, right?

@mattseddon
Member Author

I keep bumping up against there being 4 levels of config and not knowing which one(s) to update in order for changes to take effect. Two examples are studio.offline and remote.url. In order to solve this I am going to have to build an in-memory representation of the configs (like dvc config -l --show-origin but with the level attached as well). IMO it would be a good idea to show this to the user as a table under a "Config" section on the Setup page. It would be fairly straightforward to let them open the appropriate config from that section too.
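A minimal sketch of such an in-memory representation, assuming the four levels are system, global, project and local, with later levels overriding earlier ones. The `ConfigEntry` type, the function name and the sample values are made up for illustration; in practice the per-level dictionaries would be filled by reading each config file.

```python
from typing import Dict, List, NamedTuple

# Ordered from lowest to highest precedence.
LEVELS: List[str] = ["system", "global", "project", "local"]


class ConfigEntry(NamedTuple):
    key: str
    value: str
    level: str  # the level that supplied the effective value


def effective_config(configs: Dict[str, Dict[str, str]]) -> Dict[str, ConfigEntry]:
    """Merge per-level configs, remembering which level each value came from."""
    merged: Dict[str, ConfigEntry] = {}
    for level in LEVELS:
        for key, value in configs.get(level, {}).items():
            merged[key] = ConfigEntry(key, value, level)  # higher level wins
    return merged


# Sample data standing in for the parsed config files.
sample = {
    "global": {"studio.offline": "true"},
    "project": {"remote.myremote.url": "s3://mybucket/dvcstore"},
    "local": {"remote.myremote.url": "s3://otherbucket/dvcstore"},
}

table = effective_config(sample)
for entry in table.values():
    print(f"{entry.level:<8} {entry.key} = {entry.value}")
```

Each row of the result maps directly to a row of the proposed "Config" table, and the `level` field tells the extension which file to open or update.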

Any thoughts on this?

@shcheklein
Member

@mattseddon it sounds good, but a bit out of scope for this? For the sake of simplicity we can optimize the "happy path" scenario initially - one storage, being able to set it up (no need to support updates even?), etc. Wdyt?

@mattseddon
Member Author

@mattseddon it sounds good, but a bit out of scope for this? For the sake of simplicity we can optimize the "happy path" scenario initially - one storage, being able to set it up (no need to support updates even?), etc. Wdyt?

You can already add a remote to the project. Help with credentials/login is started.

@tapadipti

I'd like to confirm - the Studio story here is about providing some Studio(Iterative)-hosted storage so that a user can start to push to dvc remotes without setting up their own cloud remotes, right?

@mattseddon
Member Author

I'd like to confirm - the Studio story here is about providing some Studio(Iterative)-hosted storage so that a user can start to push to dvc remotes without setting up their own cloud remotes, right?

That is correct. We'd like to provide a "quick start" option for users to onboard to remotes. The idea is to use Studio because that path should:

  • make it more likely that users try Studio and see value there
  • simplify onboarding for DVC/the extension
  • help to drive overall adoption

@shcheklein shcheklein added the blocked Issue or pull request blocked due to other dependencies or issues label Jun 20, 2023
@mattseddon
Member Author

Moved the remaining tasks into the above issue

4 participants