Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Drive support further enhancements #2865

Closed
22 of 33 tasks
maxhora opened this issue Nov 28, 2019 · 27 comments
Closed
22 of 33 tasks

Google Drive support further enhancements #2865

maxhora opened this issue Nov 28, 2019 · 27 comments
Assignees
Labels
enhancement Enhances DVC help wanted p2-medium Medium priority, should be done, but less important

Comments

@maxhora
Copy link
Contributor

maxhora commented Nov 28, 2019

This ticket is to keep track of required further improvements for Google Drive remote implementation after #2551 will be merged into master:

  • Processing of auth exceptions and printing more meaningful error message on failure. GDrive remote support #2551 (comment)

  • Implement gc command support for gdrive ( def remove(self, path_info) overloading )

  • Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create): )

  • Simplify a few things using new and shiny @wrap_prop and filter_errors param of @retry from funcy 1.14 :)

  • Make name -> id deterministic by choosing minimum id GDrive remote support #2551 (comment)

  • Reimplement caching to have 2 entry points .path_info_to_ids() and .id_to_path_info(). The idea is to remove cache dictionaries completely, but it will require introduction of helper method which accepts parent_id, title and returns the actual id ( this method introduces 1 -> 1 relation between input and resulting remote id ). GDrive remote support #2551 (comment)

  • Simplify resolve_remote_item_from_path GDrive remote support #2551 (comment)

  • Protect create_remote_dir with lock. GDrive remote support #2551 (comment)

  • Organize methods, params in ordered and unified way. GDrive remote support #2551 (comment)

  • Enhance retrieving of HTTP error from GDrive API exceptions maxhora@783d8b6#r36184990

  • Create Iterative Google Drive account and Project to share client id and secret with final DVC users. Highest possible API usage limits quotas should be requested. Probably, it will be needed to have separate Google Project for CI. GDrive remote support #2551 (comment)

  • Stop creating a path to the DVC remote root if it does not exist. Since Google Drive allows multiple folders with the same name (at least on in My Drive and one in Shared With Me) in case a path like gdrive://root/storage is used to access, collaborators see a storage empty folder after the first dvc pull attempt`. Instead we should just throw path does not exist.

  • Hide /Users/ivan/Projects/test-gdrive/.env/lib/python3.7/site-packages/oauth2client/_helpers.py:255: UserWarning: Cannot access /Users/ivan/Projects/test-gdrive/.dvc/tmp/gdrive-user-credentials.json: No such file or directory warnings.warn(_MISSING_FILE_MESSAGE.format(filename)) on the first auth.

  • Reconsider self.no_traverse = False. Now even to pull a single file we run the full traversal. We can at least start listing only prefixes we need (e.g. remote/0c/* if we need to check if file 0c1234...ef exists), we can use a parallel exists if we need to check less than 256 files, etc.

  • Put notes in docs that path like gdrive://root is not accessible by other people - that ID must be used to actually share data with other team members.

  • We should store one credentials file per remote

  • Do not allow empty root DVC remote path (except shared?). I think it can prevent a lot of strange issues. More on this here remote add: should gdrive://root be an error? #3586

  • Add _download progress Google Drive support further enhancements #2865 (comment) -> gdrive: download: stream & add progress #3722

  • Support file streaming in dvc.api.open() function (more details in api: support streaming from Google Drive remotes #3408) and and update docs

  • Add import-url support

  • Check that close() is handles in the dvc.api.open() context manager.

  • Support show URL in dvc get for Gdrive. We can generate a HTTPS link that can be used in a browser if user is authed. The same link as you would get if you download a file from UI.

  • Fix credentials management for external repos (when we do dvc get, etc). We should have a way to cache them.

  • Support external dependencies and outputs

  • Check that dvc get-url functions properly. Need to come up with some credentials management.

  • Review and add more tests if needed. Starting from API tests.

  • Add an explanation comment about how path vs ids work, all non 1-1 stuff. Someone running into GDrive for first time will appreciate GDrive remote support #2551 (review). Explain how we deal with it, how caching is involved too.

  • Notify user on retries, keep the message on the screen if they are happening at a certain rate

  • Consider raising an exception if there are multiple remote root directories with the same name

  • Move _gdrive_* helpers into a separate module api to simplify testing and reading the code

  • When it's more or less stable make it trusted by default

  • Check if there is a way to generate URLs for GDrive publicly available files to download them w/o auth. Examples: https://github.com/NVlabs/stylegan/blob/master/pretrained_example.py and https://github.com/NVlabs/stylegan/blob/master/dnnlib/util.py . dvc get ideally should work w/o asking to Auth for public objects.

  • Improve performance, pass fields, consider using batch for exists call - https://developers.google.com/drive/api/v2/performance , see example here docs: How to get specific fields when listing files. PyDrive2#42

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Nov 28, 2019
@maxhora maxhora mentioned this issue Nov 28, 2019
2 tasks
@shcheklein
Copy link
Member

@Maxris would be helpful if you include links to the discussions in the initial PR.

@maxhora
Copy link
Contributor Author

maxhora commented Nov 28, 2019

@shcheklein sure 👍

@casperdcl
Copy link
Contributor

casperdcl commented Dec 2, 2019

also add _download progress (atm only uploads will provide progress). Specifically,

gdrive_file.GetContentFile(to_file)

see #2875 (comment) (source at https://github.com/gsuitedevs/PyDrive/blob/68cea204cdcd10bab9321580164c8f4961385a0f/pydrive/files.py#L202) which means this is non-trivial to add

@casperdcl casperdcl mentioned this issue Dec 2, 2019
8 tasks
@maxhora
Copy link
Contributor Author

maxhora commented Dec 3, 2019

@casperdcl thanks, have added this to the list

@maxhora
Copy link
Contributor Author

maxhora commented Dec 11, 2019

@efiop @Suor could you please review the order of priority for available things to be addressed in this ticket? Thanks!

@Suor
Copy link
Contributor

Suor commented Dec 12, 2019

@Maxris maybe start from tests, then pick whatever you like.

@antonkulaga
Copy link

One more GDrive related issue that can be added to the list #3015

@tierney6
Copy link

Hi all, here is a screenshot of an issue I have with pushing to gdrive, and the subsequent creation of duplicate folders
Screenshot from 2020-01-14 10-17-39

@efiop
Copy link
Contributor

efiop commented Jan 14, 2020

@tierney6 Thanks for reporting! @Maxris is currently researching the issue. Stay tuned! 🙂

@jorgeorpinel
Copy link
Contributor

(Added "Support file streaming in dvc.api.open() function (more details in #3408)" checkbox.)

@shcheklein
Copy link
Member

@Maxris @Suor do you remember what Validate resolved remote file object to have expected title ( def resolve_remote_item_from_path(self, path_parts, create): ) was about?

@maxhora
Copy link
Contributor Author

maxhora commented Mar 2, 2020

@shcheklein just to put assert to check remote file has expected title. The idea was to add more validation of data received from GDrive API. For example, if we resolved file by id, it makes sense to validate the title with expected.

@shcheklein
Copy link
Member

@Maxris k, so meaning that we need to put an assert in the _get_remote_item ?

@maxhora
Copy link
Contributor Author

maxhora commented Mar 2, 2020

yes, exactly

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 3, 2020

Put notes in docs that path like gdrive://root is not accessible by other people - that ID must be used to actually share data with other team members.

I'm addressing this in iterative/dvc.org#1096 (review)

Also, how about

  • Add dvc config gdrive.client_id and gdrive.client_secret config options so you don't have to set it up per remote

? Maybe also for service account params.

@shcheklein
Copy link
Member

Also, how about

not sure about this, I don't see that much difference with other remotes and other params they have

@casperdcl
Copy link
Contributor

casperdcl commented May 2, 2020

Regarding adding _download progress, it's super hard because of the sheer stack of libraries and functions which were not designed for any form of iteration or callbacks in mind. There are a lot of places where it may look like progress could be added but really you'd only be monitoring a pipe after the file has been downloaded.

CC @shcheklein the stack:

  • dvc.remote.gdrive.GDriveRemote._gdrive_download_file()
    • gdrive_file.GetContentFile()
    • == pydrive2.files.GoogleDriveFile.GetContentFile()
      • FetchContent()
        • _DownloadFromUrl()
          • self.http.request()
          • == httplib2.Http.request()
          • The return value is [...] a string that contains the response entity body

            • _request(self, conn : httplib2.HTTPSConnectionWithTimeout, ...)
              • _conn_request(self, conn, ...)
                • response.read()
                  • where response == conn.getresponse()
                  • == httplib2.HTTPSConnectionWithTimeout.getresponse()
                  • == http.client.HTTPSConnection.getresponse()
                  • == http.client.HTTPResponse
                • http.client.HTTPResponse.read()
                  • _readall_chunked()
                    • can easily add progress here

Could monkey-patch doing:

class Custom(http.client.HTTPResponse):
    def _readall_chunked(...):
        with Tqdm(...) as pbar:
            ...

httplib2.HTTPConnectionWithTimeout.response_class = Custom
gdrive_file.GetContentFile()
httplib2.HTTPConnectionWithTimeout.response_class = http.client.HTTPResponse

but it's not pretty.

Opened upstream httplib2/httplib2#165

casperdcl added a commit to casperdcl/dvc that referenced this issue May 5, 2020
efiop pushed a commit that referenced this issue May 6, 2020
* wip

* gdrive: add progress

Part of #2865
See #2865 (comment)

* gdrive: move towards next pydrive2 release

- depends on iterative/PyDrive2#30

* update to latest pydrive>=1.4.11

* avoid unneeded API call

* progress: gdrive: ensure proper bar_format
casperdcl added a commit to casperdcl/dvc that referenced this issue May 30, 2020
@casperdcl casperdcl mentioned this issue May 30, 2020
8 tasks
casperdcl added a commit to casperdcl/dvc that referenced this issue Jun 12, 2020
efiop pushed a commit that referenced this issue Jun 13, 2020
* gdrive: add open

Fixes #3408
Related #2865
Fixes #3897

* dependency: add gdrive

* test: api: open: gdrive

* rollback pydrive dependency

* Revert "dependency: add gdrive"

This reverts commit fd33326.

* tests: remotes: GDrive: fix get_url

* tests: api: fully test GDrive

* minor typo fix

* attempt gdrive tests fix

* fix gdrive credentials config

* replace GDrivePathNotFound => FileMissingError

* fix test_open_external[GDrive]

* add ensure_dir_scm

* minor exception fix
@shcheklein shcheklein added p2-medium Medium priority, should be done, but less important help wanted labels Oct 28, 2020
@julsal
Copy link
Contributor

julsal commented Nov 3, 2020

Dear @shcheklein and @efiop, I'm starting working today on the "add import-url support for GDrive"

@dantp-ai
Copy link

dantp-ai commented Jul 1, 2021

Dear @julsal, I am glad you're implementing it. I presume your current work is here: https://github.com/iterative/dvc/tree/gdrive-import

When do u expect to finish it?

@efiop
Copy link
Contributor

efiop commented Jul 1, 2021

@plopd We are migrating to fsspec right now iterative/PyDrive2#113 and will ressurect import-url support for gdrive after that is sorted. ETA is August.

@dantp-ai
Copy link

dantp-ai commented Jul 1, 2021

@efiop Great. Until then, I think we should remove the gdrive row, since really as of now, DVC is not supporting that protocol type. I got confused when I tried to use import-url for gdrive only finding out that it's not even available in the newest version.

efiop added a commit to iterative/dvc.org that referenced this issue Jul 1, 2021
jorgeorpinel pushed a commit to iterative/dvc.org that referenced this issue Jul 6, 2021
* import-url: remove gdrive

iterative/dvc#2865 (comment)

* get-url: remove gdrive

* Restyled by prettier (#2598)

Co-authored-by: Restyled.io <[email protected]>

Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
@itcarroll
Copy link

With iterative/PyDrive2#113 closed, would it be possible to resume GDrive support for import-url?

@pmrowla
Copy link
Contributor

pmrowla commented Sep 28, 2021

cc: @efiop @isidentical

@isidentical
Copy link
Contributor

I think we should close this issue, and open separate ones for upcoming feature requests. This seems like a really giant but outdated epic.

With iterative/PyDrive2#113 closed, would it be possible to resume GDrive support for import-url?

As far as I can tell, everything should be there for import-url, so let's create another issue to finalize the support / add it docs etc. (if I am not missing something)

@pmrowla
Copy link
Contributor

pmrowla commented Oct 5, 2021

Closing this meta-ticket since it's very outdated. Separate issues for any missing features in the fssspec based gdrivefs should be opened as needed.

@pmrowla pmrowla closed this as completed Oct 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhances DVC help wanted p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

No branches or pull requests