
HttpError 400 when pushing to Google Drive #3098

Closed
RomanVeretenov opened this issue Jan 10, 2020 · 12 comments
Labels
awaiting response we are waiting for your reply, please respond! :) bug Did we break something? p1-important Important, aka current backlog of things to do

Comments

@RomanVeretenov

RomanVeretenov commented Jan 10, 2020

DVC version 0.80.0, installed via pip
Ubuntu 18.04.2 LTS

I'm getting the following error when pushing to a Google Drive remote:
ERROR: unexpected error - <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "Invalid query">

I have a repo with a lot of files (several million). My GDrive storage has unlimited space.

Pushing to a "local remote" works fine. But after adding a gdrive remote, I first got an "ulimit error" (fixed by adding the --jobs 8 param to dvc push), and now I always get HttpError 400.

I've also run the push command with -v. I can't paste the full log here because it's huge and reproducing this issue with -v takes several hours, but the tail is as follows:

DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/71/093b090f3c128c0acfb621e0529016' expected '71093b090f3c128c0acfb621e0529016' actual '71093b090f3c128c0acfb621e0529016'
DEBUG: Path .dvc/cache/98/07cad22fba84ba0a6f36cb2c65d9b8 inode 58056549
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '8266', '9807cad22fba84ba0a6f36cb2c65d9b8', '1578597521440623104')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/98/07cad22fba84ba0a6f36cb2c65d9b8' expected '9807cad22fba84ba0a6f36cb2c65d9b8' actual '9807cad22fba84ba0a6f36cb2c65d9b8'
DEBUG: Path .dvc/cache/26/ee9f409e2b5ca63043c8069964f19c inode 58056554
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '8860', '26ee9f409e2b5ca63043c8069964f19c', '1578597521440708352')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/26/ee9f409e2b5ca63043c8069964f19c' expected '26ee9f409e2b5ca63043c8069964f19c' actual '26ee9f409e2b5ca63043c8069964f19c'
DEBUG: Path .dvc/cache/54/cb6d33cdadc555eafc7dff2a2148ca inode 58056552
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '10836', '54cb6d33cdadc555eafc7dff2a2148ca', '1578597521440793344')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/54/cb6d33cdadc555eafc7dff2a2148ca' expected '54cb6d33cdadc555eafc7dff2a2148ca' actual '54cb6d33cdadc555eafc7dff2a2148ca'
DEBUG: Path .dvc/cache/71/eb2c036e63e1f8be8a16e0f64997a8 inode 58056553
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '10687', '71eb2c036e63e1f8be8a16e0f64997a8', '1578597521440878336')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/71/eb2c036e63e1f8be8a16e0f64997a8' expected '71eb2c036e63e1f8be8a16e0f64997a8' actual '71eb2c036e63e1f8be8a16e0f64997a8'
DEBUG: Path .dvc/cache/c9/f744d51e4ecee33f9584a7329f5e5e inode 58056558
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '8707', 'c9f744d51e4ecee33f9584a7329f5e5e', '1578597521440963840')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/c9/f744d51e4ecee33f9584a7329f5e5e' expected 'c9f744d51e4ecee33f9584a7329f5e5e' actual 'c9f744d51e4ecee33f9584a7329f5e5e'
DEBUG: Path .dvc/cache/f8/d9eff37845bf80b9babb3b5a32b5b7 inode 58056556
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '9227', 'f8d9eff37845bf80b9babb3b5a32b5b7', '1578597521441049600')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/f8/d9eff37845bf80b9babb3b5a32b5b7' expected 'f8d9eff37845bf80b9babb3b5a32b5b7' actual 'f8d9eff37845bf80b9babb3b5a32b5b7'
DEBUG: Path .dvc/cache/83/0c83c3f87ffbe2ea7b7a26ad948a1f inode 58056557
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '11107', '830c83c3f87ffbe2ea7b7a26ad948a1f', '1578597521441135360')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/83/0c83c3f87ffbe2ea7b7a26ad948a1f' expected '830c83c3f87ffbe2ea7b7a26ad948a1f' actual '830c83c3f87ffbe2ea7b7a26ad948a1f'
DEBUG: Path .dvc/cache/ba/8ede7cb76ea3b76c77194255a719a3 inode 58056562
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '9197', 'ba8ede7cb76ea3b76c77194255a719a3', '1578597521441220608')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/ba/8ede7cb76ea3b76c77194255a719a3' expected 'ba8ede7cb76ea3b76c77194255a719a3' actual 'ba8ede7cb76ea3b76c77194255a719a3'
DEBUG: Path .dvc/cache/b2/4c6b59b2fa3b5185a277a2e69890a3 inode 58056560
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '11572', 'b24c6b59b2fa3b5185a277a2e69890a3', '1578597521441305344')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/b2/4c6b59b2fa3b5185a277a2e69890a3' expected 'b24c6b59b2fa3b5185a277a2e69890a3' actual 'b24c6b59b2fa3b5185a277a2e69890a3'
DEBUG: Path .dvc/cache/9c/d5fe6f6f61eb41297fd1d6fc6fa8e0 inode 58056561
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577438884511567104', '10263', '9cd5fe6f6f61eb41297fd1d6fc6fa8e0', '1578597521441390336')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/9c/d5fe6f6f61eb41297fd1d6fc6fa8e0' expected '9cd5fe6f6f61eb41297fd1d6fc6fa8e0' actual '9cd5fe6f6f61eb41297fd1d6fc6fa8e0'
DEBUG: Path .dvc/cache/a0/ba801ae101d3b6ed38b4408364606b inode 58056563
DEBUG: SELECT mtime, size, md5, timestamp from state WHERE inode=?
DEBUG: fetched: [('1577439940173467904', '31477647', 'a0ba801ae101d3b6ed38b4408364606b', '1578597521441476608')]
DEBUG: UPDATE state SET timestamp = ? WHERE inode = ?
DEBUG: cache '.dvc/cache/a0/ba801ae101d3b6ed38b4408364606b' expected 'a0ba801ae101d3b6ed38b4408364606b' actual 'a0ba801ae101d3b6ed38b4408364606b'
DEBUG: Collecting information from remote cache...
DEBUG: SELECT count from state_info WHERE rowid=?
DEBUG: fetched: [(4393492,)]
DEBUG: UPDATE state_info SET count = ? WHERE rowid = ?
ERROR: unexpected error - <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "Invalid query">


Traceback (most recent call last):
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/main.py", line 48, in main
ret = cmd.run()
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/command/data_sync.py", line 49, in run
recursive=self.args.recursive,
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/repo/__init__.py", line 32, in wrapper
ret = f(repo, *args, **kwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/repo/push.py", line 25, in push
return self.cloud.push(used, jobs, remote=remote)
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/data_cloud.py", line 62, in push
show_checksums=show_checksums,
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/local.py", line 379, in push
download=False,
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/local.py", line 352, in _process
download=download,
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/local.py", line 273, in status
md5s, jobs=jobs, name=str(remote.path_info)
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/base.py", line 796, in cache_exists
return list(set(checksums) & set(self.all()))
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/gdrive.py", line 328, in all
for file1 in self.gdrive_list_item(query):
File "/home/romav/miniconda3/lib/python3.6/site-packages/funcy/decorators.py", line 39, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/funcy/flow.py", line 122, in retry
return call()
File "/home/romav/miniconda3/lib/python3.6/site-packages/funcy/decorators.py", line 60, in call
return self._func(*self._args, **self._kwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/funcy/decorators.py", line 39, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/gdrive.py", line 42, in _wrap_pydrive_retriable
result = call()
File "/home/romav/miniconda3/lib/python3.6/site-packages/funcy/decorators.py", line 60, in call
return self._func(*self._args, **self._kwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/dvc/remote/gdrive.py", line 141, in <lambda>
get_list = gdrive_retry(lambda: next(file_list, None))
File "/home/romav/miniconda3/lib/python3.6/site-packages/pydrive/apiattr.py", line 146, in next
result = self._GetList()
File "/home/romav/miniconda3/lib/python3.6/site-packages/pydrive/auth.py", line 75, in _decorated
return decoratee(self, *args, **kwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/pydrive/files.py", line 64, in _GetList
http=self.http)
File "/home/romav/miniconda3/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/home/romav/miniconda3/lib/python3.6/site-packages/googleapiclient/http.py", line 856, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v2/files returned "Invalid query">
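The traceback shows the failure starting in `cache_exists` (dvc/remote/base.py), which intersects the wanted checksums with a full listing of the remote. Conceptually it boils down to something like this sketch (a simplification, not DVC's actual code; `list_remote_checksums` is a hypothetical stand-in for the paginated `self.all()` Drive query):

```python
def cache_exists(checksums, list_remote_checksums):
    # Push status is computed by listing *every* object on the remote and
    # intersecting with the checksums we want to push, so any HTTP error
    # while paging through millions of remote files aborts the whole push.
    remote = set(list_remote_checksums())  # paginated Drive query happens here
    return list(set(checksums) & remote)

# Hypothetical usage with a stubbed remote listing:
present = cache_exists(
    ["aa" * 16, "bb" * 16],
    lambda: iter(["bb" * 16, "cc" * 16]),
)
# -> only "bb" * 16, the one checksum already present on the "remote"
```

This is why the error only surfaces after hours: the entire remote has to be enumerated before a single new upload is decided on.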


Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jan 10, 2020
@shcheklein
Member

@Maxris any ideas off the top of your head? At the least, how should we debug this? Add more logging?

@maxhora
Contributor

maxhora commented Jan 10, 2020

It's strange that the URL from the HttpError doesn't contain any query params.

@RomanVeretenov could you please tell us what URL format you have in the config for the gdrive remote?

For example, it should be like gdrive://root/DVC, where DVC is the folder name in your GDrive's topmost "root" directory. It's better to pre-create the DVC folder manually (there is a known issue with race conditions that leads to the creation of multiple directories if the directory doesn't exist).
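For reference, a remote of that shape could be configured like this (a sketch; `myremote` and the `DVC` folder name are placeholders, and the folder should be created in the Drive web UI first):

```shell
# Pre-create a folder named "DVC" under "My Drive" in the web UI, then:
dvc remote add -d myremote gdrive://root/DVC
dvc push -v
```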

@shcheklein
Member

Btw, @RomanVeretenov check this doc https://github.com/pared/dvc.org/edit/2473/public/static/docs/user-guide/troubleshooting.md?pr=%2Fiterative%2Fdvc.org%2Fpull%2F875 (not merged yet). I think it's better to increase the ulimit significantly than to cap --jobs at 8; it can speed things up a lot.
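For anyone landing here, raising the open-files limit for the current shell session looks roughly like this (a sketch; the chosen value must stay at or below the hard limit shown by `ulimit -Hn`):

```shell
ulimit -Hn        # show the hard limit on open file descriptors
ulimit -Sn 65536  # raise the soft limit for this shell session
dvc push -v       # then push without capping --jobs
```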

Also, could you confirm whether it always takes the same amount of time before it breaks? And is there a chance that someone else is using the same DVC remote at the same moment?

@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jan 11, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 11, 2020
@efiop efiop added bug Did we break something? triage Needs to be triaged labels Jan 11, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jan 11, 2020
@RomanVeretenov
Author

RomanVeretenov commented Jan 13, 2020

> It's strange that the URL from the HttpError doesn't contain any query params.
>
> @RomanVeretenov could you please tell us what URL format you have in the config for the gdrive remote?
>
> For example, it should be like gdrive://root/DVC, where DVC is the folder name in your GDrive's topmost "root" directory. It's better to pre-create the DVC folder manually (there is a known issue with race conditions that leads to the creation of multiple directories if the directory doesn't exist).

gdrive://root/_dvc/myrepo

On the first run, dvc added some files to the gdrive folder, so I don't think there was an error caused by a race.

@RomanVeretenov
Author

RomanVeretenov commented Jan 13, 2020

> Btw, @RomanVeretenov check this doc https://github.com/pared/dvc.org/edit/2473/public/static/docs/user-guide/troubleshooting.md?pr=%2Fiterative%2Fdvc.org%2Fpull%2F875 (not merged yet). I think it's better to increase the ulimit significantly than to cap --jobs at 8; it can speed things up a lot.

I don't remember whether I tried increasing the ulimit before limiting --jobs, but I will try to investigate.

> Also, could you confirm whether it always takes the same amount of time before it breaks?

On the first run, it added some of the files to gdrive. On the second run and later, it takes almost the same amount of time each time. I can run the command with -v again, redirect the output to a file, and compare the logs from two attempts.

> Also, is there a chance that someone else is using the same DVC remote at the same moment?

Absolutely no chance.

@shcheklein
Member

> On the first run, it added some of the files to gdrive. On the second run and later, it takes almost the same amount of time each time. I can run the command with -v again, redirect the output to a file, and compare the logs from two attempts.

It looks like it fails while collecting the existing files on the remote. I think it uses some pagination mechanism, and I wonder if the "cursor" just expires or something? @Maxris any ideas? Maybe we can add more debug logging to catch the error?
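The cursor-expiry hypothesis can be illustrated with a sketch (a simulation, not the real Drive client; `HttpError400`, `fetch_page`, and the restart strategy are all hypothetical): each page is fetched with a cursor, and if the server rejects a stale cursor mid-iteration, the whole listing, and with it the whole push, fails unless the listing is restarted.

```python
class HttpError400(Exception):
    """Stand-in for googleapiclient.errors.HttpError with status 400."""

def list_all(fetch_page):
    """Drain a paginated listing; fetch_page(token) -> (items, next_token)."""
    token, items = None, []
    while True:
        page, token = fetch_page(token)
        items.extend(page)
        if token is None:
            return items

def list_all_with_restart(fetch_page, max_restarts=3):
    """Restart the listing from scratch if the server rejects a stale cursor."""
    last_err = None
    for _ in range(max_restarts):
        try:
            return list_all(fetch_page)
        except HttpError400 as exc:
            last_err = exc  # cursor rejected mid-listing; start over
    raise last_err

# Simulated server whose second-page cursor is rejected on the first pass:
state = {"failed_once": False}
def fetch_page(token):
    if token == "page2" and not state["failed_once"]:
        state["failed_once"] = True
        raise HttpError400("Invalid query")
    return (["a"], "page2") if token is None else (["b"], None)

print(list_all_with_restart(fetch_page))  # -> ['a', 'b']
```

With plain `list_all` the simulated listing would die with the 400 on the second page, which matches the symptom: the error appears only after a long enumeration of a very large remote.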

@efiop efiop added the p1-important Important, aka current backlog of things to do label Jan 15, 2020
@RomanVeretenov
Author

I have implemented a workaround on my side: I zip all the small files at the last level of the folder hierarchy and store them in dvc as archives. I have also set up hooks to unpack them after git pull/checkout.

I did this partly to speed up syncing with the local remote as well; syncing a lot of small files is really slow.
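The workaround could be sketched like this (an illustration of the idea, not the author's actual script; `zip_leaf_dirs` and the directory layout are hypothetical):

```python
import os
import zipfile

def zip_leaf_dirs(root):
    """Archive each bottom-level directory (one that contains only files)
    into a sibling .zip, so DVC tracks one object per leaf directory
    instead of thousands of small files."""
    archives = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if dirnames or not filenames or dirpath == root:
            continue  # not a leaf directory
        archive = dirpath + ".zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for name in sorted(filenames):
                zf.write(os.path.join(dirpath, name), arcname=name)
        archives.append(archive)
    return archives
```

A matching post-merge/post-checkout git hook would then unzip the archives again after git pull or checkout.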

I have recreated the gdrive remote and the error is gone. But the gdrive remote still behaves strangely; I'll open another issue.

@shcheklein
Member

@RomanVeretenov what is the number of files in the folder, btw? I'll try to reproduce this 400 error on my end.

@RomanVeretenov
Author

> @RomanVeretenov what is the number of files in the folder, btw? I'll try to reproduce this 400 error on my end.

Before I migrated to the new storage structure (zipping all files in the bottom-level directories), it was about 2 million.

@shcheklein
Member

@RomanVeretenov do you still have the remote storage that you were pushing these files to initially? To try to reproduce this with the latest DVC release?

I tried to push ~300K files and then run dvc status (which is what triggered your issue initially). It works fine for me now. I wonder if I need to go up to 2M to test this properly.

@RomanVeretenov
Author

Sorry, I'm totally out of time right now. I will try to check it, but by now I have most of the small files zipped (there can be several thousand files in one zip) and I keep these zips under dvc, so I would have to reorganize the storage to test. I think you may close this issue for now; I will let you know if something goes wrong.

@efiop efiop closed this as completed Feb 19, 2020
@efiop
Contributor

efiop commented Feb 19, 2020

Ok, closing for now. Please feel free to reopen if the issue persists. Thanks for the feedback! 🙏

Projects
None yet
Development

No branches or pull requests

4 participants