
push: unexpected error - 'NoneType' object is not iterable #29

Closed
antortjim opened this issue Jun 14, 2023 · 21 comments · Fixed by #30
Labels
p0-critical Handle immediately

Comments

@antortjim

Bug Report

Description

I just installed dvc 3.0.0 in a mamba environment inside an existing git repository. I want to upload the large files to a Google Drive remote:

pip install dvc[gdrive]
dvc remote add origin gdrive://XXX
# where XXX matches what I copied from the Google Drive URL, i.e. https://drive.google.com/drive/u/0/folders/XXX
dvc remote list
# origin	gdrive://XXX 
dvc --version
# 3.0.0

Reproduce

dvc init
dvc add data/file
git add data/file.dvc
dvc commit
git commit -m "Add data"
git push
dvc push

Expected

I expected to be taken to the Google login page, or maybe that it would simply work, since I already have another environment with dvc working and able to push/pull from another remote (it's a separate project). Instead, I get this error:

2023-06-14 14:50:48,901 DEBUG: v3.0.0 (pip), CPython 3.10.10 on Linux-5.19.0-43-generic-x86_64-with-glibc2.35
2023-06-14 14:50:48,901 DEBUG: command: /home/antortjim/mambaforge/bin/dvc push -vv
2023-06-14 14:50:48,901 TRACE: Namespace(cprofile=False, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, cprofile_dump=None, pdb=False, instrument=False, instrument_open=False, show_stack=False, quiet=0, verbose=2, cd='.', cmd='push', jobs=None, targets=[], remote=None, all_branches=False, all_tags=False, all_commits=False, with_deps=False, recursive=False, run_cache=False, glob=False, func=<class 'dvc.commands.data_sync.CmdDataPush'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-06-14 14:50:49,051 TRACE:   800.44 mks in collecting stages from /home/antortjim/opt/deconvolution
2023-06-14 14:50:49,051 TRACE:     3.13 mks in collecting stages from /home/antortjim/opt/deconvolution/PSF
2023-06-14 14:50:49,058 TRACE:     7.08 ms in collecting stages from /home/antortjim/opt/deconvolution/data
2023-06-14 14:50:49,063 DEBUG: Preparing to transfer data from '/home/antortjim/opt/deconvolution/.dvc/cache/files/md5' to '1kctIk_rDD2lNWqSlp4OHoPes4RU2itIT/files/md5'
2023-06-14 14:50:49,063 DEBUG: Preparing to collect status from '1kctIk_rDD2lNWqSlp4OHoPes4RU2itIT/files/md5'
2023-06-14 14:50:49,063 DEBUG: Collecting status from '1kctIk_rDD2lNWqSlp4OHoPes4RU2itIT/files/md5'
2023-06-14 14:50:49,064 DEBUG: Querying 1 oids via object_exists                                                                                                                                                                                                                  
2023-06-14 14:50:49,802 DEBUG: Preparing to collect status from '/home/antortjim/opt/deconvolution/.dvc/cache/files/md5'                                                                                                                                                        
2023-06-14 14:50:49,802 DEBUG: Collecting status from '/home/antortjim/opt/deconvolution/.dvc/cache/files/md5'
2023-06-14 14:50:50,058 ERROR: unexpected error - 'NoneType' object is not iterable                                                                                                                                                                                               
Traceback (most recent call last):
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/commands/data_sync.py", line 60, in run
    processed_files_count = self.repo.push(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/repo/push.py", line 92, in push
    result = self.cloud.push(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/data_cloud.py", line 196, in push
    t, f = self._push(default_objs, jobs=jobs, odb=odb)
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/data_cloud.py", line 212, in _push
    return self.transfer(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc/data_cloud.py", line 167, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 229, in transfer
    failed = _do_transfer(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 123, in _do_transfer
    failed_ids.update(_add(src, dest, all_file_ids, **kwargs))
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_data/hashfile/transfer.py", line 167, in _add
    dest.add(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_data/hashfile/db/__init__.py", line 113, in add
    transferred = super().add(
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_objects/db.py", line 162, in add
    self._init(parts)
  File "/home/antortjim/mambaforge/lib/python3.10/site-packages/dvc_objects/db.py", line 65, in _init
    self._dirs = {
TypeError: 'NoneType' object is not iterable

2023-06-14 14:50:50,076 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2023-06-14 14:50:50,076 DEBUG: Removing '/home/antortjim/opt/.RMbpX96p6HkMzrRtrEguyL.tmp'
2023-06-14 14:50:50,076 DEBUG: Removing '/home/antortjim/opt/.RMbpX96p6HkMzrRtrEguyL.tmp'
2023-06-14 14:50:50,076 DEBUG: Removing '/home/antortjim/opt/.RMbpX96p6HkMzrRtrEguyL.tmp'
2023-06-14 14:50:50,076 DEBUG: Removing '/home/antortjim/opt/deconvolution/.dvc/cache/files/md5/.HHzGBVrfhWWYpBUXh2SU5i.tmp'
2023-06-14 14:50:50,083 DEBUG: Version info for developers:
DVC version: 3.0.0 (pip)
------------------------
Platform: Python 3.10.10 on Linux-5.19.0-43-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 1.11.0
	dvc_objects = 0.23.0
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.0.3
Supports:
	gdrive (pydrive2 = 1.15.4),
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
	Global: /home/antortjim/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: gdrive
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/239f50ebd7643581b32ee9a8093ef564

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-06-14 14:50:50,084 DEBUG: Analytics is enabled.
2023-06-14 14:50:50,108 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpdvdwsc_a']'
2023-06-14 14:50:50,109 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpdvdwsc_a']'

Environment information

Ubuntu 22.04.2 LTS
Python 3.10 (Mamba environment)
dvc 3.0.0
gdrive 0.1.5

Output of dvc doctor:

$ dvc doctor

DVC version: 3.0.0 (pip)
------------------------
Platform: Python 3.10.10 on Linux-5.19.0-43-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 1.11.0
	dvc_objects = 0.23.0
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.0.3
Supports:
	gdrive (pydrive2 = 1.15.4),
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3)
Config:
	Global: /home/vibflysleep/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: gdrive
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/239f50ebd7643581b32ee9a8093ef564

Additional Information (if any):


satyajitghana commented Jun 14, 2023

same

  File "/workspace/.pyenv_mirror/user/current/lib/python3.11/site-packages/dvc_data/hashfile/transfer.py", line 92, in _do_transfer
    dir_fails = _add(src, dest, bound_file_ids, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.pyenv_mirror/user/current/lib/python3.11/site-packages/dvc_data/hashfile/transfer.py", line 167, in _add
    dest.add(
  File "/workspace/.pyenv_mirror/user/current/lib/python3.11/site-packages/dvc_data/hashfile/db/__init__.py", line 113, in add
    transferred = super().add(
                  ^^^^^^^^^^^^
  File "/workspace/.pyenv_mirror/user/current/lib/python3.11/site-packages/dvc_objects/db.py", line 162, in add
    self._init(parts)
  File "/workspace/.pyenv_mirror/user/current/lib/python3.11/site-packages/dvc_objects/db.py", line 65, in _init
    self._dirs = {
                 ^
TypeError: 'NoneType' object is not iterable

Contributor

daavoo commented Jun 14, 2023

Can reproduce.


@daavoo daavoo self-assigned this Jun 15, 2023
Contributor

daavoo commented Jun 15, 2023

I am taking a look.

Introduced in iterative/dvc#9538 .

In all the other remote filesystems, we expect an error to be raised (and ignored) when checking for the /files/md5 dir while it doesn't exist yet:

https://github.com/iterative/dvc-objects/blame/7e07ccf42cf85f9fbea6b142a0609d796cba0539/src/dvc_objects/db.py#L62-L68

So the files/md5 dir gets created after the check.

However, in GDriveFileSystem, None is returned instead, causing the unexpected exception:

https://github.com/iterative/PyDrive2/blob/a5dc1d9a4da73f8b4172c1020726bd457fb62213/pydrive2/fs/spec.py#L436-L437
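A minimal, self-contained sketch of the failure mode described above (these are NOT the real dvc-objects or PyDrive2 functions, just hypothetical stand-ins for illustration):

```python
# Hypothetical illustration of the bug: the ODB init code expects a
# FileNotFoundError for a missing remote dir, not a None return value.

def ls_like_most_filesystems(path):
    # Most remote filesystems raise when the prefix doesn't exist yet...
    raise FileNotFoundError(path)

def ls_like_buggy_gdrive(path):
    # ...but the buggy GDriveFileSystem returned None for a missing dir.
    return None

def init_odb_dirs(ls, path="remote/files/md5"):
    # Mirrors the try/except pattern around the listing: a missing dir
    # is expected on a first push and treated as empty.
    try:
        entries = ls(path)
    except FileNotFoundError:
        entries = []
    return {e for e in entries}  # iterating None raises TypeError here

print(init_odb_dirs(ls_like_most_filesystems))  # set()
try:
    init_odb_dirs(ls_like_buggy_gdrive)
except TypeError as exc:
    print(exc)  # 'NoneType' object is not iterable
```

The None slips past the `except FileNotFoundError` and only blows up later in the set comprehension, which matches the `self._dirs = {` line in the traceback above.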


rnoxy commented Jun 15, 2023

I am pretty sure this issue is because of a wrong remote location.
In dvc 3.0 the location is remote:/files/md5.

I am also experiencing a similar issue with FileNotFound.
In my opinion, the reason is that

  1. dvc add + dvc push use the new files/md5 location
  2. dvc get + dvc pull do not :-(

@dberenbaum

To add an update here: you can track the progress in the linked issue iterative/PyDrive2#283; we hope to get it resolved by early next week.

@efiop efiop transferred this issue from iterative/dvc Jun 20, 2023
@satyajitghana

You can manually create the files/md5 folders in the remote. It should fix the error; it's just a one-time thing.


rnoxy commented Jun 21, 2023

@satyajitghana, I am afraid the issue affects dvc-s3 as well.

Contributor

efiop commented Jun 21, 2023

@rnoxy Please post full verbose log for the error you are getting.


rnoxy commented Jun 21, 2023

I have used the recent [email protected] (ubuntu-latest and macOS, installed with pip) to push the file to DVC storage (an S3 bucket).

I have observed that dvc add and dvc push created the file at
s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc

Here is some example log (unfortunately I cannot give more details):

❯ dvc --version                                                                                                     
3.1.0

❯ git checkout [email protected]  # tag registered with gto                                                                                                     
❯ dvc pull datasets/dataset.tar

Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5-dos2unix: ec656168164c57bb3ec3551e10b2f4cc
A       /Users/rno/repo/datasets/dataset.tar
1 file added
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/rno/repo/datasets/dataset.tar

Is your cache up to date?

So the "1 file added" output is not true :-(
The file was not actually added.

When I try to use dvc get I see a FileNotFoundError,

but I have double-checked that s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc
exists. I can copy it to my local machine with the aws s3 cp command.

Here are some logs from dvc get

❯ dvc get --verbose $REPO datasets/dataset.tar --rev "[email protected]" -o a.tar                                                               
2023-06-21 23:09:57,938 DEBUG: v3.1.0 (pip), CPython 3.9.13 on macOS-13.5-arm64-arm-64bit
2023-06-21 23:09:57,938 DEBUG: command: .../bin/dvc get --verbose [email protected]:XXXXX/XXXXX.git datasets/dataset.tar --rev [email protected] -o a.tar
2023-06-21 23:09:58,096 DEBUG: Creating external repo [email protected]:XXX/XXX.git@[email protected]
2023-06-21 23:09:58,096 DEBUG: erepo: git clone '[email protected]:XXX/XXX.git' to a temporary dir
2023-06-21 23:10:05,546 ERROR: unexpected error - [Errno 2] No storage files available: 'datasets/dataset.tar'

Traceback (most recent call last):
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/cli/__init__.py", line 209, in main
    ret = cmd.do_run()
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/cli/command.py", line 40, in do_run
    return self.run()
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/commands/get.py", line 26, in run
    return self._get_file_from_repo()
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/commands/get.py", line 33, in _get_file_from_repo
    Repo.get(
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/repo/get.py", line 52, in get
    fs.get(
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 637, in get
    return get_file(from_info, to_info)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 69, in func
    return wrapped(path1, path2, **kw)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_objects/fs/callbacks.py", line 41, in wrapped
    res = fn(*args, **kwargs)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 624, in get_file
    self.fs.get_file(rpath, lpath, **kwargs)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc/fs/dvc.py", line 395, in get_file
    return dvc_fs.get_file(dvc_path, lpath, **kwargs)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 550, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_data/fs.py", line 134, in get_file
    _, storage, fs, path = self._get_fs_path(rpath)
  File "/Users/rno/opt/miniconda3/envs/rgb-datasets/lib/python3.9/site-packages/dvc_data/fs.py", line 72, in _get_fs_path
    raise FileNotFoundError(
FileNotFoundError: [Errno 2] No storage files available: 'dataset/dataset.tar'

2023-06-21 23:10:05,589 DEBUG: Version info for developers:
DVC version: 3.1.0 (pip)
------------------------
Platform: Python 3.9.13 on macOS-13.5-arm64-arm-64bit
Subprojects:
	dvc_data = 2.0.2
	dvc_objects = 0.23.0
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.0.4
Supports:
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Config:
	Global: /Users/rno/Library/Application Support/dvc
	System: /Library/Application Support/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/1e9dd2ca1eca184a4f4c160d640c6145

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!


rnoxy commented Jun 21, 2023

@efiop ,

let me add that when I copied the file to the old location (v2 format)

aws s3 cp s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc

i.e. without files/md5, dvc pull worked without any issues and the file was downloaded.

I think dvc 3.1.0 is still broken:
add/push sends to the files/md5/ location, but get/pull expects the old-format location.


rnoxy commented Jun 21, 2023

@efiop, I am not 100% sure, but this is probably the problem (see the dvc-data package):
https://github.com/iterative/dvc-data/blob/main/src/dvc_data/hashfile/db/local.py#LL46C35-L46C35
The path is not consistent with files/md5/****


Contributor

pmrowla commented Jun 22, 2023

The files/md5 prefix is handled at the DVC level, not in dvc-objects. The 3.x ODB is rooted at /files/md5, not /, so the existing oid-to-path logic is correct.
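That layering can be sketched as follows (`oid_to_path` is an invented helper for illustration, not DVC's actual code): the ODB maps an oid to a path relative to its root, and DVC chooses which root the ODB is created with.

```python
# Illustrative sketch only: the oid-to-path logic stays the same for
# 2.x and 3.x data; only the root the ODB is given differs.

def oid_to_path(root: str, oid: str) -> str:
    # Content-addressed layout: the first two hex chars of the oid
    # become a subdirectory, the rest becomes the filename.
    return f"{root}/{oid[:2]}/{oid[2:]}"

oid = "ec656168164c57bb3ec3551e10b2f4cc"

# 2.x-era ODB: rooted at the remote URL itself.
print(oid_to_path("s3://bucket/remote", oid))
# -> s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc

# 3.x ODB: rooted at <remote>/files/md5, same mapping logic.
print(oid_to_path("s3://bucket/remote/files/md5", oid))
# -> s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc
```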

@dberenbaum dberenbaum added the p0-critical Handle immediately label Jun 22, 2023
Contributor

pmrowla commented Jun 22, 2023

@rnoxy can you share the contents of the .dvc file for your dataset in the tag you are trying to pull?

i.e. something along the lines of

git checkout [email protected]
cat datasets/dataset.tar.dvc

This error:

Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5-dos2unix: ec656168164c57bb3ec3551e10b2f4cc

indicates that the specified dataset was created in DVC 2.x (since it is reported as an md5-dos2unix hash). In this case, this file will not be pushed/pulled to the 3.0 cache location. It will only ever be pushed/pulled to the 2.x location (without files/md5).

It sounds like the problem may be that the specified file at that git tag was just not pushed to the remote in the first place.

Contributor

pmrowla commented Jun 22, 2023

I have observed that dvc add and dvc push created the file in
s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc

Can you also confirm whether or not the tag [email protected] points to the git commit made after this DVC 3.1 dvc add and dvc push? Or was the tag created pointing to a DVC 2.x commit before you re-added/pushed in DVC 3.1?

(This would also cause the behavior you are seeing here, where you have pushed the file to 3.0 cache, but the tag still points to the 2.x version of a .dvc file which was never pushed to 2.x cache)


rnoxy commented Jun 22, 2023

cat datasets/dataset.tar.dvc

outs:
- md5: ec656168164c57bb3ec3551e10b2f4cc
  size: 573440
  path: dataset.tar


rnoxy commented Jun 22, 2023

This error:

Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5-dos2unix: ec656168164c57bb3ec3551e10b2f4cc

indicates that the specified dataset was created in DVC 2.x (since it is reported as an md5-dos2unix hash). In this case, this file will not be pushed/pulled to the 3.0 cache location. It will only ever be pushed/pulled to the 2.x location (without files/md5).

No, the error points to a problem with the current dvc v3.1.0, which searches for the file
s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc
and not s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc, which dvc added before.

See my note that I copied the file from the files/md5/* location to /, so it was in the good location. After moving it to the deprecated path /, dvc v3.1 started loading the file, which means that dvc get/pull is still wrong.

Contributor

pmrowla commented Jun 23, 2023

cat datasets/dataset.tar.dvc

outs:
- md5: ec656168164c57bb3ec3551e10b2f4cc
  size: 573440
  path: dataset.tar

This is a DVC 2.x output, meaning DVC (even in 3.0) will push and pull it to the legacy remote location. DVC 3.x data has an additional hash: md5 field in the output section of the DVC file.

No, the error means the problem with current dvc v3.1.0 which searches for the file
s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc
and not the file s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc, which was added by dvc before.

It sounds like there is a bit of confusion over the deprecated behavior in 3.x. DVC 3.x still uses the legacy cache/remote location (without files/md5) for pre-existing data that was tracked in DVC 2.x. The new files/md5 location is only used for completely new data that has been added in 3.x, or for existing data that has actually been modified after you upgrade to 3.x.

As long as dataset.tar has not been modified since it was added in DVC 2.x, DVC will still use the legacy cache/remote location. In this case, that means DVC will try to pull it from s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc. It still looks to me like it was never pushed to the 2.x remote location in the first place, which is why it works after you manually copied the file into the 2.x remote location.
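The selection rule explained above can be sketched like this (illustrative only; `remote_path_for` and the dict-shaped output are assumptions, not DVC's real API — the point is that the presence of the 3.x `hash: md5` field in the .dvc output decides the layout):

```python
# Hypothetical sketch: a 2.x output (.dvc entry WITHOUT a "hash: md5"
# field) keeps using the legacy remote layout even under DVC 3.x; only
# outputs written or modified under 3.x use the files/md5 prefix.

def remote_path_for(remote_url: str, out: dict) -> str:
    oid = out["md5"]
    if out.get("hash") == "md5":           # output written by DVC 3.x
        root = f"{remote_url}/files/md5"
    else:                                  # legacy DVC 2.x output
        root = remote_url
    return f"{root}/{oid[:2]}/{oid[2:]}"

# rnoxy's .dvc file from this thread: no "hash" field -> 2.x output.
legacy_out = {"md5": "ec656168164c57bb3ec3551e10b2f4cc",
              "size": 573440, "path": "dataset.tar"}
# Same data re-added under 3.x would gain the "hash: md5" field.
new_out = dict(legacy_out, hash="md5")

print(remote_path_for("s3://bucket/remote", legacy_out))
# -> s3://bucket/remote/ec/656168164c57bb3ec3551e10b2f4cc  (where pull looks)
print(remote_path_for("s3://bucket/remote", new_out))
# -> s3://bucket/remote/files/md5/ec/656168164c57bb3ec3551e10b2f4cc
```

This is consistent with the behavior reported above: pull looked in the legacy location because the tagged .dvc file was a 2.x output, while the re-added data had been pushed to the new files/md5 location.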

@skshetry
Member

@antortjim, can you please try installing pydrive2==1.16 and see if it fixes the issue for you? Thanks.

@antortjim
Author

First of all, thank you all for looking into the issue.
@skshetry I checked, and I had pydrive2 1.15.4 installed. I ran the following:

pip install pydrive2==1.16
dvc push

I confirm I no longer get the error, and the files I added with dvc add + dvc commit appear in my remote Google Drive folder. So the bug seems to be solved on my side!
