
cloud versioning: imports #8789

Closed
2 tasks done
dberenbaum opened this issue Jan 10, 2023 · 17 comments · Fixed by #9246
Assignees
Labels
A: cloud-versioning Related to cloud-versioned remotes p2-medium Medium priority, should be done, but less important

Comments

@dberenbaum (Collaborator) commented Jan 10, 2023

  • For chained import-url imports, this will not work right now unless the file has been pushed to a DVC remote. Once regular cached objects are refactored to use the dvc-data index for push/pull (so everything is unified around the dvc-data index), this kind of chaining (from the original source location) should work out of the box. (import-url: use dvc-data index.save() for fetching imports #8249 (comment))
  • Add support to import/get from another repo where that repo's remote is cloud-versioned.
@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint A: cloud-versioning Related to cloud-versioned remotes labels Jan 10, 2023
@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p3-nice-to-have It should be done this or next sprint labels Jan 31, 2023
@dberenbaum (Collaborator, Author)

Discussed a use case for this with @mnrozhkov. They have two repos:

  1. Data registry: They have a cloud-versioned data registry generated by extracting frames from videos. It must be cloud-versioned because an external annotations tool needs to be able to access the data.
  2. Annotation and training pipeline: Downstream, they need to import this data into an annotation and training pipeline, and they don't want a duplicate copy of the data on cloud storage.

They will try to use import-url --version-aware for this, which should allow them to track and pull the correct versions of data into the annotation and training pipeline. We will see if this is enough before prioritizing this feature.

There is one aspect of import they miss by using import-url: they would like to be able to tag data versions in the data registry and connect those tags to the downstream annotation and training pipeline. AFAIK the only workaround is to tag the data versions during the import-url step in the annotation and training pipeline. This might be enough, depending on whether it's important to connect that back to the version of the data registry that generated the data.

@dberenbaum (Collaborator, Author)

This is estimated to take at least a sprint, so let's wait and see the demand for it. In the meantime, let's fail explicitly and report that it's not supported.

@pmrowla (Contributor) commented Feb 7, 2023

On the error handling, we probably just need to check whether the remote is version_aware/worktree during repo dependency collection:

odb = repo.cloud.get_remote_odb()
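A minimal sketch of what that check could look like, not DVC's actual implementation: `get_remote_odb` mirrors the call quoted above, while the function name, the `remote_name` parameter, and the exception raised are assumptions for illustration.

```python
# Hedged sketch: fail early during dependency collection if the source
# repo's remote is cloud-versioned. Attribute and function names here
# are illustrative, not DVC's real API surface.
def check_import_supported(repo, remote_name=None):
    """Raise if the repo's remote is version_aware/worktree."""
    odb = repo.cloud.get_remote_odb(remote_name)
    if getattr(odb, "version_aware", False) or getattr(odb, "worktree", False):
        raise NotImplementedError(
            "import/get from a repo with a cloud-versioned remote "
            "is not supported yet"
        )
    return odb
```

Raising during collection (rather than deep in the download path) would surface a clear "not supported" message instead of the opaque FileNotFoundError reported later in this thread.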

@dberenbaum (Collaborator, Author)

@daavoo You are assigned here. Do you plan to add the exception that import is not supported for cloud versioning?

@dberenbaum dberenbaum changed the title cloud versioning: chained imports cloud versioning: imports Mar 10, 2023
@dberenbaum (Collaborator, Author)

@pmrowla I'm lost on how to fix the error handling here, but a user ran into this on Discord. Would you be able to add the error handling 🙏?

@dberenbaum (Collaborator, Author)

@efiop Do you want to take this one based on #9182 (comment), or is it premature?

@efiop (Contributor) commented Mar 15, 2023

Yeah, taking this over. Support should land soon.

@efiop efiop self-assigned this Mar 15, 2023
efiop added a commit to efiop/dvc that referenced this issue Mar 15, 2023
If we are asked to open a potentially uninitialized repo, we need to make sure
it is not a bare git repo, so that we don't accidentally start putting dvc
files inside the git control dir.

Discovered while working on iterative#8789
efiop added a commit that referenced this issue Mar 16, 2023
@efiop (Contributor) commented Mar 21, 2023

For now, we have to consolidate the external_repo logic with Repo.open, so that all the local remote config magic that we have in external_repo also works in Repo.open and therefore with dvcfs.

efiop added a commit to efiop/dvc that referenced this issue Mar 29, 2023
efiop added a commit to efiop/dvc that referenced this issue Mar 29, 2023
efiop added a commit to efiop/dvc that referenced this issue Mar 29, 2023
efiop added a commit to efiop/dvc-data that referenced this issue Mar 30, 2023
Happens when we've backed out our data to data storage but we are not
sure if it is a directory or a file.

Related iterative/dvc#8789
efiop added a commit to efiop/dvc-data that referenced this issue Mar 31, 2023
efiop added a commit to efiop/dvc-data that referenced this issue Mar 31, 2023
efiop added a commit to iterative/dvc-data that referenced this issue Mar 31, 2023
efiop added a commit to efiop/dvc that referenced this issue Apr 1, 2023
efiop added a commit to efiop/dvc that referenced this issue Apr 1, 2023
efiop added a commit to efiop/dvc that referenced this issue Apr 1, 2023
Makes dvc get/import/etc use dvcfs, which already supports cloud versioning,
circular imports and other stuff.

Also makes dvc import behave more like dvc import-url, so that we can use
the same existing logic for fetching those using index instead of objects.

Fixes iterative#8789
Related iterative/studio#4782
efiop added a commit to efiop/dvc that referenced this issue Apr 2, 2023
efiop added a commit that referenced this issue Apr 2, 2023
@efiop (Contributor) commented Apr 2, 2023

I still owe a test for this though. Will add some in a moment...

@efiop efiop reopened this Apr 2, 2023
@efiop (Contributor) commented Apr 2, 2023

After taking another look, I don't think it is worth it, as it works through the same mechanisms that import/get/ls[-url] rely on. Will add explicit ones later on if there is need.

@efiop efiop closed this as completed Apr 2, 2023
@dberenbaum dberenbaum reopened this Jun 5, 2023
@dberenbaum (Collaborator, Author)

Reopening per slack thread in https://iterativeai.slack.com/archives/CB41NAL8H/p1685768408126399.

@efiop Do you recall if this was supposed to be closed and fully implemented? It looks that way, but I don't see any comments about it and can't remember the specifics.

It doesn't seem to be working, and I got the same error trying a few different versions, so I'm not sure if it's a regression or it never worked. Here's what I get (anyone should be able to reproduce this from an empty repo with credentials configured for the AWS sandbox):

$ dvc import -v [email protected]:iterative/example-registry-cloud-versioned.git data
2023-06-05 16:05:09,076 DEBUG: v3.0.0a2.dev11+g78637713c, CPython 3.10.10 on macOS-13.3.1-arm64-arm-64bit
2023-06-05 16:05:09,077 DEBUG: command: /Users/dave/miniforge3/envs/dvc/bin/dvc import -v [email protected]:iterative/example-registry-cloud-versioned.git data
2023-06-05 16:05:09,498 DEBUG: Removing output 'data' of stage: 'data.dvc'.
2023-06-05 16:05:09,499 DEBUG: Removing '/Users/dave/repo/data'
Importing 'data ([email protected]:iterative/example-registry-cloud-versioned.git)' -> 'data'
2023-06-05 16:05:09,500 DEBUG: Computed stage: 'data.dvc' md5: 'eed83ea4c6f2cdb4856c4f5e0edd17d5'
2023-06-05 16:05:09,500 DEBUG: 'md5' of stage: 'data.dvc' changed.
2023-06-05 16:05:09,501 DEBUG: Creating external repo [email protected]:iterative/example-registry-cloud-versioned.git@None
2023-06-05 16:05:09,501 DEBUG: erepo: git clone '[email protected]:iterative/example-registry-cloud-versioned.git' to a temporary dir
2023-06-05 16:05:14,332 ERROR: unexpected error
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
  File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
    return self.run()
  File "/Users/dave/Code/dvc/dvc/commands/imp.py", line 17, in run
    self.repo.imp(
  File "/Users/dave/Code/dvc/dvc/repo/imp.py", line 6, in imp
    return self.imp_url(path, out=out, fname=fname, erepo=erepo, frozen=True, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 64, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/scm_context.py", line 151, in run
    return method(repo, *args, **kw)
  File "/Users/dave/Code/dvc/dvc/repo/imp_url.py", line 86, in imp_url
    stage.run(jobs=jobs, no_download=no_download)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 586, in run
    self._sync_import(dry, force, kwargs.get("jobs", None), no_download)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 623, in _sync_import
    sync_import(self, dry, force, jobs, no_download)
  File "/Users/dave/Code/dvc/dvc/stage/imports.py", line 62, in sync_import
    stage.deps[0].download(
  File "/Users/dave/Code/dvc/dvc/dependency/base.py", line 53, in download
    fs_download(self.fs, self.fs_path, to, jobs=jobs)
  File "/Users/dave/Code/dvc/dvc/fs/__init__.py", line 56, in download
    fs.get(fs_path, to.fs_path, batch_size=jobs, callback=cb)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 660, in get
    list(executor.imap_unordered(get_file, from_infos, to_infos))
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 56, in imap_unordered
    yield fut.result()
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/callbacks.py", line 69, in func
    return wrapped(path1, path2, **kw)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/callbacks.py", line 41, in wrapped
    res = fn(*args, **kwargs)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 624, in get_file
    self.fs.get_file(rpath, lpath, **kwargs)
  File "/Users/dave/Code/dvc/dvc/fs/dvc.py", line 395, in get_file
    return dvc_fs.get_file(dvc_path, lpath, **kwargs)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 550, in get_file
    self.fs.get_file(from_info, to_info, callback=callback, **kwargs)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_data/fs.py", line 127, in get_file
    _, storage, fs, path = self._get_fs_path(rpath)
  File "/Users/dave/miniforge3/envs/dvc/lib/python3.10/site-packages/dvc_data/fs.py", line 69, in _get_fs_path
    raise FileNotFoundError
FileNotFoundError

2023-06-05 16:05:14,357 DEBUG: Version info for developers:
DVC version: 3.0.0a2.dev11+g78637713c
-------------------------------------
Platform: Python 3.10.10 on macOS-13.3.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.54.3
        dvc_objects = 0.22.0
        dvc_render = 0.3.1
        dvc_task = 0.2.1
        scmrepo = 1.0.3
Supports:
        azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.12.0),
        gdrive (pydrive2 = 1.15.3),
        gs (gcsfs = 2022.11.0),
        hdfs (fsspec = 2022.11.0, pyarrow = 11.0.0),
        http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2022.11.0, boto3 = 1.24.59),
        ssh (sshfs = 2023.4.1),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2022.11.0)
Config:
        Global: /Users/dave/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: local
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/6414c2ff59800d13bb7dc40946807396

@efiop (Contributor) commented Jun 7, 2023

@dberenbaum Yeah, should be working. Taking a look right now...

EDIT: can reproduce, taking a look...

@efiop efiop mentioned this issue Jun 7, 2023
efiop added a commit to efiop/dvc that referenced this issue Jun 7, 2023
@efiop (Contributor) commented Jun 7, 2023

So indeed there is a bug in s3fs: we try to use _ls_from_cache with a versioned path and just error out if it is not in the cache (and versioned paths are not supported by the dircache anyway). Will prepare a proper fix.
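The cache mismatch described here can be sketched in isolation. This is an illustrative toy, not s3fs's actual code: a listing cache keyed on plain paths cannot answer lookups for versioned paths, so a cache miss for a versioned path should fall through to a real request rather than raise FileNotFoundError. All names below (`split_version`, `DirCache`) are hypothetical.

```python
# Hedged sketch of the dircache-vs-versioned-path problem described
# above; names are illustrative, not s3fs internals.
def split_version(path):
    # "bucket/key?versionId=abc" -> ("bucket/key", "abc")
    key, sep, version = path.partition("?versionId=")
    return (key, version) if sep else (path, None)

class DirCache:
    def __init__(self):
        self._listings = {}  # parent dir -> list of entry dicts

    def store(self, parent, entries):
        self._listings[parent] = entries

    def lookup(self, path):
        key, version = split_version(path)
        if version is not None:
            # Versioned paths are never stored in the cache; returning
            # None signals the caller to issue a real HEAD/GET request
            # instead of concluding the file does not exist.
            return None
        parent, _, _name = key.rpartition("/")
        for entry in self._listings.get(parent, []):
            if entry["name"] == key:
                return entry
        return None
```

The buggy behavior would be treating the inevitable cache miss on a versioned path as "file not found" instead of bypassing the cache.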

@efiop (Contributor) commented Jun 7, 2023

@dberenbaum If you have time, feel free to give this a try:

pip install git+https://github.com/efiop/s3fs@info-exists-versioned

This fixes it for me locally and in tests. I will create a monkeypatch for dvc-s3 and a proper fix for s3fs tomorrow.

@dberenbaum (Collaborator, Author)

Works for me, thanks for the quick fix!

@efiop (Contributor) commented Jun 8, 2023

For the record: fsspec/s3fs#746

@efiop (Contributor) commented Jun 8, 2023

Closing this issue, since the last problem is just an s3fs bug and not in what we've implemented for cloud versioning and imports.

@efiop efiop closed this as completed Jun 8, 2023