Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

import-url: use dvc-data index.save() for fetching imports #8249

Merged
merged 6 commits into from
Oct 19, 2022

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Sep 6, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • Implements cloud versioning support in import-url
    • When the imported URL contains a version ID, DVC will import that specific version (and the version ID will be capture din the .dvc file).
    • When the imported URL does not contain a version ID, DVC will not capture version information unless the --version-aware flag is provided (regardless of whether or not the bucket/container has versioning enabled).
    • When using --version-aware flag, DVC will capture the latest/current version of the imported file.
  • Implements support for fetching imports from source
    • DVC will first try to fetch/pull imports from a DVC remote
    • If there is no remote configured or the imported object failed to be fetched, DVC will fetch from the original source location
    • When fetching from source, DVC will verify that metadata (etag/versionid/etc) still matches the original import metadata. If the source has changed, DVC will error out with DataSourceChangedError (same as when import-url --no-download data source has changed)
    • Closes Pull imports from sourceΒ #8274
  • Fixes import-url/fetch: partial imports should only be fetchable in workspaceΒ #8435

@pmrowla pmrowla self-assigned this Sep 6, 2022
@pmrowla pmrowla added the A: data-sync Related to dvc get/fetch/import/pull/push label Sep 6, 2022
@pmrowla pmrowla force-pushed the import-url-version-2 branch 2 times, most recently from 1fb0279 to c0008ac Compare September 21, 2022 08:44
@pmrowla pmrowla force-pushed the import-url-version-2 branch 4 times, most recently from a112ab8 to 7220b8c Compare October 13, 2022 11:01
@@ -514,6 +515,23 @@ def index_key(self) -> Tuple[str, "DataIndexKey"]:
key = self.fs.path.parts(no_drive)[1:]
return workspace, key

def get_entry(self) -> "DataIndexEntry":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if this should be backwards: data index is getting filled up first and then output gets info from it. SInce we want to work directly with index during operations (e.g. add/checkout/etc), it means that it should be the source of truth that is used during serializing into dvcfiles. Obviously we could try to propagate the changes back into outputs (like we do right now in cloud versioning), but I wonder if that's actually a wrong way to approach it. WDYT?

@dberenbaum
Copy link
Collaborator

Some product questions:

  • If I have myazureremote configured with version_aware = true and do dvc import-url myazureremote://path, will it capture the version id?
  • If I do dvc import-url azure://container/path on a cloud-versioned container, will/can it capture the version id?
  • If not, can there be a flag to capture the version id without having to specify it in the path?

@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 19, 2022

Some product questions:

  • If I have myazureremote configured with version_aware = true and do dvc import-url myazureremote://path, will it capture the version id?

Yes

  • If I do dvc import-url azure://container/path on a cloud-versioned container, will/can it capture the version id?
  • If not, can there be a flag to capture the version id without having to specify it in the path?

By default it will not automatically capture the version ID on a versioned container when the URL does not contain a version. But this PR adds --version-aware flag to import-url. So you can get the desired behavior with:

dvc import-url --version-aware azure://container/path

The end result here would be that DVC imports the latest/current version of container/path (and the version ID would be captured in the .dvc file)

@pmrowla pmrowla changed the title [WIP] import-url: use dvc-data index.save() for fetching imports import-url: use dvc-data index.save() for fetching imports Oct 19, 2022
@pmrowla pmrowla marked this pull request as ready for review October 19, 2022 08:54
@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 19, 2022

@dberenbaum I think the last open question for this was whether to re-use --rev for dvc update or add a separate flag for version IDs (the current implementation just uses --rev)

@efiop efiop merged commit a0e45cd into iterative:main Oct 19, 2022
@pmrowla pmrowla deleted the import-url-version-2 branch October 20, 2022 05:48
@dberenbaum
Copy link
Collaborator

dberenbaum commented Oct 21, 2022

I seem to be missing something with pulling from source or else it's not working as expected. Here's a repro script:

echo foo > foo.txt
rm -rf repo
git init -q repo
cd repo
dvc init -q
dvc import-url -q ../foo.txt
rm foo.txt
rm -rf .dvc/cache
dvc pull foo.txt

Pull fails for me when running it.

Other comments:

  • Is there a docs issue for this?
  • It looks like it downloads again even when there is no update available. Can we avoid it?
  • For clarification, is it intended that the version ID is saved in both the version_id field and the query string of the path field?
  • Edit: should fetch/pull work for chained imports, like dvc import [email protected]:... data.csv where data.csv was the target of import-url and exists only in the source location, not the default remote for the upstream repo?

@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 24, 2022

That script works for me in 2.31.0. For the final command I get

$ dvc pull foo.txt
A       foo.txt
1 file added and 1 file fetched

There is no docs issue yet, I'll open a PR.

It looks like it downloads again even when there is no update available. Can we avoid it?

For this do you mean on dvc pull or on dvc update? I get Everything is up to date. on pull. For dvc update we can probably get around downloading when it hasn't changed, but that's a separate feature request I think (the existing behavior for dvc update was to just always download import-url files on update).

For chained import-url imports I think it will not work right now unless the file has been pushed to a DVC remote, but when we get to refactoring regular cached objects to use dvc-data index for push/pull (so everything is unified to use the dvc-data index) this kind of chaining (from original source location) should work out of the box.

@dberenbaum
Copy link
Collaborator

That script works for me in 2.31.0

πŸ€” Here's the verbose output from dvc pull foo.txt and dvc doctor:

2022-10-24 10:13:18,162 DEBUG: Checking if stage 'foo.txt' is in 'dvc.yaml'
2022-10-24 10:13:18,201 DEBUG: Checking if stage 'foo.txt' is in 'dvc.yaml'
2022-10-24 10:13:18,203 DEBUG: Removing '/private/tmp/repo/.8c5fuLs949e9tmkPwaUVem.tmp'
2022-10-24 10:13:18,203 DEBUG: Removing '/private/tmp/repo/.8c5fuLs949e9tmkPwaUVem.tmp'
2022-10-24 10:13:18,203 DEBUG: Removing '/private/tmp/repo/.dvc/cache/.No3WJeGd8scdm9Q9DW8BBT.tmp'
1 file failed
2022-10-24 10:13:18,204 ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/private/tmp/repo/foo.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/pull.py", line 46, in pull
    stats = self.checkout(
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 48, in wrapper
    return f(repo, *args, **kwargs)
  File "/Users/dave/Code/dvc/dvc/repo/checkout.py", line 113, in checkout
    raise CheckoutError(stats["failed"], stats)
dvc.exceptions.CheckoutError: Checkout failed for following targets:
/private/tmp/repo/foo.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
------------------------------------------------------------
2022-10-24 10:13:18,206 DEBUG: Analytics is disabled.
DVC version: 2.31.0
---------------------------------
Platform: Python 3.10.2 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.22.0
        dvc_objects = 0.11.0
        dvc_render = 0.0.12
        dvc_task = 0.1.4
        dvclive = 0.12.1
        scmrepo = 0.1.2
Supports:
        azure (adlfs = 2022.9.1, knack = 0.9.0, azure-identity = 1.7.1),
        gdrive (pydrive2 = 1.14.0),
        gs (gcsfs = 2022.10.0),
        hdfs (fsspec = 2022.10.0, pyarrow = 7.0.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2022.10.0, boto3 = 1.24.59),
        ssh (sshfs = 2022.6.0),
        webdav (webdav4 = 0.9.4),
        webdavs (webdav4 = 0.9.4),
        webhdfs (fsspec = 2022.10.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

I'll add a follow-up ticket to track the other issues.

@dberenbaum
Copy link
Collaborator

One more thing: --version-aware doesn't seem to do anything for directories. I would expect it to include the list of all files and the version_id for each inside the .dvc file.

If I do dvc import-url --version-aware s3://dave-sandbox-versioning/example-cloud-remote/data/, the data.dvc file looks like:

md5: 5e46b5ab89ac35532c82915752d2ddbe
frozen: true
deps:
- md5: 33ab906cdada8d288429ab9884d8e08b.dir
  size: 21
  nfiles: 3
  path: s3://dave-sandbox-versioning/example-cloud-remote/data/
outs:
- md5: 1fff39c58ffb2a046fb72f52eaff02db.dir
  size: 21
  nfiles: 3
  path: data

@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 25, 2022

Might be something off with your virtualenv? In a fresh venv on macos I get

+ echo foo
+ rm -rf repo
+ git init -q repo
+ pushd repo
~/git/scratch/tmp/repo ~/git/scratch/tmp
+ dvc init -q
+ dvc import-url -q ../foo.txt
+ rm foo.txt
+ rm -rf .dvc/cache
+ dvc pull -v foo.txt
2022-10-25 15:20:35,792 DEBUG: Checking if stage 'foo.txt' is in 'dvc.yaml'
2022-10-25 15:20:35,836 DEBUG: Checking if stage 'foo.txt' is in 'dvc.yaml'
2022-10-25 15:20:35,837 DEBUG: Removing '/Users/pmrowla/git/scratch/tmp/repo/.QvH6v6jMaktGCibXonekgW.tmp'
2022-10-25 15:20:35,838 DEBUG: Removing '/Users/pmrowla/git/scratch/tmp/repo/.QvH6v6jMaktGCibXonekgW.tmp'
2022-10-25 15:20:35,838 DEBUG: Removing '/Users/pmrowla/git/scratch/tmp/repo/.dvc/cache/.jYDWSKNNFVMrHBXt9hfp3B.tmp'
A       foo.txt
1 file added and 1 file fetched
2022-10-25 15:20:35,855 DEBUG: Analytics is enabled.
2022-10-25 15:20:35,883 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/var/folders/5q/_pk6g_s12dj2c4kp4mf_fnc40000gn/T/tmpzlyfxezn']'
2022-10-25 15:20:35,885 DEBUG: Spawned '['daemon', '-q', 'analytics', '/var/folders/5q/_pk6g_s12dj2c4kp4mf_fnc40000gn/T/tmpzlyfxezn']'
+ popd
~/git/scratch/tmp

$ dvc doctor
DVC version: 2.31.0 (pip)
---------------------------------
Platform: Python 3.10.7 on macOS-12.6-arm64-arm-64bit
Subprojects:
        dvc_data = 0.22.0
        dvc_objects = 0.11.0
        dvc_render = 0.0.12
        dvc_task = 0.1.4
        dvclive = 0.12.1
        scmrepo = 0.1.2
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3)

@dberenbaum
Copy link
Collaborator

Thanks, @pmrowla. I'm able to get it working on a clean environment.

What about the directory behavior?

@pmrowla
Copy link
Contributor Author

pmrowla commented Oct 26, 2022

The directory behavior looks like a bug, I'll look into it and open a separate issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-sync Related to dvc get/fetch/import/pull/push
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

import-url/fetch: partial imports should only be fetchable in workspace Pull imports from source
3 participants