dvc repro --pull: does not pull the data but rather throws an error #7420

Closed
francesco086 opened this issue Feb 23, 2022 · 5 comments

Comments

@francesco086

Bug Report

Description

dvc repro --pull does not pull the missing data; it fails with an error instead. If I first run dvc pull and then dvc repro, everything works fine.

Reproduce

  1. clone a project with this dvc.lock file (I omit the dvc.yaml file, as it can be inferred from it):
schema: '2.0'
stages:
  prepare:
    cmd: ccs-prepare-trip-and-carrier-files --d4s-carriers-input-file-path=../curated-data/carriers.parquet
      --itms-trips-input-file-path=../curated-data/itms_trips.parquet --carriers-output-file-path=prepare/carriers.parquet
      --trips-output-file-path=prepare/trips.parquet
    deps:
    - path: ../curated-data/carriers.parquet
      md5: f986f68adf848bd94417c7fb70ebb6c0
      size: 16649304
    - path: ../curated-data/itms_trips.parquet
      md5: 505169e3e9c22a4a343f91ed945ad4f0
      size: 99230362
    outs:
    - path: prepare
      md5: 83ff1c7bc858643cd19cbfd7fa00260f.dir
      size: 42710154
      nfiles: 2
  elaborate:
    cmd: mkdir elaborate; wc -l prepare/carriers.parquet > elaborate/wcl.txt; wc -l
      prepare/trips.parquet >> elaborate/wcl.txt
    deps:
    - path: prepare/carriers.parquet
      md5: 720d1597378b2b41b4bcdec723b0f0c5
      size: 4177444
    - path: prepare/trips.parquet
      md5: 4d14501ebead5d1ebdc342eac1bf7ea2
      size: 38532710
    outs:
    - path: elaborate
      md5: 875dceddee991845ac5f33608d5e42ab.dir
      size: 65
      nfiles: 1
  2. dvc repro --pull
  3. get this error:
Verifying data sources in stage: '../curated-data/carriers.parquet.dvc'
ERROR: failed to reproduce '../curated-data/carriers.parquet.dvc': missing data 'source': ../curated-data/carriers.parquet

Expected

Missing data source ../curated-data/carriers.parquet should be pulled (or at least attempted to be downloaded).

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.9.5 (brew)
---------------------------------
Platform: Python 3.9.10 on macOS-11.6.4-x86_64-i386-64bit
Supports:
	azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2022.1.0),
	webhdfs (fsspec = 2022.1.0),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
	ssh (sshfs = 2021.11.2),
	webdav (webdav4 = 0.9.4),
	webdavs (webdav4 = 0.9.4)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: azure
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information (if any):

As mentioned, if I run dvc pull and then dvc repro everything works fine.

@efiop
Contributor

efiop commented Feb 23, 2022

Looks like there is some confusion about the --pull option. It is meant to automatically download the outputs of stages if they are found in the run-cache, as explained in https://dvc.org/doc/command-reference/repro . So this is the expected behavior. What you are looking for is a new feature that would automatically run dvc pull for dependencies.
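
Until such a feature exists, the workaround reported in this issue (run dvc pull first, then dvc repro) can be wrapped in a small shell function. This is only a minimal sketch: repro_with_pull and the DVC_BIN override are hypothetical names introduced here for illustration and are not part of DVC.

```shell
# Hypothetical helper, not part of DVC: pull the tracked data first, then reproduce.
# DVC_BIN is an assumed override so the command sequence can be dry-run
# (for example with DVC_BIN=echo); it defaults to the real dvc executable.
repro_with_pull() {
    dvc_bin="${DVC_BIN:-dvc}"
    # Run repro only if pull succeeded; any arguments are passed on as repro targets.
    "$dvc_bin" pull && "$dvc_bin" repro "$@"
}
```

Calling repro_with_pull with no arguments runs dvc pull followed by dvc repro for the whole pipeline; calling it with a stage name such as elaborate passes that name to dvc repro as a target.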

@francesco086
Author

oh ok, my bad! I am closing the issue.

@DXist

DXist commented Sep 12, 2022

This feature could be useful. For now, I have to add a pull command as a pipeline stage.

@dberenbaum
Collaborator

@DXist There is an open feature request in #4742. Feel free to add your thoughts there, thanks!
