`dvc repro --pull`: invalidates cache #10412
Are you able to provide a reproducible example? I'm not able to reproduce with https://github.com/iterative/example-get-started:

```
$ dvc repro --pull
WARNING: Failed to pull run cache: run-cache is not supported for http filesystem: https://remote.dvc.org/get-started
Importing 'get-started/data.xml (https://github.com/iterative/dataset-registry)' -> 'data/data.xml'
Stage 'prepare' is cached - skipping run, checking out outputs
Stage 'featurize' is cached - skipping run, checking out outputs
Stage 'train' is cached - skipping run, checking out outputs
Stage 'evaluate' is cached - skipping run, checking out outputs
Use `dvc push` to send your updates to remote storage.
```
I forgot to upload the vars:

- BLOB_URL: "<URL>"
- data-version.toml

```yaml
stages:
  download_full_parquet:
    desc: Download the exported parquet files
    params:
      - data/data-version.toml:
          - DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/full_output:
          persist: true
          cache: true
  download_dbg_parquet:
    desc: Download the exported dbg parquet files
    params:
      - data/data-version.toml:
          - DEBUG_DATA_DIR
    cmd: >2
      <script to download from $BLOB_URL>
    wdir: ..
    outs:
      - data/dbg_output:
          persist: true
          cache: true
  generate_full_json:
    desc: Generate the json intermediaries for the full dataset.
    cmd: python <script to generate json data> data/full_output data/full_json
    wdir: ../
    deps:
      - data/full_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/full_json:
          cache: true
          persist: false
  generate_dbg_json:
    desc: Generate the json intermediaries for the debug dataset.
    cmd: python <script to generate json data> data/dbg_output data/dbg_json
    wdir: ../
    deps:
      - data/dbg_output
      - data/cached
      - .hashes/code.hash
      - poetry.lock
    outs:
      - data/dbg_json:
          cache: true
          persist: false
  generate_full_hdf5:
    desc: Transform the raw json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/full_json data/full_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/full_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/full_wholesalers:
          cache: true
          persist: true
  generate_dbg_hdf5:
    desc: Transform the raw debug json files into hdf5 files the model can understand.
    cmd: python <script to transform json to hdf5> data/dbg_json data/dbg_wholesalers
    wdir: ../
    deps:
      - poetry.lock
      - .hashes/code.hash
      - data/dbg_json
      - data/fred/state_economics.json
      - data/fred/fred_signals_included.json
      - data/fred/national_economics.json
      - data/gas_price/output/state_gas_price.json
      - data/industry_trend/processed_trend.csv
      - data/state_region_mapping/state_region_mapping.json
    outs:
      - data/dbg_wholesalers:
          cache: true
          persist: true
```
I think the issue is perhaps due to persist=true? I assume on pull, persisted output stages are downloaded. I have seen that stages set to persist=true require that we pull the data first: if the output path content has changed and persist was set to true, dvc doesn't skip the stage, like in the case when I forget to run dvc checkout while switching branches.
Ah, you figured it out.
So is this by design? I assume that for
I think so, although I agree it's confusing. There are two ways that dvc can skip a stage:

1. The hashes recorded in dvc.lock match what is currently in the workspace.
2. The stage is found in the run-cache.

In your case, 1 fails before the data has been pulled. However, 2 succeeds for all your stages except for the persist one, since that one was never saved to the run-cache for the reasons mentioned above. I think what you want is
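To make the two skip paths concrete, here is a rough conceptual sketch in Python. This is not DVC's actual implementation; the function, the dict keys, and the hash bookkeeping are all made up purely to illustrate the decision described above (lock-file match vs. run-cache hit, with persisted outputs excluded from the run-cache):

```python
def should_skip(stage, workspace_hashes, lock_hashes, run_cache):
    """Decide whether a stage can be skipped (conceptual sketch, NOT DVC code).

    stage: dict with hypothetical keys 'name', 'persist', 'input_key'.
    workspace_hashes / lock_hashes: stage name -> output hash (or absent).
    run_cache: set of dependency-hash keys from previous successful runs.
    """
    name = stage["name"]
    # Path 1: the hashes recorded in dvc.lock match the workspace contents.
    lock = lock_hashes.get(name)
    if lock is not None and lock == workspace_hashes.get(name):
        return True
    # Path 2: a previous run with identical inputs exists in the run-cache.
    # Stages with persisted outputs are never saved to the run-cache,
    # so this path can never skip them.
    if not stage["persist"] and stage["input_key"] in run_cache:
        return True
    return False


# Before pulling, the workspace is empty, so path 1 fails for every stage;
# only the run-cache can skip them -- and it cannot skip the persisted one,
# which is why that stage reruns under `dvc repro --pull`.
persisted = {"name": "download_full_parquet", "persist": True, "input_key": "k1"}
cached = {"name": "generate_full_json", "persist": False, "input_key": "k2"}
run_cache = {"k1", "k2"}
print(should_skip(cached, {}, {"generate_full_json": "h"}, run_cache))      # True
print(should_skip(persisted, {}, {"download_full_parquet": "h"}, run_cache))  # False
```

After a `dvc pull`, the workspace hashes match dvc.lock again, so path 1 skips every stage, including the persisted one, which matches the behaviour reported in the issue.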
isn't

When it comes to the 2nd case: if the workspace is empty, then pull the data for the given hash key. If there is different data in the workspace, either rerun the stage or give an error saying the data is different, and we could provide a cmd flag or prompt input to force-pull the persisted stages. Similar to dvc pull when we have data that isn't tracked by dvc (like the times when we kill the process in the middle and have dvc changes that aren't tracked).
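The suggestion above could be pinned down with a small sketch. Everything here is hypothetical (this behaviour is being proposed, not something DVC implements), and the function and argument names are invented for illustration:

```python
def resolve_persisted_output(workspace_hash, remote_hash, force_pull=False):
    """Suggested handling of a persisted output under `repro --pull` (sketch).

    workspace_hash / remote_hash: content hashes, or None if absent.
    Returns the action to take: 'pull', 'skip', or 'error'.
    """
    if workspace_hash is None:
        # Workspace is empty: safe to pull the cached data for this hash key.
        return "pull"
    if workspace_hash == remote_hash:
        # Local data already matches what would be pulled; nothing to do.
        return "skip"
    if force_pull:
        # Explicit flag or prompt answer: overwrite the local data.
        return "pull"
    # Data differs and no override was given: rerun the stage or error out,
    # similar to how dvc pull refuses to clobber untracked local data.
    return "error"


print(resolve_persisted_output(None, "abc"))              # pull
print(resolve_persisted_output("abc", "abc"))             # skip
print(resolve_persisted_output("xyz", "abc"))             # error
print(resolve_persisted_output("xyz", "abc", True))       # pull
```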
Unfortunately, it is used to mean different things under different commands today. This is not the meaning for
This is what
Bug Report

`dvc repro --pull` invalidates cache; doing `dvc pull` and then `dvc repro` works.

Description

When I use `dvc repro --pull` on my dvc pipelines, it invalidates the cache for certain stages and runs them again. However, if we do `dvc pull` and then do `dvc repro`, it respects the cache and skips the stages. This is the behaviour I expect for `dvc repro --pull` as well. Also, for some reason the hash key of the stage outputs keeps changing for every run.

Reproduce

`dvc repro --pull`

Expected

`dvc repro --pull` must behave the same way as if we do `dvc pull` and then `dvc repro`.

Environment information

Output of `dvc doctor`:

Additional Information (if any):

`dvc repro --pull data/dvc.yaml -v`: repro_pull_verbose.log