out: persistent outputs require odb when they should not #6533

Closed
raylutz opened this issue Sep 4, 2021 · 4 comments
Labels
bug Did we break something? research

raylutz commented Sep 4, 2021

Bug Report

Description

I am trying to use DVC to handle changes to program build data (not ML data). We don't need the cache, and all data will be saved directly to S3. This is an unusual edge case, and the Output.odb property assumes that .repo.odb exists.

Reproduce

I have a single stage file:

stages:
    precheck:
        cmd:    python main.py -i JOB_WI_Ozaukee_20201103.csv --op precheck
        wdir:   C:\Users\raylu\Documents\Github\audit-engine
        deps:   
            - s3://us-east-1-audit-engine-election-data/US/WI/US_WI_Ozaukee_General_20201103/WI_Ozaukee_20201103_BIA_0.zip
            - s3://us-east-1-audit-engine-election-data/US/WI/US_WI_Ozaukee_General_20201103/WI_Ozaukee_20201103_BIA_1.zip
            - s3://us-east-1-audit-engine-election-data/US/WI/US_WI_Ozaukee_General_20201103/WI_Ozaukee_20201103_BIA_2.zip
        outs:
            - s3://us-east-1-audit-engine-jobs/US/WI/US_WI_Ozaukee_General_20201103/cache/all_archives_info.json:
                cache: false           
                persist: true
            - s3://us-east-1-audit-engine-jobs/US/WI/US_WI_Ozaukee_General_20201103/cache/bia_bif.csv:
                cache: false          
                persist: true
            - s3://us-east-1-audit-engine-jobs/US/WI/US_WI_Ozaukee_General_20201103/reports/precheck_report.md:
                cache: false          
                persist: true
            - s3://us-east-1-audit-engine-jobs/US/WI/US_WI_Ozaukee_General_20201103/reports/precheck_report.html:
                cache: false          
                persist: true
  1. The cmd can probably be just about anything, e.g. 'echo "initialized"'.

  2. Reference some s3: resources.

  3. Turn off caching.

    dvc init --no-scm

  4. dvc repro -R -v --force dvc

Expected

Here is the error output I am getting:

PS C:\Users\raylu\Documents\Github\audit-engine> dvc repro -R -v --force dvc
2021-09-03 17:59:12,933 ERROR: failed to reproduce 'dvc\precheck\dvc.yaml': 'NoneType' object has no attribute 'unprotect'
------------------------------------------------------------
Traceback (most recent call last):
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\reproduce.py", line 196, in _reproduce_stages
    ret = _reproduce_stage(stage, **kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\reproduce.py", line 39, in _reproduce_stage
    stage = stage.reproduce(**kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
    return call()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\stage\__init__.py", line 427, in reproduce
    self.run(**kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 45, in wrapper
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
    return call()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
    self.remove_outs(ignore_remove=False, force=False)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 45, in wrapper
    return deco(call, *dargs, **dkwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\stage\decorators.py", line 36, in rwlocked
    return call()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\funcy\decorators.py", line 66, in __call__
    return self._func(*self._args, **self._kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\stage\__init__.py", line 371, in remove_outs
    out.unprotect()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\output.py", line 836, in unprotect
    self.odb.unprotect(self.path_info)
AttributeError: 'NoneType' object has no attribute 'unprotect'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\main.py", line 55, in main
    ret = cmd.do_run()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\command\base.py", line 45, in do_run
    return self.run()
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\command\repro.py", line 12, in run
    stages = self.repo.reproduce(**self._repro_kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\scm_context.py", line 14, in run
    return method(repo, *args, **kw)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\reproduce.py", line 135, in reproduce
    return _reproduce_stages(self.index.graph, list(stages), **kwargs)
  File "c:\users\raylu\appdata\local\programs\python\python37\lib\site-packages\dvc\repo\reproduce.py", line 213, in _reproduce_stages
    raise ReproductionError(stage.relpath) from exc
dvc.exceptions.ReproductionError: failed to reproduce 'dvc\precheck\dvc.yaml'
------------------------------------------------------------
2021-09-03 17:59:12,955 DEBUG: Analytics is disabled.

Environment information

Output of dvc doctor:

PS C:\Users\raylu\Documents\Github\audit-engine> dvc doctor
DVC version: 2.6.4 (pip)
---------------------------------
Platform: Python 3.7.6 on Windows-10-10.0.19041-SP0
Supports:
        http (requests = 2.24.0),
        https (requests = 2.24.0),
        s3 (s3fs = 2021.6.1, boto3 = 1.14.11)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: NTFS on C:\
Repo: dvc (no_scm)

Additional Information (if any):

Contributor

pared commented Sep 9, 2021

@raylutz
I am able to reproduce this; it seems that persist is the culprit.

Workaround: setting up cache on s3 should let your pipeline run:

dvc remote add s3cache $S3/cache 
dvc config cache.s3 s3cache

Reproduction script:

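#!/bin/bash

cd "$TMPDIR"

# Exit on the first error and echo each command.
set -ex

# Start from a clean workspace.
rm -rf test_wspace
mkdir test_wspace
pushd test_wspace

echo data > data

# Placeholder: substitute the URL of your own test S3 bucket.
S3={TEST_S3_BUCKET}

# Upload the dependency.
aws s3 cp data "$S3/data"

mkdir test_repo
pushd test_repo

dvc init --no-scm

# External stage with a persistent, uncached S3 output; this triggers the error.
dvc run --external -d "$S3/data" --outs-persist-no-cache "$S3/out" -n copy aws s3 cp "$S3/data" "$S3/out"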

This issue seems to be related to an unprotect problem that one of our users has been having:
https://discord.com/channels/485586884165107732/485596304961962003/884521730733932544

EDIT:
The user mentioned above created a ticket: #6562

pared added the bug and research labels on Sep 9, 2021
pared changed the title from "repro: fails to initialize with no scm, no remote, no cache due to no odb when trying to unprotect" to "out: persistent outputs require odb when they should not" on Sep 9, 2021
Author

raylutz commented Sep 14, 2021

Thanks for the better name. I will tell you how I am now organizing this so I can avoid unnecessary transfers.

  1. I have one generated file that is marked persist: false, so that I can tell whether the stage build was successful.
  2. The other files are routinely checked to see whether they already exist on S3 and are identical. If the 'fragile' ETag calculation fails, it just forces an unnecessary upload.
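
For readers curious what such an ETag check might look like, here is a minimal sketch. It is not the code used in the project above; the function names `s3_style_etag` and `needs_upload` and the 8 MiB part size are assumptions for illustration. It relies on the commonly observed behavior that a single-part upload has an ETag equal to the MD5 hex digest of the content, while a multipart upload has the MD5 of the concatenated part digests followed by `-<number of parts>`. That behavior is not guaranteed (for example, some server-side encryption modes change it), which is why a mismatch should only ever force a redundant upload.

```python
import hashlib
from typing import Optional


def s3_style_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Guess the ETag S3 would report for `data` (hypothetical helper).

    Assumes a single-part upload when the data fits in one part, and a
    multipart upload with fixed-size parts otherwise.
    """
    if len(data) <= part_size:
        return hashlib.md5(data).hexdigest()
    # Multipart: hash each part, then hash the concatenated binary digests.
    digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return f"{combined}-{len(digests)}"


def needs_upload(data: bytes, remote_etag: Optional[str],
                 part_size: int = 8 * 1024 * 1024) -> bool:
    """Return True unless the remote ETag matches the local guess.

    `remote_etag` is None when the object does not exist. S3 returns ETags
    wrapped in double quotes, so they are stripped before comparing.
    """
    if remote_etag is None:
        return True
    return remote_etag.strip('"') != s3_style_etag(data, part_size)
```

The check fails safe: any disagreement between the guessed and reported ETag is treated as "changed", so the worst case is the unnecessary upload described above.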

Author

raylutz commented Dec 4, 2021

We have decided not to use DVC and have implemented our own similar functionality. Thanks for your time.

Contributor

efiop commented Dec 5, 2021

Please note that using direct S3 paths as outputs/dependencies is an experimental scenario (https://dvc.org/doc/user-guide/managing-external-data) that has many considerations. We will be developing it properly in the future (#3920).

That being said, the bug here should be fixed, as it results in unnecessary unprotect calls for local uncached outputs as well.
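
To illustrate the failure mode and one possible guard, here is a minimal sketch. The classes `Output` and `FakeOdb` are hypothetical stand-ins, not DVC's real code, and the guard shown is only an assumption about what a fix could look like: skip `unprotect` when an output is not cached or has no object database.

```python
class FakeOdb:
    """Hypothetical object database that just records unprotect calls."""

    def __init__(self):
        self.unprotected = []

    def unprotect(self, path):
        self.unprotected.append(path)


class Output:
    """Hypothetical stand-in for a stage output (not DVC's real class)."""

    def __init__(self, path, odb=None, use_cache=True, persist=False):
        self.path = path
        self.odb = odb              # None when no cache is configured
        self.use_cache = use_cache  # corresponds to `cache: false` in dvc.yaml
        self.persist = persist

    def unprotect_unguarded(self):
        # Mirrors the call at the bottom of the traceback: it assumes that
        # an odb always exists, so odb=None raises AttributeError.
        self.odb.unprotect(self.path)

    def unprotect_guarded(self):
        # Assumed fix: an uncached output has nothing to unprotect.
        if not self.use_cache or self.odb is None:
            return False
        self.odb.unprotect(self.path)
        return True
```

With `Output("s3://bucket/out", odb=None, use_cache=False, persist=True)`, the unguarded method raises the same `AttributeError` as in the report, while the guarded one returns without touching the cache. The same guard would also avoid the unnecessary calls for local uncached outputs.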
