Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

option in dvc run to specify class or method within file as dependency #1572

Closed
brbarkley opened this issue Feb 4, 2019 · 6 comments
Closed
Labels
feature request Requesting a new feature

Comments

@brbarkley
Copy link
Contributor

Thanks your work on the dvc project. I've found it very useful!

I have a single file make_dataset.py that compiles data from disparate database sources and usually multiple views within each database. For each database, I bundle data extraction tasks from the various views as methods under a single class. This organizes my code base in a logical manner and allows the primary methods in each class to share common data cleaning operations that are particular to certain databases or views (the common data cleaning operations being housed in a single method at the top of the class).

I can then call make_dataset.py from cmd specifying options of the datasource I want to build, e.g., python make_dataset.py --build_source1 or python make_dataset.py --build_source2. I feed these respective commands to dvc run specifying the relevant dependencies, one of them of course being make_dataset.py. However, the option --build_source1 does not depend on the entire contents of make_dataset.py, only the contents within the specific class or method it is referencing.

Feature request: Is it possible to add an option to dvc enabling a user to specify a class instance (or the like) within a file as a dependency instead of the entire contents of a file?

It seems this could make certain workflows more efficient. For example, if I want to dvc status to determine if the output of python make_dataset.py --build_source1 needs to be rebuilt, dvc will tell me a rebuild is required based on content changes in make_dataset.py even if those content changes were specific only to the class associated with python make_dataset.py --build_source2.

I suppose I could unbundle all the methods and classes into separate files or simply tolerate the lack of specificity within my pipeline, but that seems less than ideal in my case. There, of course, could be complexities I am overlooking in implementing such an option or other solutions that I'm not considering. I appreciate your thoughts either way.

Thanks again!

@ghost ghost added the feature request Requesting a new feature label Feb 4, 2019
@ghost
Copy link

ghost commented Feb 4, 2019

Hello, @brbarkley ! 👋

It may be possible to add "line" boundries within a file, for example:
dvc run -d file.py:10-20 (or a similar syntax) to express that you only want to depend on that range (10th to 20th line) of file.py.

I'm afraid that adding constraints by classes (dvc run -d file.py:MyClass) would break compatibility with other languages (or at least, favor the use of Python);
DVC tries to be language-agnostic, SCM agnostic (not relying only on Git), even Storage agnostic (use S3, SSH, or whatever protocol you want).

I don't want to underestimate your project, but I think the problem could be addressed with a different design on your modules.

For example, having the following file structure:

dataset_builder
├── __init__.py
├── __main__.py
├── first.py
└── second.py
# dataset_builder/__main__.py
import sys
from . import build

if __name__ == "__main__":
    source = sys.argv[1]

    build(source)
# dataset_builder/__init__.py
def build(source):
    if source == "first":
        from .first import make_dataset
    else:
        from .second import make_dataset

    make_dataset(source)
# dataset_builder/first.py
def make_dataset(source):
    print("Making first dataset")
# dataset_builder/second.py
def make_dataset(source):
    print("Making second dataset")

You can run dvc run -d dataset_builder/first.py 'python -m dataset_builder first'


Thanks for your thoughtfulness 😛 let me know if it helps

@dmpetrov
Copy link
Member

dmpetrov commented Feb 4, 2019

@brbarkley thank you for the feature request!

As @MrOutis mentioned - DVC is a language-agnostic tool and we should NOT analyse semantic of dependencies directly.

However, we got a few requests in the past for semantic tracking in databases for scenarios like dump a table if the number of rows was changed.

We see an opportunity to generalize this semantic check for dependencies case. So, I opened FR #1577 for the DVC plugins. Once it is implemented we can develop a small ad-hoc script check_method_change.py (see #1577) and close the current issue.

@MrOutis and @efiop what are your thoughts on this solution?

@ghost
Copy link

ghost commented Feb 5, 2019

Closing this issue in favor of #1577

@ghost ghost closed this as completed Feb 5, 2019
@dmpetrov
Copy link
Member

dmpetrov commented Feb 5, 2019

@MrOutis please note that the script check_method_change.py might be still our responsibility (outside of DVC). And it is purely related to this FR, not #1577.

@efiop
Copy link
Contributor

efiop commented Feb 5, 2019

@dmpetrov we can add it as a ticbox in #1577, no need to keep a separate issue for now.

@brbarkley
Copy link
Contributor Author

@MrOutis @dmpetrov @efiop thanks!

I think #1577 could offer a reasonable solution. I would also find a database check useful in addition to the specifics of my original FR (#1572).

@MrOutis I will consider your suggestion re my file structure. Could offer greater flexibility in the near term. However, I would hesitate to build a semantic check based on a range of lines in a file since the lines in which a current method resides might not always be where it resides in the future (as edits are made to the file, etc.).

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature
Projects
None yet
Development

No branches or pull requests

3 participants