option in dvc run to specify class or method within file as dependency #1572
Comments
Hello, @brbarkley! 👋

It may be possible to add "line" boundaries within a file, for example:

I'm afraid that adding constraints by classes (

I don't want to underestimate your project, but I think the problem could be addressed with a different design for your modules. For example, having the following file structure:

dataset_builder
├── __init__.py
├── __main__.py
├── first.py
└── second.py

# dataset_builder/__main__.py
import sys
from . import build

if __name__ == "__main__":
    source = sys.argv[1]
    build(source)

# dataset_builder/__init__.py
def build(source):
    if source == "first":
        from .first import make_dataset
    else:
        from .second import make_dataset
    make_dataset(source)

# dataset_builder/first.py
def make_dataset(source):
    print("Making first dataset")

# dataset_builder/second.py
def make_dataset(source):
    print("Making second dataset")

You can run

Thanks for your thoughtfulness 😛 let me know if it helps
@brbarkley thank you for the feature request! As @MrOutis mentioned, DVC is a language-agnostic tool and we should NOT analyse the semantics of dependencies directly. However, we got a few requests in the past for semantic tracking in databases, for scenarios like dumping a table if the number of rows has changed. We see an opportunity to generalize this semantic check to the dependencies case. So, I opened FR #1577 for the DVC plugins. Once it is implemented we can develop a small ad-hoc script

@MrOutis and @efiop, what are your thoughts on this solution?
Closing this issue in favor of #1577
@MrOutis @dmpetrov @efiop thanks! I think #1577 could offer a reasonable solution. I would also find a database check useful in addition to the specifics of my original FR (#1572). @MrOutis I will consider your suggestion re my file structure. It could offer greater flexibility in the near term. However, I would hesitate to build a semantic check based on a range of lines in a file, since the lines in which a method currently resides might not be where it resides in the future (as edits are made to the file, etc.).
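To make the line-range concern concrete, here is a minimal sketch of a check keyed on a definition's name rather than its position. It is not DVC functionality and not the plugin proposed in #1577; the helper name `definition_hash` is hypothetical, and it only looks at top-level classes and functions. It parses the file with Python's standard `ast` module and hashes the syntax tree of one definition, so moving that definition around in the file, or editing other classes, leaves the digest unchanged.

```python
import ast
import hashlib


def definition_hash(source: str, name: str) -> str:
    """Return a SHA-256 digest of one top-level class or function in `source`.

    Hypothetical helper for illustration only; not part of DVC.
    """
    tree = ast.parse(source)
    for node in tree.body:
        is_definition = isinstance(
            node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        )
        if is_definition and node.name == name:
            # ast.dump() leaves out line and column attributes by default,
            # so the digest does not depend on where the definition sits.
            dumped = ast.dump(node)
            return hashlib.sha256(dumped.encode("utf-8")).hexdigest()
    raise KeyError(f"no top-level definition named {name!r}")
```

One conceivable use, again as an assumption rather than an established workflow, would be to write the digest to a small file and declare that file as the stage dependency in place of the whole script. Note that comments and formatting are ignored by this approach, while docstring edits do change the digest.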
Thanks for your work on the dvc project. I've found it very useful!
I have a single file, make_dataset.py, that compiles data from disparate database sources and usually multiple views within each database. For each database, I bundle data extraction tasks from the various views as methods under a single class. This organizes my code base in a logical manner and allows the primary methods in each class to share common data cleaning operations that are particular to certain databases or views (the common data cleaning operations being housed in a single method at the top of the class). I can then call make_dataset.py from cmd specifying options of the datasource I want to build, e.g., python make_dataset.py --build_source1 or python make_dataset.py --build_source2. I feed these respective commands to dvc run specifying the relevant dependencies, one of them of course being make_dataset.py. However, the option --build_source1 does not depend on the entire contents of make_dataset.py, only the contents within the specific class or method it is referencing.

Feature request: Is it possible to add an option to dvc enabling a user to specify a class instance (or the like) within a file as a dependency instead of the entire contents of a file?
It seems this could make certain workflows more efficient. For example, if I run dvc status to determine if the output of python make_dataset.py --build_source1 needs to be rebuilt, dvc will tell me a rebuild is required based on content changes in make_dataset.py even if those content changes were specific only to the class associated with python make_dataset.py --build_source2.

I suppose I could unbundle all the methods and classes into separate files or simply tolerate the lack of specificity within my pipeline, but that seems less than ideal in my case. There, of course, could be complexities I am overlooking in implementing such an option or other solutions that I'm not considering. I appreciate your thoughts either way.
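For concreteness, the single-file layout described in this post might look roughly like the sketch below. Everything in it is illustrative: the class names `Source1` and `Source2`, the `_clean` and `build` methods, and the toy data are assumptions, and only the `--build_source1` and `--build_source2` options come from the post itself.

```python
import argparse


class Source1:
    """Hypothetical extraction tasks for the first database."""

    def _clean(self, rows):
        # Cleaning step shared by every view pulled from this database.
        return [row.strip() for row in rows]

    def build(self):
        return self._clean([" a ", " b "])


class Source2:
    """Hypothetical extraction tasks for the second database."""

    def _clean(self, rows):
        # A different cleaning step, specific to this database.
        return [row.upper() for row in rows]

    def build(self):
        return self._clean(["c", "d"])


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--build_source1", action="store_true")
    parser.add_argument("--build_source2", action="store_true")
    args = parser.parse_args(argv)

    built = {}
    if args.build_source1:
        built["source1"] = Source1().build()
    if args.build_source2:
        built["source2"] = Source2().build()
    return built
```

Calling `main(["--build_source1"])` only ever touches `Source1`, yet an edit to `Source2` still changes the contents of the file as a whole, which is the situation the feature request is about.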
Thanks again!