
Plugins for semantic changes tracking in dependencies #1577

Closed
dmpetrov opened this issue Feb 4, 2019 · 20 comments
Labels
enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint

Comments

@dmpetrov
Member

dmpetrov commented Feb 4, 2019

Problem

DVC reproduces a command if its dependencies have changed. Today we support many general types of dependencies:

  1. Files in major cloud storage like S3, GCS, SSH, and others, e.g. dvc run -d azure://path/to/my_blob train.py ...
  2. Local data files and code as dependencies, e.g. dvc run -d train.py -d images/ train.py ...

However, there is also a class of less general dependencies which cannot be validated by DVC.

Problem examples:

  1. Tables in a database. Usually, a custom query is needed to check whether the data/table/objects have changed.
  2. A semantic check in a local data or code file. For example (an option in dvc run to specify a class or method within a file as a dependency, #1572): check whether a method mycode() was changed in class MyClass in a Python file train.py.

Possible solution

A custom plugin (code) might be executed to check whether a dependency has changed. A plugin could be any command that returns 0 if repro is not needed.

Solution examples:

  1. Run a script check_db.sh to check whether a table has changed, and execute the DB dump script only if it has. Command example: dvc run -d db_dump.sh -p check_db.sh -o clients.csv db_dump.sh clients.csv. Note the new plugin option -p.
  2. dvc run -d train.py -p "python check_method_change.py MyClass.mycode change_timestamp" -d change_timestamp -o clients.csv train.py, where check_method_change.py checks the code for changes and, per the convention above, returns 0 when no repro is needed (see the sketch below).

UPDATE: Please note that the script check_method_change.py might still be our responsibility and we should implement it (probably outside of DVC core).
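
For illustration, here is a minimal sketch of what a plugin script like check_method_change.py from example 2 could look like, following the convention that exit code 0 means no repro is needed. The AST-hashing approach and the state-file handling are assumptions, not a spec:

    # check_method_change.py -- hypothetical plugin script (example 2).
    # Exits 0 when MyClass.mycode is unchanged (no repro needed), 1 otherwise.
    import ast
    import hashlib
    import sys

    def method_digest(path, class_name, method_name):
        # Hash the AST dump of the method, so whitespace/comment-only
        # edits do not count as changes.
        tree = ast.parse(open(path).read())
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and node.name == class_name:
                for item in node.body:
                    if isinstance(item, ast.FunctionDef) and item.name == method_name:
                        return hashlib.md5(ast.dump(item).encode()).hexdigest()
        sys.exit("method not found")

    if __name__ == "__main__":
        # usage: python check_method_change.py MyClass.mycode change_timestamp
        target, state_file = sys.argv[1], sys.argv[2]
        class_name, method_name = target.split(".")
        digest = method_digest("train.py", class_name, method_name)
        try:
            previous = open(state_file).read()
        except FileNotFoundError:
            previous = None
        with open(state_file, "w") as f:
            f.write(digest)
        sys.exit(0 if digest == previous else 1)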

@dmpetrov dmpetrov added the enhancement Enhances DVC label Feb 4, 2019
@dmpetrov dmpetrov changed the title Plugins for dependencies tracking Plugins for semantic changes tracking in dependencies Feb 4, 2019
@ghost

ghost commented Feb 5, 2019

I like it, @dmpetrov, especially for working with databases in a flexible way!

Maybe, instead of using the exit code, we could track the output (for example, psql -c "select count(1) from mytable") and re-run the command if the output changed (e.g. the count incremented from 999 to 1000). Note that psql -c could fail for various reasons (e.g. connectivity issues), and an exit code denoting such a failure would reproduce the stage, possibly causing unwanted effects.
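
To make the idea concrete, here is a rough sketch of treating a command's output as the dependency fingerprint. This is an illustration of the concept, not actual DVC internals; the helper name and the saved-output plumbing are assumptions:

    # Sketch: use the *output* of a check command as the dependency state,
    # and treat a failing command as an error rather than as a change.
    import subprocess

    def dependency_changed(cmd, saved_output):
        """Return True if the command's output differs from the saved one."""
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            # e.g. psql connectivity error: don't confuse failure with change
            raise RuntimeError("dependency check failed: " + result.stderr.strip())
        return result.stdout.strip() != saved_output

    # dependency_changed('psql -tAc "select count(1) from mytable"', "999")
    # would return True once the count reaches 1000; the new output could
    # then be stored in the stage file, much like a checksum.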

There are several dvc commands that verify whether a dependency changed (repro, status, checkout, etc.); if the check command takes time to run, it will slow down dvc in general.

I would prefer to sit on it and think about other solutions for supporting databases.

I may be short-sighted, but I'm not seeing any advantage to maintaining a feature like that besides the integration with databases 🙈


Another possible name could be dynamic dependencies.

@dmpetrov
Member Author

dmpetrov commented Feb 5, 2019

@MrOutis totally agree! The solution can benefit a lot from this flexibility: if we save the outputs in dvc files, it will save users from having to write additional status files.

On the one hand, integration with databases is a super important scenario. On the other hand, @MrOutis brought up a great point about repro, status, checkout. I can imagine that each of these commands will require a new option --no-semantic-dependency-checks. We should think carefully 🤔 before introducing this feature.

@fmannhardt

fmannhardt commented May 25, 2019

I am currently evaluating DVC for use in our ML workflow. Databases play a role, as we have images as input for which metadata needs to be stored. DVC works great for experimentation when adding a dataset directly (thanks a lot!), but in the end I want to store the data independently of DVC and without duplication.

I first thought of adding an S3 or GCP directory as an external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed not to be geared towards supporting directories (which are expensive to find changes in). At least, all my attempts failed, and the documentation only shows it for files.

I am new to DVC, but could the database problem be worked around by having a "stage" in which the result of a count query is saved to a local file tracked by DVC, and by somehow forcefully executing this stage even though the script has not changed? So, like the --force option, but only for a certain type of dependency.
As this script would be cheap to execute, it would not make much difference if nothing has changed upstream in the database. Does this make any sense?

@efiop
Contributor

efiop commented May 25, 2019

Hi @fmannhardt !

I first thought of adding an S3 or GCP directory as an external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed not to be geared towards supporting directories (which are expensive to find changes in). At least, all my attempts failed, and the documentation only shows it for files.

Directories on both S3 and GCP can be supported as external dependencies/outputs; we just haven't gotten to implementing the needed calls for those two types of remotes. For example, we already support SSH directories (#1654).

I am new to DVC, but could the database problem be worked around by having a "stage" in which the result of a count query is saved to a local file tracked by DVC, and by somehow forcefully executing this stage even though the script has not changed? So, like the --force option, but only for a certain type of dependency.
As this script would be cheap to execute, it would not make much difference if nothing has changed upstream in the database. Does this make any sense?

Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

@fmannhardt

Directories on both S3 and GCP can be supported as external dependencies/outputs; we just haven't gotten to implementing the needed calls for those two types of remotes. For example, we already support SSH directories (#1654).

Cool. Would be great to see this.

Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

The scenario is to have the images (or the image URIs) in a database, to be queried and used for different training sets. From my understanding, when I have a script querying the DB as a stage in a pipeline, DVC would keep track of changes to the SQL query and re-execute the stage when I change the query. But it would not re-execute when additional data (images) was added to the DB through some other (non-tracked) channel. How should it know without executing the query again?

What I thought of as a workaround is similar to what is proposed here: have a query provide some cheap metadata that can be tuned to the desired level of robustness; e.g. the total count of rows would be enough in an append-only DB. But differently from what I read here, this query would be executed in a standard DVC stage that writes the result to a file tracked by DVC as an output. Now, in case this output changed (detected with the standard MD5 mechanism), everything downstream would need to be re-run. Otherwise, everything is assumed to be up-to-date.

Of course, this should only be done upon request from the user, to keep results reproducible for previous executions of the pipeline. I saw the --force parameter, but this would re-run everything, and the --single-item parameter, but this would not run the remainder of the pipeline. Assume count_query.dvc is a cheap query to identify updates and experiment.dvc is the expensive training.
Maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

What I was proposing is to somehow automate this by marking count_query.dvc as a cheap operation that is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear. As I said, I am new to DVC, so maybe there are some mistakes in my line of thought.

@efiop
Contributor

efiop commented May 27, 2019

@fmannhardt Thanks for the explanation! 🙂

Maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

Yes, I think so.

What I was proposing is to somehow automate this by marking count_query.dvc as a cheap operation that is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear. As I said, I am new to DVC, so maybe there are some mistakes in my line of thought.

We have so-called "callback" stages that don't have any dependencies and therefore run every time you run dvc repro (e.g. dvc run -o foo 'echo foo > foo'). Maybe that would be suitable for your scenario? They don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?
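
For the database scenario discussed above, that could look something like the script below; the file names and the SQLite backend are assumptions for the sketch. Registering it with dvc run -o row_count.txt 'python save_count.py' would make it a callback stage, since it declares no dependencies:

    # save_count.py -- hypothetical callback-stage script: write a cheap DB
    # fingerprint to a file. The stage itself re-runs on every `dvc repro`,
    # but downstream stages re-run only if the file's checksum changes.
    import sqlite3

    conn = sqlite3.connect("images.db")  # assumed local metadata DB
    (count,) = conn.execute("SELECT count(*) FROM images").fetchone()
    conn.close()

    with open("row_count.txt", "w") as f:
        f.write(str(count))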

@fmannhardt

We have so-called "callback" stages that don't have any dependencies and therefore run every time you run dvc repro (e.g. dvc run -o foo 'echo foo > foo'). Maybe that would be suitable for your scenario? They don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?

I think this feature would do the trick. Thanks!

@efiop efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 30, 2019
@jorgeorpinel
Contributor

jorgeorpinel commented Dec 23, 2019

A user asked about this use case today on Discord. Specifically, it was about DVC understanding Python imports inside commands fed to dvc run: if a.py imports b.py (both being project source code, not libraries) and a.py is tracked by a stage file, but then only b.py changes, dvc repro does not recognize that it needs to rebuild the cache.

So besides implementing the plugins or middleware that Dmitry mentioned, what about out-of-the-box support for certain programming languages like Python, C++, etc.? In the case above, DVC would autodetect that a.py is a Python file, examine its import statements, and automatically register the imported files (those found in the workspace) as dependencies in the stage file.
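
For what it's worth, the import scan itself is not hard to prototype. Here is a simplified sketch (a hypothetical helper, not an existing DVC feature) that resolves plain imports to files in the workspace, ignoring relative imports and packages for brevity:

    # Find workspace files imported by a script, so they could be
    # registered as stage dependencies automatically.
    import ast
    import os

    def local_imports(path):
        tree = ast.parse(open(path).read())
        deps = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                candidate = name.replace(".", os.sep) + ".py"
                if os.path.exists(candidate):  # project file, not a library
                    deps.append(candidate)
        return deps

    print(local_imports("a.py"))  # e.g. ['b.py']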

@dmpetrov
Member Author

@jorgeorpinel yes, it is a bit different use case: Python file dependencies are not the same as the dependencies on Python functions from the initial message.

The file dependencies use case should be easier to implement, I guess. Package systems should already be able to do this kind of dependency check, and I hope their ideas (or code) can be reused in DVC.

@jorgeorpinel
Contributor

jorgeorpinel commented Dec 23, 2019

Yes, it's a bit different but related. I can open a separate issue if you prefer.

I'm not talking about packages or libraries, though; in that case you could kind of hack it now by having requirements.txt as a dependency, for example (in Python). I'm talking about inter-dependencies between source code files in the project, i.e. when your stage is spread across several source code files, but only one is executable and marked as a dvc run -d dependency. A solution is to just mark all the other files as dependencies, but there could potentially be many of these files, nested in recursive directory structures (e.g. when developing an ML library).

Also note I'm not just talking about Python code but about multiple languages. I guess Python would be the first obvious platform to include such a feature for, since our core code is also Python.

@anotherbugmaster
Contributor

anotherbugmaster commented Jan 27, 2020

Hi everyone. I think I have an idea about how to implement this for Python (and many other languages, actually):

We can manually compile the Python entry point like this:

python -m compileall -b script_to_run.py

and then add the resulting script_to_run.pyc as a dependency of subsequent scripts (the -b flag writes the .pyc next to the source instead of into __pycache__). The Python interpreter doesn't re-compile a .py file that hasn't changed, which is exactly what we need in this case.

This also works with C/C++: we just need to use the compiled artifact as a dependency.

In the case of databases, I think we could take advantage of information_schema.tables; AFAIK, it should contain information about the last update time. This brings us back to timestamps instead of hashing, but at least that's something (see the sketch at the end of this comment).

So all the DVC plugin would have to do is automatically compile the entry point and redirect the code dependencies to that binary. We could add some kind of flag, like --auto-dependencies, which would switch this behavior on.
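
As for the database part, here is a sketch of the information_schema lookup mentioned above (MySQL flavor; the connection details and the mysqlclient driver are assumptions):

    # Hypothetical check: read the table's last update time and compare it
    # against a previously stored value to decide whether to repro.
    import MySQLdb  # assumes the mysqlclient package

    conn = MySQLdb.connect(host="localhost", user="dvc", passwd="secret", db="mydb")
    cur = conn.cursor()
    cur.execute(
        "SELECT UPDATE_TIME FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s",
        ("mydb", "mytable"),
    )
    (update_time,) = cur.fetchone()
    conn.close()
    # Caveat: UPDATE_TIME can be NULL for some storage engines, so this is
    # a best-effort timestamp check, as noted above.
    print(update_time)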

@shcheklein
Member

@anotherbugmaster sounds like a good option to automatically detect all the changes in the dependencies recursively, and it would probably avoid rerunning stuff if I changed only a comment or whitespace in the script?

It's not a solution for the:

A semantic check in a local data or code file. For example (#1572): check if a method mycode() was changed in class MyClass in a Python file train.py.

as far as I can tell.

@anotherbugmaster
Contributor

Yeah, seems like I misunderstood the issue here.

The approach would be useful anyway, if only we had a way to split a source file up into symbols, which in turn would be hashed.
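
A rough sketch of that splitting step (a hypothetical helper, limited to top-level definitions for brevity):

    # Split a module into its top-level symbols and hash each one from its
    # AST dump; a stage could then depend on a single symbol's digest.
    import ast
    import hashlib

    def symbol_hashes(path):
        tree = ast.parse(open(path).read())
        return {
            node.name: hashlib.md5(ast.dump(node).encode()).hexdigest()
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        }

    # A (hypothetical) dependency like "train.py::MyClass" would then
    # trigger a repro only when that symbol's digest changes.
    print(symbol_hashes("train.py"))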

@dmpetrov
Member Author

dmpetrov commented Feb 2, 2020

The approach would be useful anyway, if only we had a way to split a source file up into symbols, which in turn would be hashed.

@anotherbugmaster for sure. It can be part of a solution.

@mdekstrand

Over in #2378, we are discussing a similar issue, and I have hacked up support for database status checking by monkey-patching a custom remote (with associated output and dependency support) into DVC: #2378 (comment)

As I've thought more on this issue, I've become increasingly persuaded that external dependencies with custom remote schemes are one of the more elegant ways to deal with this family of issues, in particular because they do not require adding any new syntax or concepts to DVC stage files; they just need the ability to dispatch URLs with a custom scheme to an appropriate class, function, or command.

@jtlz2

jtlz2 commented Aug 13, 2020

@dmpetrov Is there any update on how to use DVC to track a database, e.g. a MongoDB collection?

@efiop
Contributor

efiop commented Aug 15, 2020

@jtlz2 No updates for now 🙁

@jorgeorpinel
Contributor

Here's another case for this (I think) from a user on the forum: https://discuss.dvc.org/t/update-same-output-dir-in-different-stages/620

@jhrmnn

jhrmnn commented Mar 24, 2021

As part of my project, I wrote a function that hashes a given Python function based on its AST and, recursively, any global objects it references (including other functions it calls):

https://github.com/jhrmnn/mona/blob/master/src/mona/pyhash.py

It's published under MPL 2.0; maybe you could reuse it.

@jhrmnn

jhrmnn commented Mar 24, 2021

Alternatively, if there is interest, I could carve it out into a separate package.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion. You can continue the conversation there.
