
Plugins for semantic changes tracking in dependencies #1577

Closed
dmpetrov opened this issue Feb 4, 2019 · 20 comments
Labels
enhancement Enhances DVC p3-nice-to-have It should be done this or next sprint

Comments

@dmpetrov
Member

dmpetrov commented Feb 4, 2019

Problem

DVC reproduces a command if its dependencies have changed. Today we support many general types of dependencies:

  1. Files in major cloud storage like S3, GCS, SSH, and others, e.g. dvc run -d azure://path/to/my_blob train.py ...
  2. Local data files and code as dependencies, e.g. dvc run -d train.py -d images/ train.py ...

However, there is also a class of less general dependencies which cannot be validated by DVC.

Problem examples:

  1. Tables in a database. Usually, a custom query is needed to check whether the data/table/objects have changed.
  2. A semantic check in a local data or code file. For example (an option in dvc run to specify a class or method within a file as a dependency, #1572): check whether a method mycode() was changed in class MyClass in a Python file train.py.

Possible solution

A custom plugin (code) might be executed to check whether a dependency has changed. A plugin could be any command that returns 0 if repro is not needed.

Solution examples:

  1. Run a script check_db.sh to check whether a table has changed, and execute the DB dump script only if it has. Command example: dvc run -d db_dump.sh -p check_db.sh -o clients.csv db_dump.sh clients.csv. Note the new plugin option -p.
  2. dvc run -d train.py -p "python check_method_change.py MyClass.mycode change_timestamp" -d change_timestamp -o clients.csv train.py, where check_method_change.py checks the code for changes and, per the convention above, returns 0 when no repro is needed (see the sketch below).

UPDATE: Please note that the script check_method_change.py might still be our responsibility and we should implement it (probably outside of DVC core).
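
For illustration, here is a minimal sketch of what a plugin script like check_method_change.py from example 2 could look like, following the convention that exit code 0 means no repro is needed. The AST-hashing approach and the state-file handling are assumptions, not a spec:

    # check_method_change.py -- hypothetical plugin script (example 2).
    # Exits 0 when MyClass.mycode is unchanged (no repro needed), 1 otherwise.
    import ast
    import hashlib
    import sys

    def method_digest(path, class_name, method_name):
        # Hash the AST dump of the method, so whitespace/comment-only
        # edits do not count as changes.
        tree = ast.parse(open(path).read())
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef) and node.name == class_name:
                for item in node.body:
                    if isinstance(item, ast.FunctionDef) and item.name == method_name:
                        return hashlib.md5(ast.dump(item).encode()).hexdigest()
        sys.exit("method not found")

    if __name__ == "__main__":
        # usage: python check_method_change.py MyClass.mycode change_timestamp
        target, state_file = sys.argv[1], sys.argv[2]
        class_name, method_name = target.split(".")
        digest = method_digest("train.py", class_name, method_name)
        try:
            previous = open(state_file).read()
        except FileNotFoundError:
            previous = None
        with open(state_file, "w") as f:
            f.write(digest)
        sys.exit(0 if digest == previous else 1)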

@dmpetrov dmpetrov added the enhancement Enhances DVC label Feb 4, 2019
@dmpetrov dmpetrov changed the title Plugins for dependencies tracking Plugins for semantic changes tracking in dependencies Feb 4, 2019
@ghost

ghost commented Feb 5, 2019

I like it, @dmpetrov, especially for working with databases in a flexible way!

Maybe, instead of using the exit code, we could track the output (for example, psql -c "select count(1) from mytable") and re-run the command if the output changed (e.g. the count incremented from 999 to 1000). Note that psql -c could fail for various reasons (e.g. connectivity issues), and an exit code denoting such a failure would reproduce the stage, possibly causing unwanted effects.
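
To make the idea concrete, here is a rough sketch of treating a command's output as the dependency fingerprint. This is an illustration of the concept, not actual DVC internals; the helper name and the saved-output plumbing are assumptions:

    # Sketch: use the *output* of a check command as the dependency state,
    # and treat a failing command as an error rather than as a change.
    import subprocess

    def dependency_changed(cmd, saved_output):
        """Return True if the command's output differs from the saved one."""
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode != 0:
            # e.g. psql connectivity error: don't confuse failure with change
            raise RuntimeError("dependency check failed: " + result.stderr.strip())
        return result.stdout.strip() != saved_output

    # dependency_changed('psql -tAc "select count(1) from mytable"', "999")
    # would return True once the count reaches 1000; the new output could
    # then be stored in the stage file, much like a checksum.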

There are several dvc commands that verify whether a dependency changed (repro, status, checkout, etc.); if the check command takes time to run, it will slow down dvc in general.

I would prefer to sit on it and think about other solutions for supporting databases.

I may be short-sighted, but I'm not seeing any advantage to maintaining a feature like that besides the integration with databases 🙈


Another possible name could be dynamic dependencies.

@dmpetrov
Member Author

dmpetrov commented Feb 5, 2019

@MrOutis totally agree! The solution can benefit a lot from this flexibility: if we save the outputs in dvc files, it will save users from having to write additional status files.

On the one hand, integration with databases is a super important scenario. On the other hand, @MrOutis brought up a great point about repro, status, checkout. I can imagine that each of these commands will require a new option --no-semantic-dependency-checks. We should think carefully 🤔 before introducing this feature.

@fmannhardt

fmannhardt commented May 25, 2019

I am currently evaluating DVC for use in our ML workflow. Databases play a role, as we have images as input for which metadata needs to be stored. DVC works great for experimentation when adding a dataset directly (thanks a lot!), but in the end I want to store the data independently of DVC and without duplication.

I first thought of adding an S3 or GCP directory as an external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed not to be geared towards supporting directories (which are expensive to find changes in). At least, all my attempts failed, and the documentation only shows it for files.

I am new to DVC, but could the database problem be worked around by having a "stage" in which the result of a count query is saved to a local file tracked by DVC, and by somehow forcefully executing this stage even though the script has not changed? So, like the --force option, but only for a certain type of dependency.
As this script would be cheap to execute, it would not make much difference if nothing has changed upstream in the database. Does this make any sense?

@efiop
Contributor

efiop commented May 25, 2019

Hi @fmannhardt !

I first thought of adding an S3 or GCP directory as an external dependency (https://dvc.org/doc/user-guide/external-dependencies), but it seemed not to be geared towards supporting directories (which are expensive to find changes in). At least, all my attempts failed, and the documentation only shows it for files.

Directories on both S3 and GCP can be supported as external dependencies/outputs; we just haven't gotten to implementing the needed calls for those two types of remotes. For example, we already support SSH directories (#1654).

I am new to DVC, but could the database problem be worked around by having a "stage" in which the result of a count query is saved to a local file tracked by DVC, and by somehow forcefully executing this stage even though the script has not changed? So, like the --force option, but only for a certain type of dependency.
As this script would be cheap to execute, it would not make much difference if nothing has changed upstream in the database. Does this make any sense?

Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

@fmannhardt

Directories on both S3 and GCP can be supported as external dependencies/outputs; we just haven't gotten to implementing the needed calls for those two types of remotes. For example, we already support SSH directories (#1654).

Cool. Would be great to see this.

Sorry, I don't quite understand your scenario and your proposed solution. Could you please elaborate?

The scenario is to have the images (or the image URIs) in a database, to be queried and used for different training sets. From my understanding, when I have a script querying the DB as a stage in a pipeline, DVC would keep track of changes to the SQL query and re-execute the stage when I change the query. But it would not re-execute when additional data (images) was added to the DB through some other (non-tracked) channel. How should it know without executing the query again?

What I thought of as a workaround is similar to what is proposed here: have a query provide some cheap metadata that can be tuned to the desired level of robustness; e.g. the total count of rows would be enough in an append-only DB. But differently from what I read here, this query would be executed in a standard DVC stage that writes the result to a file tracked by DVC as an output. Now, in case this output changed (detected with the standard MD5 mechanism), everything downstream would need to be re-run. Otherwise, everything is assumed to be up-to-date.

Of course, this should only be done upon request from the user, to keep results reproducible for previous executions of the pipeline. I saw the --force parameter, but this would re-run everything, and the --single-item parameter, but this would not run the remainder of the pipeline. Assume count_query.dvc is a cheap query to identify updates and experiment.dvc is the expensive training.
Maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

What I was proposing is to somehow automate this by marking count_query.dvc as a cheap operation that is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear. As I said, I am new to DVC, so maybe there are some mistakes in my line of thought.

@efiop
Contributor

efiop commented May 27, 2019

@fmannhardt Thanks for the explanation! 🙂

Maybe a workaround would be to have dvc repro --force --single-item count_query.dvc followed by dvc repro experiment.dvc?

Yes, I think so.

What I was proposing is to somehow automate this by marking count_query.dvc as a cheap operation that is always re-run when dvc repro is run with some kind of force-update-cheap-operations flag. I hope this makes it clear. As I said, I am new to DVC, so maybe there are some mistakes in my line of thought.

We have so-called "callback" stages that don't have any dependencies and therefore run every time you run dvc repro (e.g. dvc run -o foo 'echo foo > foo'). Maybe that would be suitable for your scenario? They don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?
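
For the database scenario discussed above, that could look something like the script below; the file names and the SQLite backend are assumptions for the sketch. Registering it with dvc run -o row_count.txt 'python save_count.py' would make it a callback stage, since it declares no dependencies:

    # save_count.py -- hypothetical callback-stage script: write a cheap DB
    # fingerprint to a file. The stage itself re-runs on every `dvc repro`,
    # but downstream stages re-run only if the file's checksum changes.
    import sqlite3

    conn = sqlite3.connect("images.db")  # assumed local metadata DB
    (count,) = conn.execute("SELECT count(*) FROM images").fetchone()
    conn.close()

    with open("row_count.txt", "w") as f:
        f.write(str(count))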

@fmannhardt

We have so-called "callback" stages that don't have any dependencies and therefore run every time you run dvc repro (e.g. dvc run -o foo 'echo foo > foo'). Maybe that would be suitable for your scenario? They don't have any special option to turn them on and off, but if their execution is cheap, maybe it would be ok to run them every time?

I think this feature would do the trick. Thanks!

@efiop efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 30, 2019
@jorgeorpinel
Contributor

jorgeorpinel commented Dec 23, 2019

A user asked about this use case today on Discord. Specifically, it was about DVC understanding Python imports inside commands fed to dvc run: if a.py imports b.py (both being project source code, not libraries) and a.py is tracked by a stage file, but then only b.py changes, dvc repro does not recognize that it needs to rebuild the cache.

So besides implementing the plugins or middleware that Dmitry mentioned, what about out-of-the-box support for certain programming languages like Python, C++, etc.? In the case above, DVC would autodetect that a.py is a Python file, examine its import statements, and automatically register the imported files (those found in the workspace) as dependencies in the stage file.
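
For what it's worth, the import scan itself is not hard to prototype. Here is a simplified sketch (a hypothetical helper, not an existing DVC feature) that resolves plain imports to files in the workspace, ignoring relative imports and packages for brevity:

    # Find workspace files imported by a script, so they could be
    # registered as stage dependencies automatically.
    import ast
    import os

    def local_imports(path):
        tree = ast.parse(open(path).read())
        deps = []
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                continue
            for name in names:
                candidate = name.replace(".", os.sep) + ".py"
                if os.path.exists(candidate):  # project file, not a library
                    deps.append(candidate)
        return deps

    print(local_imports("a.py"))  # e.g. ['b.py']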

@dmpetrov
Member Author

@jorgeorpinel yes, it is a bit different use case: Python file dependencies are not the same as the dependencies on Python functions from the initial message.

The file dependencies use case should be easier to implement, I guess. Package systems should already be able to do this kind of dependency check, and I hope their ideas (or code) can be reused in DVC.

@jorgeorpinel
Contributor

jorgeorpinel commented Dec 23, 2019

Yes, it's a bit different but related. I can open a separate issue if you prefer.

I'm not talking about packages or libraries, though; in that case you could kind of hack it now by having requirements.txt as a dependency, for example (in Python). I'm talking about inter-dependencies between source code files in the project, i.e. when your stage is spread across several source code files, but only one is executable and marked as a dvc run -d dependency. A solution is to just mark all the other files as dependencies, but there could potentially be many of these files, nested in recursive directory structures (e.g. when developing an ML library).

Also note I'm not just talking about Python code but about multiple languages. I guess Python would be the first obvious platform to include such a feature for, since our core code is also Python.

@anotherbugmaster
Contributor

anotherbugmaster commented Jan 27, 2020

Hi everyone. I think I have an idea about how to implement this for Python (and many other languages, actually):

We can manually compile the Python entry point like this:

python -m compileall -b script_to_run.py

and then add the resulting script_to_run.pyc as a dependency of subsequent scripts (the -b flag writes the .pyc next to the source instead of into __pycache__). The Python interpreter doesn't re-compile a .py file that hasn't changed, which is exactly what we need in this case.

This also works with C/C++: we just need to use the compiled artifact as a dependency.

In the case of databases, I think we could take advantage of information_schema.tables; AFAIK, it should contain information about the last update time. This brings us back to timestamps instead of hashing, but at least that's something (see the sketch at the end of this comment).

So all the DVC plugin would have to do is automatically compile the entry point and redirect the code dependencies to that binary. We could add some kind of flag, like --auto-dependencies, which would switch this behavior on.
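
As for the database part, here is a sketch of the information_schema lookup mentioned above (MySQL flavor; the connection details and the mysqlclient driver are assumptions):

    # Hypothetical check: read the table's last update time and compare it
    # against a previously stored value to decide whether to repro.
    import MySQLdb  # assumes the mysqlclient package

    conn = MySQLdb.connect(host="localhost", user="dvc", passwd="secret", db="mydb")
    cur = conn.cursor()
    cur.execute(
        "SELECT UPDATE_TIME FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name = %s",
        ("mydb", "mytable"),
    )
    (update_time,) = cur.fetchone()
    conn.close()
    # Caveat: UPDATE_TIME can be NULL for some storage engines, so this is
    # a best-effort timestamp check, as noted above.
    print(update_time)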

@shcheklein
Member

@anotherbugmaster sounds like a good option to automatically detect all the changes in the dependencies recursively, and it would probably avoid rerunning stuff if I changed only a comment or whitespace in the script?

It's not a solution for the:

A semantic check in a local data or code file. For example (#1572): check if a method mycode() was changed in class MyClass in a Python file train.py.

as far as I can tell.

@anotherbugmaster
Contributor

Yeah, seems like I misunderstood the issue here.

The approach would be useful anyway, if only we had a way to split a source file up into symbols, which in turn would be hashed.
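
A rough sketch of that splitting step (a hypothetical helper, limited to top-level definitions for brevity):

    # Split a module into its top-level symbols and hash each one from its
    # AST dump; a stage could then depend on a single symbol's digest.
    import ast
    import hashlib

    def symbol_hashes(path):
        tree = ast.parse(open(path).read())
        return {
            node.name: hashlib.md5(ast.dump(node).encode()).hexdigest()
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        }

    # A (hypothetical) dependency like "train.py::MyClass" would then
    # trigger a repro only when that symbol's digest changes.
    print(symbol_hashes("train.py"))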

@dmpetrov
Member Author

dmpetrov commented Feb 2, 2020

The approach would be useful anyway, if only we had a way to split a source file up into symbols, which in turn would be hashed.

@anotherbugmaster for sure. It can be part of a solution.

@mdekstrand

Over in #2378, we are discussing a similar issue, and I have hacked up support for database status checking by monkey-patching a custom remote (with associated output and dependency support) into DVC: #2378 (comment)

As I've thought more on this issue, I've become increasingly persuaded that external dependencies with custom remote schemes are one of the more elegant ways to deal with this family of issues, in particular because they do not require adding any new syntax or concepts to DVC stage files; they just need the ability to dispatch URLs with a custom scheme to an appropriate class, function, or command.

@jtlz2

jtlz2 commented Aug 13, 2020

@dmpetrov Is there any update on how to use DVC to track a database, e.g. a MongoDB collection?

@efiop
Contributor

efiop commented Aug 15, 2020

@jtlz2 No updates for now 🙁

@jorgeorpinel
Contributor

Here's another case for this (I think) from a user on the forum: https://discuss.dvc.org/t/update-same-output-dir-in-different-stages/620

@jhrmnn

jhrmnn commented Mar 24, 2021

As part of my project, I wrote a function that hashes a given Python function based on its AST and, recursively, any global objects it references (including other functions it calls):

https://github.com/jhrmnn/mona/blob/master/src/mona/pyhash.py

It's published under MPL 2.0; maybe you could reuse it.

@jhrmnn

jhrmnn commented Mar 24, 2021

Alternatively, if there is interest, I could carve it out into a separate package.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion. You can continue the conversation there.
