Support function specific dependencies #3439
Comments
Ok, so the first obstacle I see is that we are trying to make DVC language-agnostic, so the specification of how to define this |
We'd need specific per-language support (based on file extension) to extract the source code of functions/classes and md5-hash it. Python would be the first obvious language to support, but in general we'd need one plugin/parser per language. |
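A minimal sketch of what such a Python plugin could do (an illustration with a hypothetical helper name, not an agreed design): import the module, extract one function's source, and md5 it.

```python
import hashlib
import importlib
import inspect

def function_md5(module_name, func_name):
    """md5 of one function's source text, e.g. function_md5("utils", "check_db")."""
    module = importlib.import_module(module_name)  # note: executes module top-level code
    source = inspect.getsource(getattr(module, func_name))
    return hashlib.md5(source.encode()).hexdigest()
```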
An alternative to this is to go all-Python. We've seen examples where companies use the DVC core API to manipulate stage files from Python. It usually means that users do not use |
Not sure whether going "all Python" would be a different implementation than supporting other languages. We'd probably still have to check file extensions/types and print out errors if it's a binary file, etc. |
Note: this may be a duplicate of #1577. @casperdcl I think Ivan means that an alternative for @drorata is to integrate with DVC using the Python package (pip installation) instead of using |
Well, we could support a user-defined external filter tool (similar to Git filters). |
I agree with @casperdcl here. I don't think we should be maintaining a whole bunch of tools for this. Maybe we can write just one, for Python, and distribute it on PyPI to "set the tone" for others to provide similar tools with the same arguments (and possibly command-name prefixes). Users of other languages will want to distribute this as a package in their own language. And as long as it's in the PATH and has the same signature, we're fine with it. Edit: the same argument spec and command prefix aren't mandatory, but they sure would be nice for user experience. |
If anyone in the world wants one of these for JavaScript, I can write it; it should be 50 lines or less using a pre-existing parser. |
Are you sure you really want such detailed dependency management, @drorata? Will you remember to update the dependency when somebody comes along and, for instance, splits the function into two?

```python
def foo(bar):
    return [ba + 1 for ba in bar]
```

to

```python
def foo(bar):
    return [fo(ba) for ba in bar]

def fo(ba):
    return ba + 1
```

Do you have an example of "this solution conflicts with good practices of coding" to make it a little less abstract? |
@elgehelge good point, but I think it's fine for
I see this as part of DVC's "Makefiles-but-better" bit, where there's sub-file granularity. The question is: does this feature make bad practice so easy for end-users to slip into that DVC should stay away from it? I don't think so. If nothing else, the end-user
btw I assume
CC @efiop |
@elgehelge Good point indeed but still, I believe you can face many cases where this conflict between good coding practices and dependencies can happen. For example, assume that stages 1 and 3 need to fetch data from S3 and stages 2 and 3 need to fetch data from MongoDB. Now, you have two utility functions that take care of fetching the data from the two different sources. From the codebase's structure point of view, I would argue that it makes sense to keep these two utility functions inside the same |
It is a very good idea to unify it at the command level. So, DVC can support dependencies for any language, or tables in databases (#1577). There is also an opportunity to go further in the unification direction and make it general for other types of "conditional" dependencies, such as hyper-parameters from a params file (#3393), with Python dependencies as an especially important use case (as @shcheklein mentioned). The unification can look something like this:
I don't see any issues so far (I just don't like the term
Right.
Very good point, @elgehelge! I don't think it makes sense to go too deep into function code and all the dependencies (it's okay to trigger a stage if code was refactored without any semantic changes, as in your example). Also, for simplicity, we can agree to ignore function dependency changes (so, no checks for |
I would recommend against a
Should be:
|
Type `python` is syntactic sugar. So I agree, it is better not to start with it and to implement it down the road if needed. |
In addition... some external automation tools (like hyperparameter automation/visualization, CD4ML, DVC itself - |
So we're down to 2 options:

```yaml
type: cmd # or potentially other things
cmd: custom_filter
```

or

```yaml
filter: custom_filter
```

I still prefer the latter (and as a minor point also prefer the word "filter", since that's what
So really it seems like there's only one option:

```yaml
type: file # optional, assumed `file` by default
cmd: custom_filter # optional
```
|
We need to specify how the checks are made and what the requirements are for the user's commands/
The requirement for the user code/
Also, I don't quite understand the analogy with Git filters. Is it about |
I was intending it to always be run (unconditionally), but good point that this may be inefficient. Maybe we need a hash both before and after:

```yaml
path: utils.py
md5: 1q2w3e4r
cmd: custom_filter # if `md5` changed, will be run to produce `cmd_md5`
cmd_md5: 7u8i9o0p
```

or:

```yaml
path: utils.py
path_md5: 1q2w3e4r
cmd: custom_filter # if `path_md5` changed, will be run to produce `md5`
md5: 7u8i9o0p
```
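A rough sketch of the second variant's check logic (hypothetical helper, not DVC code): the raw-file hash gates whether the possibly expensive filter is re-run at all.

```python
import hashlib
import subprocess

def check_dep(dep):
    """`dep` is a dict with the path/path_md5/cmd/md5 keys from the YAML above."""
    raw = open(dep["path"], "rb").read()
    path_md5 = hashlib.md5(raw).hexdigest()
    if path_md5 == dep["path_md5"]:
        return False  # raw file untouched, so the filtered hash can't have changed
    filtered = subprocess.run(
        dep["cmd"], shell=True, input=raw, stdout=subprocess.PIPE, check=True
    ).stdout
    new_md5 = hashlib.md5(filtered).hexdigest()
    changed = new_md5 != dep["md5"]
    dep.update(path_md5=path_md5, md5=new_md5)  # record the fresh hashes
    return changed
```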
I disagree with both options. I'd expect a modification of option 2:
I refer to e.g. |
There might be some misunderstanding... The whole point of specific dependencies (this issue, as well as the hyper-param issue #3393) is not to run a stage when the user's code ( |
Kinds of...
These are not options. These are 2 steps that you need to run to check if |
@dmpetrov definite misunderstanding here. In the issue here I'm referring to
This is completely and utterly different from the usual (current) definition of

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4 # <-- from `cat path | md5sum`
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84 # <-- from `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8 # <-- optional, from `cat path | md5sum`,
                                             #     used to determine if we should bother re-running `filter`
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl
```

Where "user filter" is equivalent to
I also prefer to do without |
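For concreteness, a hypothetical `extract_function.py` along these lines (read source on stdin, print the requested function's source to stdout, so it composes as `cat utils.py | python extract_function.py --name check_db | md5sum`) could be:

```python
import argparse
import ast
import sys

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True, help="function to extract")
    args = parser.parse_args()
    source = sys.stdin.read()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef) and node.name == args.name:
            print(ast.get_source_segment(source, node))  # Python 3.8+
            return
    sys.exit(f"function {args.name!r} not found")

if __name__ == "__main__":
    main()
```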
No, the user never runs
Yes, exactly what
About the confusion, I think you were probably talking about something completely different, along the lines of simplifying:

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py
deps:
- md5: 53184cb9a5fcf12de2460a1545ea50d4
  path: train_model.py
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
outs:
- md5: 47b3781954c760e1bd19ac9e56e141b1
  path: model/mymodel.pkl
...
md5: 34984b7d96f970d23bb38cda7d7b3f1a
cmd: cat utils.py | python extract_function.py --name check_db > /tmp/__check_db
deps:
- md5: 500f914dd06cf1f588b8e281c8fb3dd8
  path: utils.py
outs:
- md5: 809776eaeef854443d39e4f76bd64c84
  path: /tmp/__check_db
```
|
@casperdcl thank you for the clarification. It looks like we have less misunderstanding than I thought. And now I understand where your filters come from - filtering out a part of the source code file.
Anyway, we need an output from a user-defined function (a good opportunity to introduce the term UDF to DVC 😀). The misunderstanding was: should we calculate md5 from the output or not. If we do md5:
Any of these approaches can work just fine. I'm good with md5. And we can easily extend this later if needed. One more addition: in any of these approaches, DVC has to check the command's return code, as I mentioned before - no need to run the md5 check if |
Totally agree. The check/
So, it looks like the only disagreement here is the name. I got your analogy with filtering function code from a code file, but the code check is not the only option here. Even for source code dependency checks, it is not easy to imagine that someone will be filtering actual code lines. I'm sure that even for simple checks we will be dealing with bytecode or different forms of abstract syntax trees (ASTs). I guess this is why I didn't get the filtering analogy, and I'm not sure other users will understand the analogy. I suggest using |
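To illustrate the AST point (a hypothetical sketch, not a proposed implementation): hashing a function's AST dump instead of its raw text makes the checksum insensitive to comment and formatting changes, while still reacting to semantic edits.

```python
import ast
import hashlib

def ast_md5(path, func_name):
    """md5 of the normalized AST of `func_name` inside `path`."""
    tree = ast.parse(open(path).read())
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == func_name:
            # ast.dump() omits line/column info, so whitespace and comments don't matter
            return hashlib.md5(ast.dump(node).encode()).hexdigest()
    raise KeyError(f"{func_name} not found in {path}")
```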
There is one more subject to discuss - do we need the checksums from user-defined functions in the dvc-file, or can we store them in the DVC DB ( |
Implementation details
If someone does not know about the DVC DB: there is an analogy between the current FS timestamp optimizations and this one. And we can generalize this kind of optimization, which should improve the codebase. If you try to imagine how a dvc-file would look without the DVC DB, you will see something like:
That means that DVC needs to check the file timestamp, and if it does not match, it should calculate a new checksum and save it with the timestamp in the dvc-file. The idea of moving the checksum to the DVC DB seems elegant to me. I don't see any concerns for now. @iterative/engineering could you please take a look at this approach? Is there something that I missed? |
When you say
Do you mean the hash of the UDF/
In which case, do we need to store checksums for filters at all? I thought we'd agreed (for now) that they should be run unconditionally, so I don't see what said checksums could be used for. |
Yes.
🤔 when you run the UDF (unconditionally) and it returns a checksum, how do you decide if the dependency changed and the stage needs to be re-run (if you have only the actual dependency checksum but not the UDF checksum)? Let's take one step back. I don't think we can use the UDF checksum as the file checksum ( |
@casperdcl before we jump to cross-system reproducibility, let's make sure we are on the same page... It looks like you missed the UDF checksum from the file (
EDITED: I expected it to have something like

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
  udf_hash: d5964e5fb17c88bbed0d8
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
```

EDITED 2: the last checksum |
The stage's checksum is the UDF's output's checksum. We run the UDF unconditionally and anything that depends on it will "see" a different checksum (and thus a "changed stage") only if the checksum has changed. Am I missing something super basic about DVC's use of checksums? Still don't see a need for |
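A tiny sketch of that rule (hypothetical helper, not DVC code): re-run the filter unconditionally; the dependency counts as changed iff the filtered output's md5 differs from the one recorded in the dvc-file.

```python
import hashlib
import subprocess

def udf_changed(path, filter_cmd, recorded_md5):
    out = subprocess.run(
        filter_cmd, shell=True, input=open(path, "rb").read(),
        stdout=subprocess.PIPE, check=True,
    ).stdout  # equivalent to `cat path | filter`
    return hashlib.md5(out).hexdigest() != recorded_md5
```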
I got it. I misinterpreted the phrase "How about appearing twice?". You suggest keeping both of the checksums, but in different places in the dvc-file. Yeah, it is the same as

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # <-- user filter
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
```

or

```yaml
md5: fd75c91bb60baf9bf36e4f64f0c1c7e0
cmd: python train_model.py # <-- user cmd
deps:
- md5: fd2b169580906cd21350d47b041461b4 # `cat path | md5sum`
  path: utils.py
  udf: python extract_function.py --name check_db # <-- user filter
  udf_hash: 809776eaeef854443d39e4f76bd64c84 # == `cat path | filter | md5sum`
```

Ok, it does not matter at this point. |
Let's return to this cross-system checksum question. First, storing checksums in a local DVC DB does not affect reproducibility across systems. However, it affects performance a bit - in a newly cloned repo you have to rerun every stage "protected" by a UDF, because these udf-checksums are not transferable from machine to machine. In exchange, we improve dvc-file readability by getting rid of these UDF checksums. Second, once the build/run-cache is implemented (#1871 (comment), originally here: #1234), we can transfer the udf-checksums through the cache (and storages) without over-polluting dvc-files. The biggest concern that I have today regarding the udf-checksums in the DVC DB is how well it fits the current DVC architecture. |
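A minimal sketch of the DVC DB idea as discussed (the file location and helper names are hypothetical): a local key-value store mapping (file checksum, udf command) to the udf output checksum, so dvc-files never carry the udf hashes.

```python
import json
import os

DB_PATH = ".dvc/tmp/udf-db.json"  # hypothetical location

def udf_md5_cached(file_md5, udf_cmd, compute_udf_md5):
    """Look up a udf checksum locally; compute and store it on a miss."""
    db = json.load(open(DB_PATH)) if os.path.exists(DB_PATH) else {}
    key = file_md5 + ":" + udf_cmd
    if key not in db:
        db[key] = compute_udf_md5()  # this is the "warm-up" run on a fresh clone
        os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)
        with open(DB_PATH, "w") as f:
            json.dump(db, f)
    return db[key]
```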
Yes, that's what I meant; should have been more precise. I also think my suggestion (splitting into
That's a bit of an issue. I presume going down the route of adding a
This takes us all the way back to my first suggestion (albeit now we fully understand it? 🙂) #3439 (comment)

```yaml
- md5: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5_orig: 500f914dd06cf1f588b8e281c8fb3dd8 # <-- optional, from `md5sum $path`,
                                             #     used to determine if we should bother re-running `$filter`
```

which needs to be tweaked to:

```yaml
- md5_filter: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
```

or even:

```yaml
- filter_hash: 'def check_db():\n pass' # `cat $path | filter`
  filter_type: raw # default `md5`. Using `raw` would skip the md5 sum,
                   # (e.g. leave a user-readable time stamp)
  path: utils.py
  filter: python extract_function.py --name check_db # user filter
  md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
```
|
Right, but it needs to be inverted. The key has to be the file checksum, not the UDF's.

```yaml
- md5: 500f914dd06cf1f588b8e281c8fb3dd8 # `md5sum $path`
  path: utils.py
  udf: python extract_function.py --name check_db # user filter
  udf_md5: 809776eaeef854443d39e4f76bd64c84 # `cat $path | $filter | md5sum`
```
|
Yes that's what I said :) EDIT: Ah I only swapped the key but not the comment. So not quite what I said, but what I meant to say. So... fine for me to be assigned unless someone else wants to tackle this. |
@casperdcl one of the important questions left is where we store the checksum: dvc-file, DVC DB, or run/build-cache (which is not implemented yet). Initially, we can just save it in the dvc-file. But I'd love to hear @iterative/engineering's opinion. |
I'd prefer the DVC-file for now. The analogy between
Not sure if this has been discussed, but when you say run/build cache, do you mean having different classes of cache (e.g. not synced with a remote, manually syncable, etc.)? |
I don't see any issues with the DVC DB except the warm-up run (the first mandatory run after cloning a repo). If we can get rid of checksums in dvc-files completely - even better. The run/build cache is about caching run/file checksums globally and storing them in the remote. Today, DVC can make the repro/not-repro decision based only on the last run (not runs a few commits back in Git history). Also, this is a way of moving (some of) the checksums away from dvc-files and making them more human-readable. The most recent discussion related to the run/build-cache: #1871 (comment) |
I think some users would care about the warm-up. Run/build cache sounds great. Different issue though. |
My 2 cents:
Looks cool overall. |
A few questions to consider/answer:
|
Had a private chat with @dmpetrov. Summary: we both agree that neither
The biggest issue with |
Suggestions (in order of preference):
I think this is also a ranking by word familiarity to non-programmers, so really the best all round. |
Chiming in because I get notifications from this post - "filter" has a very specific meaning to many data scientists, particular to relational algebra. It looks like you are referring to a broader set of functions being encompassed in |
"modifier"? |
The API has changed a bit now in v1:

```yaml
deps:
- md5: 409a438ee6ac70fc1f911631412325ea
  path: utils.py
```

is now simply

```yaml
deps:
- utils.py
```

This complicates things a bit in terms of how to save the "filter" command, seeing as we now only have a deps list rather than a dict. Can't really think of anything better than having some sort of "virtual" dep, i.e. a dep which is a command to execute and read stdout. Either:

```yaml
deps:
- 'cmd:python extract_function.py --name check_db' # `cmd:` prefix signifies UDF/filter
```

or:

```yaml
deps: []
deps_cmd:
- python extract_function.py --name check_db
```

Or we can go with supporting a mix of string & dict types (maybe easiest?):

```yaml
deps:
- some_other_file.py
- path: utils.py
  udf: python extract_function.py --name check_db
```
|
Good point re dvc 1.0, @casperdcl
This looks like the best option so far. |
People will start to use it to make external things, like DB state, a dependency. |
After discussion with @efiop:

```yaml
path: utils.py
udf: python extract_function.py --name check_db
```

should mean either
Also, a better spec seems to be:

```yaml
path: utils.py
udf:
  cmd: python extract_function.py --name check_db
  #md5: s0m3h45h # in dvc.lock but not dvc.yaml
```

or even (a similar approach to params):

```yaml
# in dvc.yaml
utils.py:
  cmd: python extract_function.py --name check_db
```

```yaml
# in dvc.lock
path: utils.py
cmd: python extract_function.py --name check_db
md5: s0m3h45h
```
|
I'm interested in this issue, so forgive me for asking if it is still under consideration. For me, the use case would be to filter for the |
Going to +1 this as well. I typically like to organize projects around larger modules with a CLI, and defining dependencies inside them would be helpful. Syntax like this seems to follow similarly to the

```yaml
my_stage:
  deps:
  - mymodule.py:
    - filter: python -c "import inspect, mymodule; print(inspect.getsource(mymodule.myfunc))"
  - the-plan.csv: # I only want to pay attention to a single row of this CSV
    - filter: python -c "import pandas as pd; print(pd.read_csv('the-plan.csv').iloc[3])"
```

So, |
Assume we have 4 stages:

- `0` and `2` depend on the function `foo`
- `1` and `3` depend on the function `bar`

In order to reduce code duplication we want to put the utilities `foo` and `bar` in one place; let them be in `utils.py`. So, `utils.py` becomes a dependency of all four stages.

Now... after some time we realize we have to update something in `bar`. So we edit `utils.py`. This, in turn, invalidates all stages. Obviously, a simple solution is to have `utils_0_2.py`, which will be a dependency of `0` and `2`, and similarly `utils_1_3.py`. But this solution conflicts with good practices of coding.

A possible solution could be to adopt some notation like the one used in `pytest` and specify the function inside `utils.py`. For example, something like:

This is a continuation of this discussion.