import-url/update: add --no-download flag #8024
Conversation
@jorgeorpinel I would appreciate your feedback on the flag naming here whenever you can 🙏
Nice @dtrifiro! I think the expected usage here is still a little unclear, so I'll try to clarify (cc @dmpetrov):
I'm suggesting the more general term "hash value" since it's what we use in https://dvc.org/doc/user-guide/project-structure/dvc-files#dependency-entries (includes `md5`, `etag`, and `checksum` fields), but I suppose "checksum" is the most accurate concept, so up to you (feel free to edit the suggestions).
Not sure I agree about generating outputs here. In order to actually write the out's hash...
I could add...
How are we getting the checksum for the deps section without streaming it? I see your point, but avoiding streaming the file is not a hard requirement here. Some other thoughts:
Good point...
We're not: depending on the remote, we use various types of metadata (md5, etag, checksum, ...) to check whether the remote file has changed and needs to be updated. Some clouds do offer the possibility of getting the md5 checksum (directly or through the etag), but this is not always available/consistent (e.g. on S3 it depends on whether the bucket is encrypted and whether the file was uploaded with multipart upload).
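For reference, a rough sketch of how that metadata appears in the dependency entry of a `.dvc` file created by `dvc import-url` (the path, etag, and size values below are made up for illustration; which metadata field appears depends on the remote, as described above):

```
$ cat file.dvc
frozen: true
deps:
- etag: '"0x8D9F2A..."'    # could be md5 or checksum instead, depending on the remote
  size: 1048576
  path: azure://test-container/dir/file
outs:
- path: file               # hash/size are recorded once the data is actually downloaded
```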
When running `dvc status`:

WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
    changed outs:
        deleted: data
    changed checksum

Streaming the file to compute the md5 would get rid of this.
Thanks @dtrifiro!
It seems like... If we start saving the cloud version ID, a typical workflow would be:
Is there a way to accomplish 3? Or, since we don't have cloud version ID support yet, is there a way to try to get the file and fail if it doesn't match the checksum?
We at least need a way to show when the source contents change, which is what's missing now.
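For context, one way to surface source changes with just the commands in this PR could look like the following sketch (placeholder filenames; it assumes `dvc update --no-download` re-records the source etag/checksum in the `.dvc` file, which is versioned in git):

```
# Re-record the source metadata without downloading the data
$ dvc update --no-download data.csv.dvc

# If the source changed, the refreshed etag/checksum shows up as a diff
$ git diff data.csv.dvc
```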
Discussed with @dtrifiro that there are several options.
We can show how 1 works to confirm the workflow makes sense, and then prototype 2, since it seems users will expect this to work. This touches on aspects of the cloud versioning discussion:
I'm more inclined to get 1 and 2 done. As @dberenbaum mentioned, getting...
Yes, not spending time on the download is the core requirement here. I see 3 options:
(1) is preferred. (2) & (3) are ok options.
Yes. Also, we need to be careful with...
@dmpetrov I think we want...
Add a `--no-download` flag to `dvc import-url`/`dvc update` to only create/update `.dvc` files without downloading the associated data. Created `.dvc` files can be fetched using `dvc pull` or updated using `dvc update --no-download`.
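A rough usage sketch of that flow (the URL and filenames below are placeholders):

```
# Create data.csv.dvc without downloading the data
$ dvc import-url https://example.com/data.csv --no-download

# Later, materialize the data from the source
$ dvc pull data.csv.dvc

# Refresh the recorded metadata without downloading
$ dvc update --no-download data.csv.dvc
```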
What is the expected behavior of committing a partial import?
It seems to me that committing a partial import should just be a no-op and silently pass?
Also, this is semi-related to #8164, but IMO using... DVC will try to pull it from a regular remote (and...). Basically, given this scenario, I don't think it's obvious to the user why the 2nd and 3rd...
What if there was something in the workspace?
It will give you the "output has changed, are you sure you want to commit" prompt, and if you commit it we overwrite the...
For the record: discussed with @dberenbaum and @pmrowla that the...
@efiop Is...
@dberenbaum Good point, for some reason I thought this wasn't broken before. Yeah, let's merge and move on then. Thanks for a great discussion, folks; the initial implementation seemed deceptively simple, but you did a great job here figuring it out properly. 🙏
if deps == stage.deps:
    stage.outs = outs
I am not sure what this is doing here.
A few lines above we clear the stage's outs (delete `hash_info`, `meta`, and `obj` from the out). This is important because when using `no_download == True`, old outputs could be retained (even if deps are changed). If the deps did not change, we can restore the previous outs since they did not change either.
But the `.clear()` mutates the output itself, right? Isn't `stage.outs[0]` and `outs[0]` the same instance here?
stage.save_deps()
stage.deps[0].download(stage.outs[0], jobs=jobs)
if check_changed and not old_hash_info == stage.deps[0].hash_info:
For the record (unrelated to this PR, it does it correctly for now): in the near future this should really be a meta check rather than a `hash_info` check, since `etag`s are really not hashes.
partial_imports = chain.from_iterable(
    self.index.partial_imports(targets, recursive=recursive)
    for _ in self.brancher(
One other thing to note is that this change makes us call brancher more than once on fetch - first to collect regular/cached used objects and then a second time to collect partial imports. This is expensive if we are fetching multiple revs (i.e. the user uses `--all-commits`).
Ideally we want to walk the repo once per revision, and collect everything we need in that one pass.
failed = 0
for stage in repo.partial_imports(targets, **kwargs):
    try:
        stage.run()
Using `run()`/`sync_import()` here means that this is always a `pull` and not really a `fetch`. The import will be downloaded into the workspace and then saved/committed to cache, but the output will remain "checked out" in the workspace, even if the user ran fetch. In this workflow I would not expect `file` to be present in my workspace (it should only be in `.dvc/cache` to be checked out later):
$ dvc import-url "azure://test-container/dir/file" --no-download
Importing 'azure://test-container/dir/file' -> 'file'
$ ls
file.dvc tags
$ dvc fetch
Importing 'azure://test-container/dir/file' -> 'file'
1 file fetched
$ ls
file file.dvc tags
add `--no-download` flag to `dvc import-url`/`dvc import`/`dvc update` to only create/update `.dvc` files without downloading the associated data. Closes #7918