Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: copy (certain) git-ignored files to tmp folder on --run-all #5800

Closed
Tracked by #8972
jdonzallaz opened this issue Apr 12, 2021 · 39 comments · Fixed by #9302
Closed
Tracked by #8972

exp run: copy (certain) git-ignored files to tmp folder on --run-all #5800

jdonzallaz opened this issue Apr 12, 2021 · 39 comments · Fixed by #9302
Assignees
Labels
A: experiments Related to dvc exp feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint

Comments

@jdonzallaz
Copy link

jdonzallaz commented Apr 12, 2021

UPDATE: Jump to #5800 (comment)

When running experiments with --run-all (with some experiments in the --queue), the repo is copied to the .dvc/tmp/ for each experiment to run. The problem is that the git-ignored files are not copied to this folder, and the commands that depend on those files fail.
The files that are git-ignored but are defined as stage dependencies should be copied.
Use-case is the following: Some config files containing sensible data (API keys, password) are not pushed to the git repo but are still needed by certain commands.

See discord discussion.
CC: @shcheklein @pmrowla

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 13, 2021

Hey @jdonzallaz per #5029 it's possible to include untracked files in tmp exp dirs by staging them before queueing the experiment. So I think you could try something like

$ git add --force secret.txt
$ dvc exp run --queue
$ git remove secret.txt

⚠️ This would be a workaround! And it may have unwanted consequences like the secrets ending up in Git history.

Let's see what Engineering thinks of this one 🙂

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2021

I don't think that workaround is viable here given that the request here is specifically regarding:

Some config files containing sensible data (API keys, password)

And if an experiment is run this way, those API keys and password will end up in git history

@jorgeorpinel
Copy link
Contributor

Do we need a separate issue to specifically prevent this workaround? B/c if I thought about it someone else may figure it out and it can lead to some hard-to-predict security risks.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2021

There's not really anything we can do to prevent people from doing this, users can add whatever they want into git.

I think anyone using git add --force on a file containing secret credentials should already be aware that this will add that file into git, since that's what git add --force does by definition?

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 13, 2021

that's what git add --force does by definition?

Not without git commit. In this scenario people would stage the secrets specifically because we copy staged (not committed) files into tmp exp dirs.

I keep coming back to the idea of not using Git as a proxy UI to DVC's behavior, but that's discussed is in #5801.

BTW, from #5801 (comment) @pmrowla:

in theory, stuff like (properly gitignored) venvs also needs to be copied over

Feels like too much. Is it common to have something like venv/ as a stage dependency? Or are we expanding the scope here (looking ahead)? Thanks

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2021

Not without git commit. In this scenario people would stage the secrets specifically because we copy staged (not committed) files into tmp exp dirs.

And anything that is staged will be included in the experiment commit, so that it can be reproduced again later.

Feels like too much. Is it common to have something like venv/ as a stage dependency? Or are we expanding the scope here (looking ahead)? Thanks

I don't think it would be listed as a dependency, but some user stages may still depend on that directory existing. For example, I could have a pipeline stage with the command set to .venv/bin/python my_script.py

@dberenbaum
Copy link
Collaborator

@jdonzallaz For your particular use case, how important is it to have the credentials files as relative paths within your repo? Are there workarounds for you like using environment variables, absolute paths, or making paths relative to your home dir?

This isn't meant to dismiss the issue but to gather some more context about when and why it's important to have files like this accessible as gitignored files within the repo itself.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 14, 2021

pipeline stage with the command set to .venv/bin/python my_script.py

Any script used in cmd run should be listed under deps right? In this case pretty tricky as .venv/ is probably Git or DVC ignored, so would DVC even cache it? Interesting Q for sure, but it feels like a separate issue/discussion (we could open it though).

@pmrowla
Copy link
Contributor

pmrowla commented Apr 14, 2021

Any script used in cmd run should be listed under deps right? In this case pretty tricky as .venv/ is probably Git or DVC ignored, so would DVC even cache it? Interesting Q for sure, but it feels like a separate issue/discussion (we could open it though).

Standard practice would only be to include my_script.py as a dep. If I have a stage that calls /usr/bin/bash my_script.sh, I don't add /usr/bin/bash as a pipeline dependency.

And yes, standard practice would also be for users to gitignore a venv directory, the same way that they should be gitignoring __pycache__. But in real world use cases, not everyone is gitignoring everything that they should be

@jdonzallaz
Copy link
Author

Ah, I see this issue is tricky. I've also checked the other related issues.

To answer your question @dberenbaum, there is no technical reason the credentials files could not be in another, absolute path. However feels like a hack and does not feel right to not have project-related files in the project folder, and we should add another config entry to tell where is the credential file.

At the moment, the only possibility is to git add -f / git remove. It works as a workaround but is not really convenient.
I now understand why adding all files (even git-ignored) can be a problem because of folders like venv. I do not expect those to be copied though.

I feel that git-ignored files that are explicitly listed as stage dependencies (like my credentials files) should be copied to the temp folder (without relying on git). This either automatically or through a CLI option.

@jorgeorpinel jorgeorpinel added discussion requires active participation to reach a conclusion feature request Requesting a new feature labels Apr 14, 2021
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 14, 2021

I don't add /usr/bin/bash as a pipeline dependency.

That seems different. .venv/bin/python is a relative path inside the workspace.

in real world use cases, not everyone is gitignoring everything that they should

Agree but I'm not sure what you're suggesting @pmrowla, how do we even detect things that *should* be gitignored? I don't know the answer but again, it feels like a separate Q. I'd open another issue but TBH I don't know there's much DVC can do about that.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Apr 14, 2021

At the moment, the only possibility is to git add -f / git remove.

@jdonzallaz again please be aware that the files will end up in the Git history this way, even if you don't explicitly git commit them.

I feel that git-ignored files that are explicitly listed as stage dependencies (like my credentials files) should be copied to the temp folder (without relying on git).

Yes! It's something we're considering (#5816): to have way(s) to tell DVC which (if any) untracked files to include in the queued exp, but knowing that DVC won't ever put them in Git. DVC would just cache them locally. And when/it you dvc exp apply that experiment, they would be restored to the workspace as untracked files again. WDYT?

@jorgeorpinel jorgeorpinel removed the discussion requires active participation to reach a conclusion label Apr 14, 2021
@pmrowla
Copy link
Contributor

pmrowla commented Apr 15, 2021

Agree but I'm not sure what you're suggesting @pmrowla, how do we even detect things that should be gitignored?

My point was that we can't do this, and that is a reason why we should not allow blanket adding untracked files. We should make the user be explicit.

@jdonzallaz
Copy link
Author

@jorgeorpinel Ah you're right, I forgot the part about the files ending in the Git history.
What you propose seems the best solution.

Does having the untracked file in the stage dependencies is being explicit ?
Otherwise, having an experiment config file or having CLI options look like good solutions to me.

@dberenbaum
Copy link
Collaborator

@jdonzallaz Are you collaborating with others on your project? Are these personal or shared credentials? I'm wondering how this would work on a team where everyone has personal credentials as dependencies, since it seems everyone will have to re-run each pipeline since the credentials will differ for everyone.

@pmrowla @jorgeorpinel Would this be solved by being able to cache data locally only (data gets ignored by dvc push and pull)? It seems like that would enable users to cache sensitive data. We have some other issues open related to fine-grained control of file-level permissions and remotes, so maybe this is a use case for those features.

@jorgeorpinel
Copy link
Contributor

being able to cache data locally only (data gets ignored by dvc push and pull

Yep, I think that's the idea that's forming in #5816 (comment).

@dberenbaum
Copy link
Collaborator

On second thought, I don't think a local cache would even be the right solution for credentials, because if they change, you probably want to use the updated version and not the cached version.

@jdonzallaz
Copy link
Author

I'm not collaborating with other on this project. We have another project with DVC and collaboration, but the credentials (for the remote) are in the config.local. In this project, the credentials are (can be) shared, as they are specific to the project (and not the user).
But you're right that personal credentials will make the pipeline run again. I didn't think about this.

@jorgeorpinel jorgeorpinel changed the title exp: copy git-ignored files to tmp folder on --run-all exp run: copy (certain) git-ignored files to tmp folder on --run-all Apr 16, 2021
@jorgeorpinel
Copy link
Contributor

Hi! Another idea from #1416 (comment) is to setup some mechanism (e.g. env vars) so that queued experiments can use certain assets from the original workspace directly (without moving them to the tmp dir or committing them to the exp commit).

@dberenbaum dberenbaum added the A: experiments Related to dvc exp label Jun 16, 2021
@gregstarr
Copy link

Hello, when running experiments in parallel, I am finding that local cache settings are ignored.

I am using dvc 2.9.2.

when I am in my repo:

(venv) [me@computer folder]$ dvc config -l
remote.remote.url=//remote/url
core.remote=remote
remote.remote.url=//local/remote/url
cache.dir=/cache/dir
cache.type=hardlink,symlink

but when I cd to one of the temporary experiment directories:

(venv) [me@computer tmp0zpspb9w]$ dvc config -l
remote.almds.url=//remote/url
core.remote=remote
cache.dir=/cache/dir

I mentioned this on Discord and was told that this could be relevant to this issue.

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Feb 6, 2023
@daavoo daavoo removed the p1-important Important, aka current backlog of things to do label Feb 14, 2023
@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Mar 13, 2023
@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p1-important Important, aka current backlog of things to do labels Mar 31, 2023
@daavoo daavoo self-assigned this Apr 3, 2023
@daavoo daavoo added this to DVC Apr 3, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Apr 3, 2023
@daavoo daavoo moved this from Backlog to In Progress in DVC Apr 3, 2023
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
@daavoo daavoo linked a pull request Apr 4, 2023 that will close this issue
@daavoo daavoo moved this from In Progress to Review In Progress in DVC Apr 4, 2023
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 4, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
daavoo added a commit that referenced this issue Apr 5, 2023
List of paths to copy inside the temp directory. Only used if `--temp` or `--queue` is specified.

Closes #5800
@github-project-automation github-project-automation bot moved this from Review In Progress to Done in DVC Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp feature request Requesting a new feature p3-nice-to-have It should be done this or next sprint
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants