Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stage collection: ignore git-ignored directories? #5244

Closed
skshetry opened this issue Jan 11, 2021 · 12 comments · Fixed by #5265
Closed

stage collection: ignore git-ignored directories? #5244

skshetry opened this issue Jan 11, 2021 · 12 comments · Fixed by #5265
Assignees
Labels
discussion requires active participation to reach a conclusion question I have a question?

Comments

@skshetry
Copy link
Member

skshetry commented Jan 11, 2021

dvc status in our repo is terribly slow (~6s), and I think a project of a reasonable size of our users are slow as well, as we might be traversing through very dense directories (eg: node_modules/venv, and user's dvc-tracked data).

We have discussed before to get rid of .dvcignore, but I see it to be a different thing, useful in, for example, ignoring .DS_Store/logs, etc in dvc-tracked artefacts.

We could opt for a simple solution and just ignore the git-ignored dvcfiles and directories and never traverse them. In addition to this, as we also add user's data in the gitignore, we can also get significant speedups just by doing this.

Also, design-wise, we won't need to mix gitignores and dvcignores in the trees.

Performance improvements in dvc's own repo for status

Before
  Time (mean ± σ):      4.240 s ±  0.192 s    [User: 3.535 s, System: 0.474 s]
  Range (min … max):    4.141 s …  4.781 s    10 runs

After

Benchmark #1: dvc status with dulwich cached
  Time (mean ± σ):     512.1 ms ±  73.2 ms    [User: 387.6 ms, System: 43.3 ms]
  Range (min … max):   436.2 ms … 704.0 ms    10 runs

Also, note that we can change this behaviour in 2.0 release, as this could be considered a breaking change.

@skshetry skshetry added question I have a question? discussion requires active participation to reach a conclusion labels Jan 11, 2021
@skshetry skshetry mentioned this issue Jan 11, 2021
2 tasks
@efiop
Copy link
Contributor

efiop commented Jan 12, 2021

Great suggestion! We should think about scenarios that this might break, but it looks like it won't break any scenarios that use dvc get/import/list functionality, as they pretty much merge git and dvc anyway and so .gitignore is always taken into account and gitignored dvcfiles were never shown there.

@skshetry
Copy link
Member Author

looks like it won't break any scenarios that use dvc get/import/list functionality

This proposal is only regarding the user's workspace. Even though the user might have git-ignored dvcfiles (maybe to prevent it from getting updated), we'll still read what's checked-in the git. So, it does not apply to those commands at all.

One thing that comes to mind is metrics/params/plots commands. Git ignoring them will make those commands show empty values/em-dash, but I think, it does not hurt that much as they are explicitly ignoring them.

@karajan1001
Copy link
Contributor

As all of the DVC tracked data would be added into .gitignore, can stage collection only consider positions tracked by .gitignore?

@karajan1001
Copy link
Contributor

Tried on my computer,

image

DVC creates .gitignore inside the directory, we still have to look into every directory for collecting them.

@skshetry
Copy link
Member Author

we still have to look into every directory for collecting them.

@karajan1001, we do need to look into every directory, except the git-ignored ones which include the dvc-tracked data as well. Our assumption is that the user's repo is not usually dense (if we ignore dvc-tracked data and the git-ignored files like node-modules, venv, __pycache__, etc.)

Users could just use .dvcignore, but not that many people know about that (and, feels like it has it's own niche use case of ignoring inside the outputs) or use it actively. This should help us improve performance out of the box without any configuration by the user.

@jorgeorpinel
Copy link
Contributor

Hi! Thanks for the RFC on Slack. A couple general questions to make sure I understand this:

We have discussed before to get rid of .dvcignore

Can you clarify whether you are proposing to get rid of .dvcignore here?

just ignore the git-ignored dvcfiles and directories and never traverse them

Again, are we suggesting to rely on .gitignore (deprecate .dvcignore)? B/c there could be files you want DVC to ignore but Git to track.

we won't need to mix gitignores and dvcignores in the trees.

Can someone explain what they refer to when talking about trees? I assume it has to do with directory structures but never been sure.

Thanks

@skshetry
Copy link
Member Author

Can you clarify whether you are proposing to get rid of .dvcignore here?

We had discussions before on getting rid of .dvcignore and just use .gitignore everywhere. There's one usecase of .dvcignore: to ignore files inside the dvc-tracked directories. But, since we also use .gitignore to ignore dvc-tracked directories, by deprecating .dvcignore, we won't be able to provide that usecase, as detecting either of those could be tricky.

So, no, I am not proposing to get rid of .dvcignore, as I see that important usecase of it.

Again, are we suggesting to rely on .gitignore (deprecate .dvcignore)? B/c there could be files you want DVC to ignore but Git to track.

Yes, but what's the usecase of that? That's what we need to understand.
Sure, there might be some hypothetical usecase. But, is that important enough for us to not consider this the default? We can provide a way around this with a config if required.

Can someone explain what they refer to when talking about trees? I assume it has to do with directory structures but never been sure.

This is an internal abstraction. We had discussions months before about using both .gitignore and .dvcignore to speedup DVC's traversal. But, it complicates our internal.

What I am proposing here is just use this at a high level, when searching for the stages, which provides almost same benefit, without too much complications.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 15, 2021

What I am proposing here is just use this at a high level, when searching for the stages

OK, I was confused by the dvc status-centered description, because that also looks for changes in tracked data. But I think you're limiting the proposed change to traversing for building/validating DAGs (right?).

there could be files you want DVC to ignore but Git to track.
but what's the usecase of that?

Yes, good Q. It's a bit confusing because unlike Git, DVC already ignores everything by default (the Git analogy comes back to haunt us again...) but:

🔹 I think you could safely Git-track a file/dir created inside a DVC-tracked dir by entering its pattern to .dvcignore AND a matching ! pattern to .gitignore. If it's a dir (tree) containing dvc.yaml file(s), would that break with this change? (see #5244 (comment)) What would the way around via config look like?

Users could just use .dvcignore, but not that many people know about that (and, feels like it has it's own niche use case of ignoring inside the outputs

We could try to emphasize that usage of .dvcignore in docs. Even add it to the Best Practices section (when that happens... see iterative/dvc.org/issues/72). No point in doing that if this gets implemented though.

@skshetry
Copy link
Member Author

I think you could safely Git-track a file/dir created inside a DVC-tracked dir by entering its pattern to .dvcignore AND a matching ! pattern to .gitignore. If it's a dir (tree) containing dvc.yaml file(s), would that break with this change?

dvc-tracked directory must not contain a dvc.yaml/.dvc file. Or, maybe I did not get your question?

What would the way around via config look like?

It could be implemented when someone asks for, with a clear use-case. It could be core.no_ignore or something similar.

We could try to emphasize that usage of .dvcignore in docs

People might not like to maintain .gitignore and .dvcignore. At least, not in the same way they do in .gitignore. I have seen users using .dvcignore in the case of ignoring inside outputs, but generally, not on other cases.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 15, 2021

Performance improvements in dvc's own repo for status

BTW I tried $ hyperfine 'dvc status' in my local copy of dvc and got 756.4 ms before, and 283.9 ms after putting .env/ in .dvcignore (Ubuntu on WSL2). Significant but not AS extreme as yours, I wonder why yours goes so high initially.

@skshetry
Copy link
Member Author

skshetry commented Jan 15, 2021

Significant but not AS extreme as yours

I have 4 virtual environments inside .env for each of the python version. And, .pyc and __pycache__ files.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 15, 2021

maybe I did not get your question?

Wasn't a question but never mind: "It is not possible to re-include a file if a parent directory of that file is excluded." (from gitignore)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion question I have a question?
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants