Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Special dir warning #3436 #4229

Merged
merged 12 commits into from
Aug 12, 2020
Merged

Conversation

aerubanov
Copy link
Contributor

@aerubanov aerubanov commented Jul 17, 2020

Added a warning message when traversing special directories and a test for that. Also removed unused arguments in a couple of tests in tests/func/test_ignore.py

@skshetry skshetry linked an issue Jul 17, 2020 that may be closed by this pull request
dvc/ignore.py Outdated
Comment on lines 152 to 168
special_dirs = {
".petest_cache",
".mypy_cache",
"node_modules",
".venv",
"venv",
"env",
"__pycache__",
"dist",
"eggs",
"wheels",
"htmlcov",
"docs",
".ipynb_checkpoints",
".vscode",
".github",
}
Copy link
Contributor

@efiop efiop Jul 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, makes me wonder if we should really just ignore them by default (like we do with .git and .hg), and if some user really needs these, he could !something in dvcignore to include it.

Or, we might consider generating a default .dvcignore (kinda like github generates .gitignore when you create a project) to make it extra explicit. Though this is simple on init, but managing this stuff in existing projects is not too simple 🙁

I'm just worried about these warnings getting too annoying. Plus, unless I'm missing something, ignore execution order is not deterministic (that's actually is a bigger problem for us in general), so you never know if this filter will be executed before or after .dvcignores in which user has ignored a special dir explicitly. 🙁

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more thing that this makes me think about is having some kind of safe guard, that would, for example, record the amount of time it took to walk through the directory and warn if it didn't find anything but it took, say, more than 5 sec.

Copy link
Contributor Author

@aerubanov aerubanov Jul 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue was opened after this conversation #3257. It was decided that it is better to warn the user if a special directory is found than to ignore it by default. But annoyance from warnings can be a problem. Maybe we should make it possible to disable this warning? From the other hand, the default .dvcignore also good idea because the ignore will be explicite to the user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What makes idea of default .dvcignore compelling, is the fact that is not only explicit, but also editable by user.

If we decide to go with warnings, they definitely need turn off mechanism, just like we did with slow_link_guard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I soppouse that the default .dvcignore, is better option. It`s more explicite and easy to configure. Also the list of special dirs always will be incomplite, we can not add all posiible variants here. And default .dvcignore provide additional example for users how to deal with such cases.

@efiop, @pared, if you are agree with me, then I will change the code. Should I close this PR and open a new one later, or commit right here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I love the idea of an explicit .dvcignore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether it should be default. Maybe we should add some option to init, or dedicate special command for that? It seems to me that .dvcignore was supposed to let user decide what is part of his project. If we always create default .dvcignore on init, we make it our choice. @efiop @aerubanov, what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that default .dvcignore should consist comment with link to the dvcignore documentation. And we should add option for init to filling .dvcignore with some content. Not all list of special folder, rather based on type of project specified by user. Something like github gitignore template https://github.com/github/gitignore, but with fewer options ofcorse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aerubanov I think you are right on this one.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing stuff @aerubanov ! Thank you! 🙏 Please see a comment above.

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether logic of finding potentially problematic dirs should be part of dvcignore.

Introducing this change in this form will cause us to check all directories that we visit during traversing.
That might have considerable impact for all users, while it will be helpful for a subset of them.

Maybe we should be doing this check only in a directory that also contains .dvc? It's not the perfect solution, but it would not impact each walk operation.

dvc/ignore.py Outdated

def __call__(self, root, dirs, files):
for d in dirs:
if d in self.special_dirs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could do something like set(dirs).intersection(self.special_dirs) and print single warning for all problematic dirs.

@aerubanov
Copy link
Contributor Author

I add default .dvcignore with link on docs and option for init to filling .dvcignore with some content based on project type specified by user. If there is no need for further changes to the functionality, I will add tests and changes to the documentation.

@aerubanov aerubanov requested review from pared, efiop and shcheklein July 26, 2020 15:04
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,2 @@
# templates for .dvcignore files
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't really have to be a module, as there is no actual code. You can either keep it as datafiles or maybe actually make these into the python shape by declaring these as strings. The latter way is easier to manage, as things like setup.py and pyinstaller will have a much easier time collecting these automatically.

@@ -0,0 +1,23 @@
# Compiled class file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I also wonder if these are really worth stealing directly from https://github.com/github/gitignore . They would be hard to maintain, really. But I'm not able to see a better way right now...

no_scm=no_scm,
force=force,
subdir=subdir,
ignore_template_list=ignore_template_list,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

People will probably want these even in existing projects, and they will probably forget to do this during the init phase. So we could add another command to install these, but maybe it is actually better to just document that people might want to get .dvcignore from https://github.com/github/gitignore and use it.

This also makes me wonder if we should should really somehow reuse existing .gitignores (e.g. to not search for dvc-files and .dvcignores in .gitignore'd directories). But I suppose many people also want to exclude these dirs from dirs they are dvc adding (or tracking), which doesn't play well, because we gitignore the data we are adding. 🤔

Copy link
Contributor Author

@aerubanov aerubanov Jul 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop, Ok. May be we should add command for downloading template content from https://github.com/github/gitignore and pasting it in .dvcignore? Then, we need not to store it in dvc code base and maintain, and user need not to do it by hand.

And also we can add second option for this command to read existing ,gitignore, exlude all records that was added by dvc add and paste it in .dvcignore. Is it possible to easily get a list of all tracked files? (I think it should be, I just don't know where to look).

What do you think of these two possibilities?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@efiop, Ok. May be we should add command for downloading template content from https://github.com/github/gitignore and pasting it in .dvcignore? Then, we need not to store it in dvc code base and maintain, and user need not to do it by hand.

@aerubanov I don't think it is worth creating a special command. It is not hard to just go to a link in the docs and copypaste contents to .dvcignore (or use wget/curl/etc).

And also we can add second option for this command to read existing ,gitignore, exlude all records that was added by dvc add and paste it in .dvcignore. Is it possible to easily get a list of all tracked files? (I think it should be, I just don't know where to look).

That sounds a bit complex :( We had a discussion with @pmrowla about it, and maybe we could just use .gitignore directly, the only thing that is problematic is not dvcignoring dvc-tracked files, but that should be solvable by telling dvcignore about them when we collect the dag. We will soon have a pre-requisite PR that notifices dvcignore about those outputs for performace reasons and we might get back to this idea right after it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i will make the necessary changes.

dvc/repo/init.py Outdated
Comment on lines 49 to 52
f.write(
"# This is .dvcignore file."
" See more https://dvc.org/doc/user-guide/dvcignore\n"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to create a default dvcignore. It'll make people aware of it. 👍 to that.

Regarding the comment, better to explain what it does. This is .dvcignore file. is clear from the filename.

Something like following, perhaps?

Suggested change
f.write(
"# This is .dvcignore file."
" See more https://dvc.org/doc/user-guide/dvcignore\n"
)
f.write(
"# add patterns of files dvc should ignore, can improve performance\n"
"# Learn more at https://dvc.org/doc/user-guide/dvcignore\n"
)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skshetry Please see my comment above. Maybe we should just use .gitignore directly. Would love to know what you think about that.

aerubanov added a commit to aerubanov/dvc.org that referenced this pull request Aug 6, 2020
shcheklein pushed a commit to iterative/dvc.org that referenced this pull request Aug 6, 2020
* Add link on templates for .dvcignore.

Related with iterative/dvc#4229

* Restyled by prettier

Co-authored-by: Anatoly <[email protected]>
Co-authored-by: Restyled.io <[email protected]>
@aerubanov
Copy link
Contributor Author

@efiop, @skshetry . I made the changes that we discussed above. Is there more to be done?

@pared pared requested review from efiop and skshetry August 11, 2020 09:53
@efiop efiop merged commit 5cb36fc into iterative:master Aug 12, 2020
@efiop
Copy link
Contributor

efiop commented Aug 12, 2020

Thank you @aerubanov ! 🙏

@aerubanov aerubanov deleted the special_dir_warning branch August 13, 2020 14:06
skshetry added a commit to skshetry/dvc that referenced this pull request Aug 15, 2020
Since iterative#4229, we have started creating .dvcignore file after `dvc init`
This commit only adds that in assertions
skshetry added a commit that referenced this pull request Aug 15, 2020
Since #4229, we have started creating .dvcignore file after `dvc init`
This commit only adds that in assertions

with open(dvcignore, "w") as fobj:
fobj.write(
"# Add patterns of files dvc should ignore, which could improve\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"# Add patterns of files dvc should ignore, which could improve\n"
"# Add patterns of files DVC should ignore, which could improve\n"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warn users when traversing special directories, recommend to ignore
6 participants