Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a .pre-commit-hooks.yaml to enable hooks to refer to this repo #691

Closed
JohnPaton opened this issue Dec 18, 2020 · 18 comments
Closed

Add a .pre-commit-hooks.yaml to enable hooks to refer to this repo #691

JohnPaton opened this issue Dec 18, 2020 · 18 comments
Milestone

Comments

@JohnPaton
Copy link
Contributor

The current documentation suggests to use repo: local to use jupytext as a pre-commit hook with pre-commit. However, this requires the user to have jupytext installed in the environment they are working in when committing.

To avoid version conflicts etc, pre-commit usually creates a virtualenv per repo and runs the hooks from there. To do this, relies on hook providers to have a .pre-commit-hooks.yaml file. Adding this here would allow users to use repo: https://github.com/mwouts/jupytext to make a dedicated jupytext environment within pre-commit, instead of relying on users to have it installed in their current env. As an added bonus, jupytext could provide a "menu" of common hooks, if desired (e.g. "always convert .ipynb to .py", "always unstage .ipynb", etc). These are configurable, but not required.

@mwouts
Copy link
Owner

mwouts commented Dec 28, 2020

Hi @JohnPaton , thank you for your recommendation. I don't have much experience with .pre-commit-hooks.yaml files, but for sure I'd be interested in adding one. Would you be open to make a PR with a first version of that file? Thank you in advance.

@JohnPaton
Copy link
Contributor Author

Hey @mwouts, sure, I can take a swing at it. I myself am not super familiar with jupytext (yet, it looks like an amazing project so looking forward to getting deeper into it), so I'll propose a few hooks to provide and we can chat about the details.

@JohnPaton
Copy link
Contributor Author

JohnPaton commented Dec 30, 2020

So I've been playing around a bit, and it seems like the best option would be to have some way to both:

  • convert the filenames as passed (what normal Jupytext does, pre-commit already has logic for which files to pass), and
  • git add any newly created files (edit: and exit non-zero in this case to fail the hook)

The reason for this is that pre-commit expects that a) hooks only modify files that are already tracked, and b) any modification means the hooks "fail". This way users can inspect the changes before adding and committing them, with the idea that the hooks will not make any further changes the second time around. In our case, if a notebook has been updated, then its corresponding .py will also change, so the hook will fail and the user will need to add the .py as well before being able to commit. This is how most pre-commit hooks work. However, we also have the additional case here where a new notebook is created and added. Then new .pys will also be created, but since these aren't in the index or the tree pre-commit will ignore them, and the commit will go ahead without those new files. For that reason, only newly created files should be added by Jupytext in this mode.

I'm happy to help build something like this, but since it seems it will be a bit more than just a new .pre-commit-hooks.yaml, I wanted to run it by you @mwouts and see if you are actually interested in having this tight integration with pre-commit be part of Jupytext. For what it's worth I think this is really the perfect way to use Jupytext which is why I'm willing to spend some time on it.

@mwouts
Copy link
Owner

mwouts commented Dec 31, 2020

Hi @JohnPaton , thank you so much for your PR!

Thanks for pointing out at the need to add the files that Jupytext may create, sure you're right, it's how it should be done.

Regarding testing, we have a series of tests on the (other) pre-commit mode in test_pre_commit.py. By the way, do you think we should keep the --pre-commit argument and the custom pre-commit hooks examples? Maybe the work that you're doing will soon deprecate that part.

Also, I wanted to comment on how I see people using Jupytext:

  • They need to chose which text format they are going to use. I think the most common formats are py:percent (inter-operable with VS Code and PyCharm and many other editors), and md or md:myst, the MyST Markdown format (they can chose the paired text format per notebook, through the Jupytext Menu or command, or per folder using a jupytext.toml configuration file). Do you think that the pairing should appear in the hook, or remain as currently in either the notebook metadata, or in the jupytext.toml configuration file?
  • Also, there is the question of when the synchronization occurs. Jupytext provides a content manager for Jupyter Lab & notebook that will synchronize the paired files on disk when the user saves the notebook. But that does not cover the case when a) the paired file is edited on disk and not reopened + saved in Jupyter, or b) the ipynb file is edited in PyCharm or VS code, and these are two additional reasons why the pre-commit hook is going to help. Now, do you think that the --sync option of Jupytext, that determines which file should be updated based on timestamps, is the right approach? I am asking because in the content manager our approach is a bit different: as we always write the .ipynb file first, the direction of merge is always take input cells from the text file & outputs from ipynb, and we refuse to merge when the ipynb file is more recent than the text version. This check can save a user's notebook if they edited the .ipynb notebook with either Jupytext turned off, or if they edited the notebook in VS Code or PyCharm. Of course, if we trust timestamps we won't have this issue with jupytext --sync... so my question is merely... do you trust timestamps?

@JohnPaton
Copy link
Contributor Author

Hey, thanks for thinking along! I'll check out the existing tests if you agree this is the right approach.

do you think we should keep the --pre-commit argument and the custom pre-commit hooks examples? Maybe the work that you're doing will soon deprecate that part.

I think it would make sense to still have examples of how to use Jupytext as a pre-commit hook outside of the pre-commit framework, but since the framework is pretty widespread (and since the current naming will certainly be confusing) maybe it would be an option to have examples showing how it could work, but to drop the --pre-commit option. Then the examples could be like

git ls-files --modified **.ipynb | xargs jupytext --add-untracked

or something like that (I didn't test this, probably wrong but you get the idea).

Do you think that the pairing should appear in the hook, or remain as currently in either the notebook metadata, or in the jupytext.toml configuration file?

I think this can stay as it is in the PR, no? I'm not 100% sure what the default behaviour of Jupytext is, but I assume there's a prioritization for determining the output format like (in some order)

  • Check CLI args
  • Check Notebook Metadata
  • Check jupytext.toml
  • Else fail

That would give the most flexibility to either only use the hooks, have a global settings file in VCS, or manage it per file. Is that pretty much the current situation?

Now, do you think that the --sync option of Jupytext, that determines which file should be updated based on timestamps, is the right approach?

I'm not sure I 100% understand the use case for --sync tbh... is it there mainly to support the content manager? Thinking about it naively it seems to me like the .ipynb should basically always be the input in the case you are describing. Otherwise the output notebook will have outputs from the older .ipynb and non-matching inputs cells from the text file, right? Happy to be educated more about this.

For this use case, I mostly see the hook as ensuring that any changes to notebooks are reflected in the matching text files (e.g. for PR reviewability), but this is obviously tinted by the fact that that is my own use case.

@mwouts
Copy link
Owner

mwouts commented Jan 3, 2021

Hi @JohnPaton , thank you again for your useful comments!

I'll check out the existing tests if you agree this is the right approach.

Oh I just had a closer look at your PR, I like it very much. Don't bother with that, it should be easy for me to add coverage for the new --add-untracked option.

I think it would make sense to still have examples of how to use Jupytext as a pre-commit hook outside of the pre-commit framework, (...) but to drop the --pre-commit option. Then the examples could be like git ls-files --modified **.ipynb | xargs jupytext --add-untracked

I agree! And it's almost working, only here my shell does not seem to understand **.pynb - but maybe we can just call jupytext on every modified file and ignore those that are not notebooks.

(about output formats) I think this can stay as it is in the PR, no?

I think I am seeing a default ipynb -> py:percent in the .pre-commit-hooks.yaml, but actually there is no default in Jupytext, this setting must be given somewhere, either at the CLI, in the metadata, or in the jupytext.toml configuration file.

I'm not sure I 100% understand the use case for --sync tbh (...) Happy to be educated more about this.

Oh sure! And that is related to the previous question 😄 When the user get the text files they will be tempted to modify them, and I think it is better to report that modification onto the .ipynb file as well. Even if the outputs may become obsolete (that would also be the case anyway if they edited the notebook in Jupyter without re-executing, anyway). Don't you think so? Going further, I think that I'd recommend to run the jupytext hook on (modified) paired notebooks with any extension: (.py, .md, ... or .ipynb)

Also I've looked into previous issues in which people were looking for such a pre-commit hook, and I've found

  • jupytext changes notebooks but doesn't "save" them in a way git recognizes when working in markdown #338 - some commits with .md and .ipynb files that were out of sync, because the change was done on the .md file, and, in absence of the hook, the update of the .ipynb file occurs only when the user opens and then saves the notebook in Jupyter.
  • How to clean/tests notebooks? #432 - a discussion on a few tools we'd like to combine with the pre-commit hook: black, isort, flake8 for the Python part, and maybe pandoc --from ipynb --to ipynb --atx-headers to wrap long Markdown lines into shorter ones. BTW I think there's some interesting material there, I'd be interested into writing a blog post that would demo the pre-commit hook that you're contributing, plus these tools - that would make a great test for the hook, isn't it? I may try to write a sketch of the post next week, would you be interested to have a quick look / provide some feedback on that?

@JohnPaton
Copy link
Contributor Author

it's almost working, only here my shell does not seem to understand **.pynb

Turns out just *.ipynb is enough! So

git ls-files --modified *.ipynb | xargs jupytext --add-untracked

should work

I think I am seeing a default ipynb -> py:percent in the .pre-commit-hooks.yaml

That is correct, yeah. It felt a bit weird to have the hook not work by default. But actually if Jupytext doesn't define defaults, then the hook shouldn't either, so I will update my PR and the docs.

Don't you think so?

Ah yes, this does make sense, thanks! In light of this I think it makes even more sense to remove the defaults from the PR, and always have the user select them explicitly.

a few tools we'd like to combine with the pre-commit hook

Yeah, the flexibility to do this will be really important. I think the current --pipe functionality can handle this, right?

I'd be interested into writing a blog post [ ... ] would you be interested to have a quick look

Absolutely!

@JohnPaton
Copy link
Contributor Author

Btw, just noticed that this

but maybe we can just call jupytext on every modified file and ignore those that are not notebooks.

Is not the default behaviour - maybe we should either document (and implement) that this is the behaviour when --add-untracked is used, or else add another flag like --ignore-unmatched or something and then enable both options in the hook. What do you prefer?

@JohnPaton
Copy link
Contributor Author

JohnPaton commented Jan 4, 2021

I added a few tests for --add-untracked and for the pre-commit hook itself now btw. CodeCov is complaining but I can't figure out why

@Skylion007
Copy link
Contributor

Skylion007 commented Jan 6, 2021

Hiya, sorry for joining this discussion lately. @mwouts Not sure if you remember but there are some old issues on how to use jupytext with pre-commit that aren't covered by this hook structure. #580, which wasn't one of the two linked so I didn't get a notification of this thread.

Ah yes, this does make sense, thanks! In light of this I think it makes even more sense to remove the defaults from the PR, and always have the user select them explicitly.

^ THIS 100% --sync would be the only default I'd be comfortable with.

Why not combine the other hooks using additional dependency instead of requiring them explicitly? We should just force the user to specify it for any other pipe commands they need as it's rather easy to do so.

There should definitely be a sync hook as default imo and it should --set-format behavior as default for bidirectional syncing. Otherwise, some one is going to edit one file or the other in a PR an get their changes overwritten and be annoyed later.

I'm not sure I 100% understand the use case for --sync tbh... is it there mainly to support the content manager? Thinking about it naively it seems to me like the .ipynb should basically always be the input in the case you are describing. Otherwise the output notebook will have outputs from the older .ipynb and non-matching inputs cells from the text file, right? Happy to be educated more about this.

Use case for us is generating google colabs from python scripts and vice versa. It works really well and ensure that the two files are in sync and either the colab or the script can be edited and the changes will propogate.

Oh sure! And that is related to the previous question 😄 When the user get the text files they will be tempted to modify them, and I think it is better to report that modification onto the .ipynb file as well. Even if the outputs may become obsolete (that would also be the case anyway if they edited the notebook in Jupyter without re-executing, anyway). Don't you think so? Going further, I think that I'd recommend to run the jupytext hook on (modified) paired notebooks with any extension: (.py, .md, ... or .ipynb)

💯 this. Came back here today to write a PR to do that, and found this thread. 💯 Posted a very messy custom hook below that does it, but I would like to make the official one use the types key properly. We can specify it for all types [rst,python,notebook] etc... files will override types: if specified by the user.

How we currently use the pre-commit hook for Jupytext.
The end user should be able to override values of the hook like the args, what files to operate on etc...

hooks:
    -   id: jupytext-sync
        name: Sync scripts and notebooks
        files: '^examples/tutorials/(colabs|nb_python)/(.*\.py|.*\.ipynb)$'
        entry: jupytext 
        args: --update-metadata '{"jupytext":{"notebook_metadata_filter":"all", "cell_metadata_filter":"-all"}, "accelerator":"GPU"}' --set-formats 'nb_python//py:percent,colabs//ipynb' --pipe black --pipe 'isort - --treat-comment-as-code "# %%"' --pipe-fmt 'py:percent' --sync
        pass_filenames: true
        additional_dependencies:
            - 'jupytext==1.6.0'
            - black
            - isort
        always_run: false
        language: python
        require_serial: true

@Skylion007
Copy link
Contributor

See the discussion from #580 . Left some comments on the PR. Sorry it took so long to cycle back to this.

@Skylion007
Copy link
Contributor

@mwouts Happy to help with out the blogpost 👍

@Skylion007
Copy link
Contributor

Skylion007 commented Jan 6, 2021

Oh I just had a closer look at your PR, I like it very much. Don't bother with that, it should be easy for me to add coverage for the new --add-untracked option.

I am against the githook actually modifing the git index on its own. It really should just return a non-zero exit code if it edits / modifies a file outside the git index instead, it shouldn't add the file to gitindex by itself because often that could be indicative on error.

@JohnPaton
Copy link
Contributor Author

I am against the githook actually modifing the git index on its own

I essentially agree. The reason for it in this case is that pre-commit doesn't consider changes to files if they aren't in the index, so in the PR we only put them there if they are new, and then fail the hook to alert the user.

I guess an alternative would just be to fail if the outputs aren't tracked with some kind of warning telling the user what's up?

@Skylion007
Copy link
Contributor

Skylion007 commented Jan 6, 2021

I am against the githook actually modifing the git index on its own

I essentially agree. The reason for it in this case is that pre-commit doesn't consider changes to files if they aren't in the index, so in the PR we only put them there if they are new, and then fail the hook to alert the user.

I guess an alternative would just be to fail if the outputs aren't tracked with some kind of warning telling the user what's up?

Exactly! Sorry about blasting your github message stacks today. Been thinking about implementing this pre-commit hook for a while and finally was going to do it today 😆 I am against even adding new files to the index though because new notebook files, for instance, could have 100mbs of images and could pollute the Git repo inadvertently.

@Skylion007
Copy link
Contributor

@JohnPaton
Copy link
Contributor Author

Hey, sorry for radio silence - been a busy few days. I'll go through your comments now. But you've convinced me that just failing the hook with a clear message is a better approach ✅

@mwouts mwouts added this to the 1.9.2 milestone Jan 14, 2021
@mwouts
Copy link
Owner

mwouts commented Jan 21, 2021

What a great work done at #698! I'll close this issue now, do some more testing/doc updates at #722, and soon we'll be ready for a release...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants