
FEAT: Jupyter-cache integration #55

Merged
merged 40 commits into executablebooks:master on Mar 28, 2020

Conversation

AakashGfude
Member

@AakashGfude commented Mar 9, 2020

A simple implementation integrating jupyter-cache into MyST-NB's parser.

Current implementation

  • gets a local DB instance, or creates one if it doesn't exist (implemented in jupyter-cache)
  • stages the notebook
  • executes the notebook
  • merges the executed notebook with the original one and forwards it for parsing
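
A rough sketch of that flow, using the jupyter-cache API of the time (the method names below follow its README, but treat them as assumptions since they may differ between versions):

```python
import nbformat
from jupyter_cache import get_cache
from jupyter_cache.executors import load_executor

# get a local DB instance, creating it if it does not exist
cache = get_cache(".jupyter_cache")

# stage the notebook for execution
cache.stage_notebook_file("docs/example.ipynb")

# execute all staged notebooks that are not yet cached
executor = load_executor("basic", cache)
executor.run_and_cache()

# merge the cached outputs into the original notebook, then forward it for parsing
nb = nbformat.read("docs/example.ipynb", as_version=4)
pk, merged_nb = cache.merge_match_into_notebook(nb)
```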

Apart from making the present implementation more robust, I feel a few functionalities need to be present in MyST-NB:

  • Exposing the clear_cache function of jupyter_cache. It might be necessary if someone wants to re-execute the notebooks without actually changing the code in any of them.
  • Parallel execution of notebooks. Since the parser function takes one notebook at a time, implementing parallel execution at the jupyter-cache level won't be enough, and I don't think parallel execution can be implemented in the parser, because it would create a new multiprocessing instance for each notebook.

My suggestion for both of the above functionalities is to implement them in a Sphinx builder.

Please provide your suggestions @choldgraf @chrisjsewell @mmcky, I will meanwhile continue the implementation to handle as many cases as possible.

@AakashGfude changed the title FEAT: Jupyter cahe integration → FEAT: Jupyter-cache integration on Mar 9, 2020
@AakashGfude
Member Author

Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
At present, we only have an error message indicating the failed notebook.

[screenshot: error message naming the failed notebook]

@chrisjsewell
Member

> Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
> At present, we only have an error message indicating the failed notebook.

Note that all tracebacks (and eventually failed notebooks) are stored in the cache, see: https://github.com/ExecutableBookProject/jupyter-cache#staging-notebooks-for-execution

@AakashGfude
Member Author

> Also, we should have a proper traceback for debugging execution errors, primarily a report file to store all the tracebacks.
> At present, we only have an error message indicating the failed notebook.

> Note that all tracebacks (and eventually failed notebooks) are stored in the cache, see: https://github.com/ExecutableBookProject/jupyter-cache#staging-notebooks-for-execution

Great, thanks. So I should recover the cache information using the pk of the notebook. I will test this out tomorrow morning.

@chrisjsewell
Member

> Great, thanks. So I should recover the cache information using the pk of the notebook. I will test this out tomorrow morning.

Yes, note that it is stored on the staged record, not a cache record: since the notebook failed, it is not cached and no cache record is created.
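
In code, recovering that traceback might look roughly like this (the method and attribute names are assumptions about the jupyter-cache API, so double-check them against its docs):

```python
from jupyter_cache import get_cache

cache = get_cache(".jupyter_cache")
pk = 1  # illustrative primary key of the failed, staged notebook
record = cache.get_staged_record(pk)  # a staged record, not a cache record
print(record.traceback)  # the execution error stored for the failed notebook
```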

@chrisjsewell
Member

chrisjsewell commented Mar 9, 2020

> Parallel execution of notebooks. Since the parser function takes one notebook at a time, implementing parallel execution at the jupyter-cache level won't be enough, and I don't think parallel execution can be implemented in the parser, because it would create a new multiprocessing instance for each notebook.

I mentioned this in an issue before (can't remember which one!), but I feel that this should be handled around step (3) of the build process below (by adding an event callback), whereby you stage all notebooks, then call execute. Then, in the parse step (7), you will just attempt to retrieve a cached notebook or, on KeyError, the traceback from the staged record.

1. event.config-inited(app,config)
2. event.builder-inited(app)
3. event.env-get-outdated(app, env, added, changed, removed)
4. event.env-before-read-docs(app, env, docnames)

for docname in docnames:
    5.  event.env-purge-doc(app, env, docname)
    if doc changed and not removed:
        6. source-read(app, docname, source)
        7. run parser: text -> docutils.document
        8. apply transforms (by priority): docutils.document -> docutils.document
        9. event.doctree-read(app, doctree)

10. event.env-updated(app, env)
11. event.env-check-consistency(app, env)

for docname in docnames:
    12. apply post-transforms (by priority): docutils.document -> docutils.document
    13. event.doctree-resolved(app, doctree, docname)

14. call builder

15. event.build-finished(app, exception)
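
A rough sketch of that hookup, assuming all staging and execution happens in one batch callback (the helper names are hypothetical, not actual MyST-NB code):

```python
def setup(app):
    # runs once, around step (3): stage all outdated notebooks, execute in one batch
    app.connect("env-get-outdated", execute_outdated_notebooks)

def execute_outdated_notebooks(app, env, added, changed, removed):
    for docname in added | changed:
        stage_if_notebook(app, docname)  # hypothetical helper
    run_batch_execution(app)             # hypothetical: one jupyter-cache executor run
    return []                            # mark no additional documents as outdated

# the parser at step (7) then only fetches the cached notebook, or the
# traceback from the staged record if execution failed
```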

@chrisjsewell
Member

FYI, before this is merged, make sure to add some tests, using the notebooks in https://github.com/ExecutableBookProject/MyST-NB/tree/master/tests/notebooks (for an example of a full sphinx integration test, see https://github.com/ExecutableBookProject/MyST-Parser/blob/master/tests/test_sphinx/test_sphinx_builds.py)

@choldgraf
Member

choldgraf commented Mar 9, 2020

Nice! Thanks for this great start! A few quick thoughts from me:

  • I may make some pushes to this branch today to add some extra functionality
  • I'm a bit hesitant for MyST-NB to do much in the way of fancy execution. I think that as a first step, we should focus on "simple" execution (e.g. add two configuration variables: one for "execute", one for "cache"; see the sketch after this list). If the first is triggered, all ipynb files will be executed at build. If the second is triggered, all ipynb files will be executed and cached, and subsequent builds will use the cache. I think we should discuss whether this tool should be used for things like large-scale parallel execution, or whether that should instead be built into jupyter-cache and run manually from the CLI for larger jobs.
  • Agreed, people need to be able to clear the cache... could we assume that users can just delete the .jupyter_cache folder in that case? Is there a way to do this outside of a CLI? (Since myst-nb has no CLI... perhaps we can just recommend the jupyter-cache CLI once it solidifies.)
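
As a sketch, the two-variable scheme in the second bullet could look like this in conf.py (the names and semantics here are illustrative, not a final API):

```python
# conf.py -- sketch of the "simple" two-variable scheme proposed above
jupyter_execute_notebooks = True  # execute all ipynb files at build time
jupyter_cache = True              # also cache results; subsequent builds reuse the cache
```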

Another thing to remember: users of Jupyter Book should need to know nothing about MyST-NB, so we could also make some of these improvements (e.g. clearing the cache and parallel execution) in the jupyter-book CLI. The MyST-NB repo might be used directly by software developers in their Sphinx builds, so we can make more advanced assumptions about their comfort with coding in general.

@choldgraf
Member

I made a PR to your branch @AakashGfude - I had started some work on this earlier, and that PR is to help give an idea of where I was going, and to give you some code to work with if it's helpful 👍

@AakashGfude
Member Author

AakashGfude commented Mar 10, 2020

> I may make some pushes to this branch today to add some extra functionality

Thank you so much. I will have a look at the PR.

> Agreed, people need to be able to clear the cache... could we assume that users can just delete the .jupyter_cache folder in that case? Is there a way to do this outside of a CLI? (Since myst-nb has no CLI... perhaps we can just recommend the jupyter-cache CLI once it solidifies.)

I assumed the user would have no knowledge of the internals of the caching and the DB being used, and imagined that clearing the cache would happen in the background when something like make clean or make clean-cache is run. Also, the .jupyter_cache folder is not visible in the file explorer. But I guess most, if not all, users might be using a text editor anyway, so visibility might not be an issue; I just wonder how intuitive it will be to delete the folder.

> Another thing to remember: users of Jupyter Book should need to know nothing about MyST-NB, so we could also make some of these improvements (e.g. clearing the cache and parallel execution) in the jupyter-book CLI. The MyST-NB repo might be used directly by software developers in their Sphinx builds, so we can make more advanced assumptions about their comfort with coding in general.

This is a nice idea. We will need to decide which layer should have which functionality. I am not entirely clear about the functionality jupyter-book is supposed to have; I missed those discussions, but will get a clearer idea.

> I'm a bit hesitant for MyST-NB to do much in the way of fancy execution. I think that as a first step, we should focus on "simple" execution (e.g. add two configuration variables: one for "execute", one for "cache"). If the first is triggered, all ipynb files will be executed at build. If the second is triggered, all ipynb files will be executed and cached, and subsequent builds will use the cache. I think we should discuss whether this tool should be used for things like large-scale parallel execution, or whether that should instead be built into jupyter-cache and run manually from the CLI for larger jobs.

Like you said above, users of Jupyter Book need not know about MyST-NB and its internals, so we should avoid situations in which they might have to use the complicated internal tools?

@AakashGfude
Member Author

> I mentioned this in an issue before (can't remember which one!), but I feel that this should be handled around step (3) of the build process below (by adding an event callback), whereby you stage all notebooks, then call execute. Then, in the parse step (7), you will just attempt to retrieve a cached notebook or, on KeyError, the traceback from the staged record.

I had also imagined something along the same lines. @choldgraf, what are your thoughts on this?

@choldgraf
Member

Yeah, that sounds reasonable to me. It'll be hard to tell without a prototype to play around with, though; I'm happy to try something out when it's ready.

@AakashGfude
Member Author

AakashGfude commented Mar 12, 2020

@choldgraf @chrisjsewell I have reorganized the code in the last few commits as a test, following the discussion with @chrisjsewell above regarding parallel execution: the notebooks are executed and cached during the env-get-outdated event stage, and in the parser the cached outputs and the original notebooks are merged.

Feel free to scrap this idea, or embrace it.

@AakashGfude
Member Author

Also @choldgraf, I got this AttributeError for genutils after merging with master. Did the merge break anything?

[screenshot: AttributeError traceback referencing genutils]

PS: I will remove this comment later to prevent clutter.

@chrisjsewell
Member

Sorry @AakashGfude, with 688e421 you'll need to refactor again 😬

@choldgraf
Member

Looking better! The cache worked for me this time 🎉

A few quick thoughts:

  • We probably don't need a message for both execution and completion of each notebook; maybe just one, to avoid cluttering the output
  • When Jupyter Cache isn't set I'm getting an error: WARNING: The config value `jupyter_cache' has type `str', defaults to `bool'.
  • Maybe we can collapse the number of configuration keys down by one, by making jupyter_execute_notebooks allow two values: "force" or "auto"

What other functionality should I try out to break it? :-)

@AakashGfude
Member Author

AakashGfude commented Mar 25, 2020

> We probably don't need a message for both execution and completion of each notebook; maybe just one, to avoid cluttering the output

You are right. Ideally, we should have something like:

[screenshot: jupinx-style progress bar during execution]

The screenshot is from jupinx. I thought of working on that after this PR is merged, as it might require discussion. What do you think?

> Maybe we can collapse the number of configuration keys down by one, by making jupyter_execute_notebooks allow two values: "force" or "auto"

I agree there are too many variables; this one sounds good. Will do a commit for it.

> What other functionality should I try out to break it? :-)

One thing, I think, is to check whether the reports are getting generated, by having one notebook fail execution with an error.

> When Jupyter Cache isn't set I'm getting an error: WARNING: The config value `jupyter_cache' has type `str', defaults to `bool'.

Can you send me the traceback for this? I am not able to reproduce it on my system.

@choldgraf
Member

Re: the last point, it wasn't an error, just a warning, so there was no traceback. Sphinx will throw a warning if the value of a config key has a different type from the default value.
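
For context, the type check comes from how an extension registers its config values; a minimal illustration (the registration below is an assumption about this PR's code, not a quote from it):

```python
# Sphinx compares the type of the user's conf.py value against the default's type
def setup(app):
    app.add_config_value("jupyter_cache", False, "env")  # default is a bool

# a conf.py that sets a string then triggers:
# WARNING: The config value `jupyter_cache' has type `str', defaults to `bool'.
```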

@AakashGfude
Member Author

AakashGfude commented Mar 25, 2020

> Re: the last point, it wasn't an error, just a warning, so there was no traceback. Sphinx will throw a warning if the value of a config key has a different type from the default value.

This is weird; I am not able to reproduce it, and I can't see where in the code the warning comes from. I may need your help with that: can you grep for that variable and see if it's set to a string somewhere on your system?

Also, about this point

> Maybe we can collapse the number of configuration keys down by one, by making jupyter_execute_notebooks allow two values: "force" or "auto"

It won't handle the case where users don't want to execute at all. We could have three values for this variable, the third one being "off"? I am not good at naming things.

@choldgraf
Member

Yeah, "off" seems fine if we don't want the Sphinx warning about types... otherwise None or False would work...

@AakashGfude
Member Author

@choldgraf I have made the necessary changes, and jupyter_execute_notebooks can now take the values "force", "auto", and "off". I have also updated the docs.

@choldgraf
Member

Agree re: the jupinx loading bars, I think that style would be best!

A few more thoughts:

  • When I have jupyter_execute_notebooks set to "force" but jupyter_cache set to True, it seems to keep using the cache. This is a bit confusing, as I assumed it would force re-execution each time.
  • What if we collapse the config a bit further: jupyter_execute_notebooks takes four possible options: auto (default), force (always execute), cache (equivalent to jupyter_cache = True), and off. Then jupyter_cache is only used if a person wants to point to a different cache location (see the sketch after this list).
  • Execution exclusion doesn't seem to work for me: I put execution_excludepatterns = ['*glue*'], and docs/use/glue.ipynb was still executed. Does it not support glob-like patterns? (If not, can we support that?)
  • I noticed that the "Executing notebook" log message doesn't show up if we don't use a jupyter cache - is there a reason not to tell people it's executing if we aren't caching? I think we could say "executing" if jupyter_cache = False, and "executing and caching" if jupyter_cache = True.
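
For concreteness, the collapsed configuration could look like this in conf.py (values illustrative; "cache" would subsume jupyter_cache = True):

```python
# conf.py -- the configuration surface proposed in the list above
jupyter_execute_notebooks = "cache"       # one of: "auto", "force", "cache", "off"
jupyter_cache = "path/to/.jupyter_cache"  # only needed to point at a non-default cache
execution_excludepatterns = ["*glue*"]    # glob-style exclusion, as requested above
```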

@AakashGfude
Member Author

> What if we collapse the config a bit further: jupyter_execute_notebooks takes four possible options: auto (default), force (always execute), cache (equivalent to jupyter_cache = True), and off. Then jupyter_cache is only used if a person wants to point to a different cache location.

So in this case auto will use the cache as well, right? What then would be the difference from cache? And the person would use the jupyter_cache variable only when they want to point to a different location?

> Execution exclusion doesn't seem to work for me: I put execution_excludepatterns = ['*glue*'], and docs/use/glue.ipynb was still executed. Does it not support glob-like patterns? (If not, can we support that?)

Yes, at present it was just doing a substring check of execution_excludepatterns against the file paths. I will have a look at how to support glob-like patterns.
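
For reference, a minimal sketch of glob-style matching using fnmatch from the standard library (the helper name is illustrative, not the actual MyST-NB code):

```python
from fnmatch import fnmatch

def is_excluded(path, patterns):
    # glob-style matching: "*glue*" now matches "docs/use/glue.ipynb"
    return any(fnmatch(path, pattern) for pattern in patterns)

assert is_excluded("docs/use/glue.ipynb", ["*glue*"])
```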

> I noticed that the "Executing notebook" log message doesn't show up if we don't use a jupyter cache - is there a reason not to tell people it's executing if we aren't caching? I think we could say "executing" if jupyter_cache = False, and "executing and caching" if jupyter_cache = True.

I saw some logs about execution from nbclient as well, but I just realized that it shows the log only for markdown files (and it executes markdown files as well). I will suppress that via configuration and emit logs as mentioned.

@AakashGfude
Member Author

@choldgraf I have updated the functionality; feel free to test it and let me know your thoughts.

@choldgraf
Member

I just gave it another shot, and it worked really nicely 👍 - I tried all the different config options, and they all worked, great job! I pushed a small extra bit to the documentation and also cleared out the outputs in our ipynb files in the docs, so that we can actually put this to the test on a daily basis :-)

One thing that I foresee potentially confusing people in the future: we should provide some kind of guidance for how to "clear the cache". E.g., do they delete _build? What if they only want to delete the pages to trigger a re-build, but don't want to remove the cache artifacts? That kind of thing... I don't know that it's super actionable right now, though, because I'm not sure what the right path forward is; maybe something to keep an eye on as people use it.

I think that this one is nearly ready to go - we should treat this functionality as "late-alpha, early beta" for a while, so there's time for users to give the API a shot and provide feedback, squash bugs, etc. But I think it's ready to go from a user functionality and docs standpoint :shipit:

@chrisjsewell what do you think?

@chrisjsewell
Member

Looks good to me 👍

@AakashGfude
Member Author

AakashGfude commented Mar 27, 2020

> One thing that I foresee potentially confusing people in the future: we should provide some kind of guidance for how to "clear the cache". E.g., do they delete _build? What if they only want to delete the pages to trigger a re-build, but don't want to remove the cache artifacts? That kind of thing... I don't know that it's super actionable right now, though, because I'm not sure what the right path forward is; maybe something to keep an eye on as people use it.

What I think we can have is targeted clean commands: make clean to clear everything, make clean-html to clear the HTML outputs while preserving the cache, and make clear-cache to clear .jupyter_cache only (we will have to take a custom jupyter_cache path into account).

Deleting pages from the cache to trigger a re-build without removing the artifacts is something we need to think about, as is where this functionality should live; for example, the above make commands could be part of the CLI's Makefile.
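
A minimal sketch of what a make clear-cache target could call under the hood, assuming the clear_cache function mentioned at the top of this thread (the wrapper and the custom-path handling are illustrative):

```python
from jupyter_cache import get_cache

def clear_jupyter_cache(custom_path=None):
    # honour a custom jupyter_cache path from conf.py, else use the default
    path = custom_path or ".jupyter_cache"
    get_cache(path).clear_cache()
```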

@choldgraf
Member

Fantastic! I'll plan on giving this a merge when I'm back at my computer (probably tomorrow morning).

@chrisjsewell
Member

You could then bump the version to 0.5.0a1 and create a release: https://github.com/ExecutableBookProject/MyST-NB/releases, and it will automatically be published to PyPI as a pre-release by the triggered Travis CI build.

@choldgraf
Member

alright let's 🚢 it!

@choldgraf merged commit f186cc5 into executablebooks:master Mar 28, 2020
@choldgraf
Member

choldgraf commented Mar 28, 2020

OK, I merged, bumped the version on master (#103), and also added a release: https://github.com/ExecutableBookProject/MyST-NB/releases/tag/0.5.0a1. I bumped master before making the official release on GitHub, so I triggered a Travis re-build, because I'm assuming that needs to happen once a release already exists.

@chrisjsewell
Member

> I bumped master before making the official release on GitHub, so I triggered a Travis re-build, because I'm assuming that needs to happen once a release already exists.

Not sure I understand this, but when you make a release, that triggers Travis (or GitHub Actions, as we may move to soon), so you don't need to trigger it manually. But anyway, it's all sorted 👍

FYI, in case people didn't know: the pre-release is now on https://pypi.org/project/myst-nb/#history. It will not be installed by pip install myst-nb; it will only be installed if you specifically request pip install myst-nb==0.5.0a1.

phaustin pushed a commit to phaustin/MyST-NB that referenced this pull request Apr 1, 2020
* added functionality of jupyter-cache
* writing tests to cover the all/most cases
* adding execution and cacheing documentation
* moved fixtures to conftest