Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First-pass thoughts on the CLI etc #16

Closed
choldgraf opened this issue Feb 28, 2020 · 26 comments
Closed

First-pass thoughts on the CLI etc #16

choldgraf opened this issue Feb 28, 2020 · 26 comments

Comments

@choldgraf
Copy link
Member

choldgraf commented Feb 28, 2020

I'm going to have a go at trying out the CLI and the docs, and will report back thoughts or confusions as they come up. I am going to be sort-of intentionally dense here, because that'll be the perspective of users when they first start using the tool (apologies to new users if they come across this statement lol)

In general, I think that it's moving in the right direction. I'm impressed at the machinery under the hood and excited to get the API sorted a bit. Most of my comments are around clarity and UX for the CLI (I think we should do something similar for the Python API as well).

here are my very intentionally verbose comments :-)

  • What's the difference between stage-nb and stage-nbs? It seems like these should be doing the same thing, but one of them allows asset paths and the other doesn't? Is there a way that we could allow for a single stage command that could handle both a single- and multi-notebook case? **edit: it looks like stage-nb is in fact a significantly different use-case? In that case we should call the verb something different. I was confused that stage-nb takes as a first argument a list of paths to assets, while stage-nbs takes paths to notebooks.

  • To that extent, I feel like it would be less cognitively-burdensome to have three verbs at the user's disposal:

    • stage
    • execute
    • commit
    • As opposed to having "-nbs" in there (and where we use one verb for both single-and multi-notebook cases). Unless we expect that this will ever be used to "stage" a thing that is not also a notebook?
  • It feels awkward to have commit-limit as one of the top-level user-facing commands. Do we expect that users will often use this verb? If not, or if we imagine similar functionality where we want users to control the global behavior using the CLI, perhaps another verb could be jcache config <verbs-to-control-config>?

  • A convenience verb like jcache status would be helpful. I found myself naturally wanting to type it since we are using so much git terminology.

  • The execute docstring should be "execute staged and outdated notebooks", right?

  • I'm also feeling that jcache is a lot of letters to type if I am doing a lot of typing etc...I wonder if there is a more short-hand entrypoint to the CLI we can think of.

  • If I execute a notebook with an empty cell, then I get the following error because the cell wasn't executed:

    jupyter_cache.base.NbValidityError: Expected cell 2 to have execution_count 2 not None
    
  • After then deleting those empty cells and re-running jcache execute, I am now getting this much more confusing error:

  Executing: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Failed Commit: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Traceback (most recent call last):
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/executors/basic.py", line 65, in run
    self.cache.commit_notebook_bundle(final_bundle, overwrite=True)
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 202, in commit_notebook_bundle
    self._validate_nb_bundle(bundle)
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 177, in _validate_nb_bundle
    nb_bundle,
jupyter_cache.base.NbValidityError: Expected cell 3 to have execution_count 3 not 10
Finished!
(dev) ✔ /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache [master|✚ 1…1] 
12:00 $ jcache execute
Executing: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Failed Commit: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Traceback (most recent call last):
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/executors/basic.py", line 65, in run
    self.cache.commit_notebook_bundle(final_bundle, overwrite=True)
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 202, in commit_notebook_bundle
    self._validate_nb_bundle(bundle)
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 177, in _validate_nb_bundle
    nb_bundle,
jupyter_cache.base.NbValidityError: Expected cell 2 to have execution_count 2 not None
Finished!
(dev) ✔ /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache [master|✚ 1…1] 
12:01 $ jcache execute
Executing: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Failed Commit: /c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/docs/viz.ipynb
Traceback (most recent call last):
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/executors/basic.py", line 65, in run
    self.cache.commit_notebook_bundle(final_bundle, overwrite=True)
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 235, in commit_notebook_bundle
    self.truncate_commits()
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/main.py", line 101, in truncate_commits
    COMMIT_LIMIT_KEY, self.db, DEFAULT_COMMIT_LIMIT
  File "/c/Users/chold/Dropbox/github/forks/python/ebp/jupyter-cache/jupyter_cache/cache/db.py", line 63, in get_value
    result = session.query(Setting.value).filter_by(key=key).one_or_none()
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/orm/query.py", line 3257, in one_or_none
    ret = list(self)
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 105, in instances
    util.raise_from_cause(err)
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 398, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 153, in reraise
    raise value
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 89, in instances
    for row in fetch
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 89, in <listcomp>
    for row in fetch
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 88, in <listcomp>
    keyed_tuple([proc(row) for proc in process])
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/engine/result.py", line 107, in __getitem__
    return processor(self._row[index])
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/site-packages/sqlalchemy/sql/sqltypes.py", line 2255, in process
    return json_deserializer(value)
  File "/home/choldgraf/anaconda/envs/dev/lib/python3.7/json/__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not int
  • but, running jcache execute again seems to now not have any errors. I think this is because the second time I ran it, it didn't try to execute the notebook. The error happens again if I change the notebook code cell.
  • Does commit-nb strictly require that a notebook be executed? Or can you do a straight jcache stage-nb myntbk.ipynb -> jcache commit-nb?
  • jcache-execute should say how many notebooks it skipped and executed in the "finished" statement. Something like Finished executing 3 notebooks (5 non-stale notebooks skipped)
  • It's unclear to me what cat-artifact does. The docstring doesn't disambiguate what PK or ARTIFACT_RPATH is. I'm assuming PK is "Primary Key", but that is database speak not human speak :-)
  • I'm also not sure what an "artifact" is...is it something generated from running code?
  • diff-nb - I'm not sure when a notebook is officially "cached" or not. Is it when I run jcache execute? Or do I have to explicitly do the 'commit'?
  • commit-nb - based on the logic of previous steps, I was assuming that this would just "commit everything that is staged right now". I think this is because we are overloading git language. Another suggestion from me that we might this word to be cache not commit to make it explicit when cacheing is happening, and to avoid confusion with git.
  • Ah I just got burned by the API difference between commit-nb and commit-nbs again...that is a confusing one...
  • I can't go any further because any time I try executing the notebook, I'm getting TypeError: the JSON object must be str, bytes or bytearray, not int and any time I then try to commit the notebook, it warns me that the notebook wasn't executed.

OK I hope that this is helpful, I'm happy to step through subsequent iterations!

@choldgraf
Copy link
Member Author

@mmcky it would be very helpful if you or others at the QE team could take similar pass-throughs on the jupyter-cache CLI. I think that this functionality is most-relevant to the QuantEcon usecase so it'll be important that you all are comfortable with the API/CLI/etc.

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 28, 2020

In general, I think that it's moving in the right direction. I'm impressed at the machinery under the hood and excited to get the API sorted a bit. Most of my comments are around clarity and UX for the CLI (I think we should do something similar for the Python API as well).

Cheers 😄 Yeh obviously it's still at a relatively early stage, with the CLI (currently) aimed at showcasing the full functionality, rather than necessarily being 'user-friendly'. These comments will start helping to achieve that.

What's the difference between stage-nb and stage-nbs? It seems like these should be doing the same thing, but one of them allows asset paths and the other doesn't? Is there a way that we could allow for a single stage command that could handle both a single- and multi-notebook case?

As you say, stage-nbs is the basic use case, whereby your notebooks don't rely on any assets and so you can just bulk list a number of them. Note this command does allow you to specify a single notebook, so would be the general 'goto' for both.

In that case we should call the verb something different. I was confused that stage-nb takes as a first argument a list of paths to assets, while stage-nbs takes paths to notebooks.

Then stage-nb came later; thinking about how to also include specific assets for each notebook.
It should probably be named differently, like stage-with-assets but less verbose.

I also agree there is currently a cognitive dissonance on their CLI use. This is because it'll taking some 'CLI wizardry` to get it to work the same, which I haven't got round to coding yet; i.e. it would be better to have:

$ jcache stage-nb --assets asset1 asset2 PATH/TO/NB.ipynb

But you need to get --assets to not be 'greedy' and consume the last argument as well. This is a semi-standard pattern for that (with --- as a delimiter).

$ jcache stage-nb --assets asset1 asset2 --- PATH/TO/NB.ipynb

For the API, I've made specifying assets very general, so you specify each one individually. But for practical use cases, and for a more automated approach, you probably want a simpler heuristic, e.g. that all assets (if any) are in a folder, with the same name as the notebook, and are read automatically. Do you think this makes sense?

To that extent, I feel like it would be less cognitively-burdensome to have three verbs at the user's disposal:

  * `stage`
  * `execute`
  * `commit`

It feels awkward to have commit-limit as one of the top-level user-facing commands. Do we expect that users will often use this verb? If not, or if we imagine similar functionality where we want users to control the global behavior using the CLI, perhaps another verb could be jcache config <verbs-to-control-config>?

On both these point, I would envisage the 'final' CLI to have a nested structure, i.e. there would be top-level commands/groups (stage, commit, execute, config), then potentially sub-commands, like:

$ jcache config commit-limit
$ jcache stage list
$ jcache stage nbs

This keeps the CLI 'tidyier', with the only con being that you have to go through this extra layer to get to certain commands. Does that sound reasonable? Note with click-completions (see below) its quite easy to quickly access these commands via tab-completion: j<tab> s<tab> n<tab>

A convenience verb like jcache status would be helpful. I found myself naturally wanting to type it since we are using so much git terminology.

Sounds reasonable, what information would you expect to receive from this command? Also, given the tab completion I mention above, it would be better to call this something else that doesn't 'clash' with stage.

The execute docstring should be "execute staged and outdated notebooks", right?

👍

I'm also feeling that jcache is a lot of letters to type if I am doing a lot of typing etc...I wonder if there is a more short-hand entrypoint to the CLI we can think of.

Yeh I couldn't think of a good 4-letter word. But at the same time, you can just tab complete anyway. I'll probably add click-completions, which also makes tab completion available for all sub-commands and arguments.

...

@chrisjsewell
Copy link
Member

If I execute a notebook with an empty cell, then I get the following error because the cell wasn't executed:

jupyter_cache.base.NbValidityError: Expected cell 2 to have execution_count 2 not None

After then deleting those empty cells and re-running jcache execute, I am now getting this much more confusing error...

Ah I didn't realise empty cells didn't get an execution count; that can be fixed. The idea of the NbValidityError is that it is a simple sanity check; that you are committing notebooks that have actually been executed properly. It probably needs some extra work for edge cases like this. Note this can be by-passed by:

$ jcache commit-nb --no-validate

but, running jcache execute again seems to now not have any errors. I think this is because the second time I ran it, it didn't try to execute the notebook. The error happens again if I change the notebook code cell.

We'll have to encapsulate this issue in a unit test, then I can fix it 😄

Does commit-nb strictly require that a notebook be executed? Or can you do a straight jcache stage-nb myntbk.ipynb -> jcache commit-nb?

All notebooks should only be committed if they are in a successfully executed state. It doesn't necessarily have to have been executed via jcache execute, if you have a notebook that you've executed by hand. Again with cache commit-nb --no-validate you can turn off checks.

jcache-execute should say how many notebooks it skipped and executed in the "finished" statement. Something like Finished executing 3 notebooks (5 non-stale notebooks skipped)

Yep 👍 Also with #14, I envisage additional reporting being stored (including failed notebooks). So, after execution, all the notebooks will still be staged, but doing jcache stage list or jcache stage show would give you extra information about its last execution (if available).

It's unclear to me what cat-artifact does. The docstring doesn't disambiguate what PK or ARTIFACT_RPATH is. I'm assuming PK is "Primary Key", but that is database speak not human speak :-)

This probably wouldn't be a generally used feature. It is just to print out the content of an artefact (see explanation below), with the PK being the pointer to the committed notebook and the ARTIFACT_PATH being the path it was 'found' relative to the notebook.

I'm also not sure what an "artifact" is...is it something generated from running code?

Yes. I do explain this a little in the README, but it is essentially any file present in the notebook's folder after execution (that wasn't there before), e.g. if you've used open("path", "w") in your notebook, or this could even be the final kernel state of the notebook (pickled by dill).

diff-nb - I'm not sure when a notebook is officially "cached" or not. Is it when I run jcache execute? Or do I have to explicitly do the 'commit'?

Yes jcache execute should/does 'auto-commit' all successfully executed notebook. You can also commit notebooks manually, outside of jupyter execute (see above).

commit-nb - based on the logic of previous steps, I was assuming that this would just "commit everything that is staged right now". I think this is because we are overloading git language. Another suggestion from me that we might this word to be cache not commit to make it explicit when cacheing is happening, and to avoid confusion with git.

Yep good point re: git confusion. At present 'staged' notebooks are just a pointer to the path of the actual notebook outside of the cache. So after jcache execute these notebooks aren't actually modified; the executor just commits the notebook(s) it has created, then the next time you check jupyter staged-list it will now be able to match (via hash) to a committed notebook. So I guess now there is a difference between what git/jcache means by staged:

  • git staged: this file is staged for a commit, and will be commited when git commit is run
  • jcache staged: this notebook is staged for execution, and will be executed when jcache execute is run (if it doesn't already exist in the cache).

So here staged actually has nothing to do with commit. Perhaps if commit was renamed to cache (as you've suggested), then people would not 'think' of git when the term stage is used. Outside of stage, I can't really think of a better word to describe it.

I can't go any further because any time I try executing the notebook, I'm getting TypeError: the JSON object must be str, bytes or bytearray, not int and any time I then try to commit the notebook, it warns me that the notebook wasn't executed.

This may be to do with the above points. Will adress this after they have been cleared up.

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 29, 2020

Here's a visual representation on the execution flow that (maybe!) will help:

image

Here the 'staged' notebooks are not modified by execution, and so should not be committed.

@chrisjsewell
Copy link
Member

chrisjsewell commented Feb 29, 2020

And here's the extension, for saving failed notebooks, mentioned in #14

image

Failed notebooks will be removed when the staged DB record is removed or re-executed

@AakashGfude
Copy link
Member

Was trying the CLI tool and needed a few clarifications as a dense user :) :-

It seems like there's two ways of saving an ipynb to the cache, one going through the staging environment, and the other to directly add it to the cache using jcache cache add-one. Since the execute command works on the staged files, the one added directly cannot be executed. Is this direct addition to cache functionality meant for cases like adding notebooks executed elsewhere?

I feel that if we are following git philosophy then it's better to have one flow, where it goes through the staging phase?

While saving the same notebook through these two different flows, I found that the DB can have duplicate entries, i.e, entries with the same URI, which creates confusion.

Wonderful tool bdw @chrisjsewell

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 3, 2020

These are the intended use cases yes, generally you shouldn’t need to add directly to the cache.

To clarify, the DB will have duplicate URIs. Every time you change a notebook and re-run the old run will still be there until it is cleaned. The uri does not denote unique notebooks it is just there as a record of where the notebook came from. After entry notebook are matched by hash and don’t care about the uri

(FYI, I’m literally about to take off, so won’t be replying for awhile!)

@choldgraf
Copy link
Member Author

Taking another pass now, will give my thoughts in the same long-form way as before!

  • I really like the new CLI layout with groups
  • I still feel a clash about add-many vs. add-one. What if add-many became just add, and add-one became add-with-artefacts?
  • So is the "staging" step purely for executing notebooks, and independent of caching? If so, I wonder if we should just use staging for the Python API, and keep the CLI to only using jcache execute. Are there any cases where one would want to use the CLI to stage a notebook without executing? I think there are use-cases where people would want to do this, but that it'd be fine for people to do it from the Python API. WDYT?
    • Alternatively, we could put execute as a sub-command for stage. Then it'd be jcache stage execute.
  • To that point, my intuition was that running jcache execute path/to/notebook.ipynb would work. But it looks like it expects things to be stored in staging first?
  • Could the jcache stage w/ arguments command be a short-hand for add-many? I think given jcache execute is a verb, jcache stage feels like it should also be a verb. (probably same w/ jcache cache)
  • If I execute a notebook and there were no errors or exceptions, then it prints excepted: []. This could be a more user-friendly None or something like this.
  • After running jcache stage add-many then jcache execute, I assumed I needed to run jcache cache myntbk.ipynb, but then it gave an error that the cells weren't run...so now I'm confused. I assumed that jcache cache would figure out that I had run the notebook.
  • OK now looking more closely at the docs, it seems like jcache execute will also cache the run notebooks. Is that right? In that case, why would a user want to run jcache cache? I think that cache is the primary verb that a user will think to run, so I am worried about having it as a top-level command if users won't run it often directly (but instead via the staging/execution step)
  • Once stuff is cached...what do users do next? They have this cache w/ notebooks, maybe artefacts, but what do we expect them to do with it? Or is this something that we'd expect to programmatically access with tools like our CLI?

So again trying to be naive about the workflow, this is what I'm intuiting from the CLI:

  1. A person creates a notebook
  2. They run jcache stage add-many myntbk.ipynb. It is now staged and ready for execution.
  3. They run jcache execute. It will execute the notebook and store the outputs in the stage cache.
  4. They run jcache cache add-many myntbk.ipynb, it will add the notebook outputs to the cache after a check.
  5. They edit the code cells of the notebook.
  6. They re-run jcache cache add-many myntbk.ipynb it...what? errors because the code cells have changed?

OK now after having looked at the docs a bit more, I now think this is the expected use-case:

  1. A person creates a notebook
  2. They run jcache stage add-many myntbk.ipynb. It is now staged and ready for execution.
  3. They run jcache execute. It will execute the notebook and store the outputs in the actual cache.
  4. Now we have two sets of notebooks - the ones in the cache, and the ones in the stage that we just executed.
  5. They edit the code cells of the notebook.
  6. They re-run jcache execute, it checks whether the notebooks have changed and decides whether to re-execute and cache.

Does this sound more correct? I'm leaving the first set of (incorrect) steps here so you can see where I was getting tripped up. And I know you were explaining a bit of this above, but again I'm being intentionally dense to show where stuff might not be intuitive.

But in general, this is definitely an improvement, I was able to get this working and figure out what's happening with the CLI much faster 👍

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 5, 2020

I really like the new CLI layout with groups

😀

I still feel a clash about add-many vs. add-one. What if add-many became just add, and add-one became add-with-artefacts?

Yep that can be done. Simlilarly this could be done for staging add-one -> add-with-assets

So is the "staging" step purely for executing notebooks

Yes, the staging solve two 'problems'

  1. To control what gets executed: only staged notebooks are 'available' for execution
  2. As a way to record failed execution data: Obviously the core cache only saves successfully executed notebooks and associated data. But if the execution fails, it is helpful to have a record of this. If you look at the README, you'll see that execution exceptions tracebacks are now back-populated to the staged records, and eventually the failed notebooks will also be (transiently) stored and linked to these records.

If so, I wonder if we should just use staging for the Python API, and keep the CLI to only using jcache execute

No, I think this is too restrictive and would make the execute command too complex, since it would have to then also implement the stage logic. You also lose the ability to (a) iteratively build up the execution list (e.g. using add-with-assets) and (b) later inspect the execution result, as in (2) above.

Alternatively, we could put execute as a sub-command for stage

If anything, this would be more sensible IMO

To that point, my intuition was that running jcache execute path/to/notebook.ipynb would work. But it looks like it expects things to be stored in staging first?

Yes.

I could add a shortcut like this; whereby there will be a prompt to confirm that this will wipe the currently staged notebooks, then add this one, before executing. But obviously this will increase the complexity of the jcache execute function (the goal is that the CLI functions should only be thin wrappers round the backend API)

Could the jcache stage w/ arguments command be a short-hand for add-many?

No you can't mix CLI groups with CLI commands. Think of groups (like stage) as classes and commands (like add-many) as their methods.

If I execute a notebook and there were no errors or exceptions, then it prints excepted: []. This could be a more user-friendly None or something like this.

Yeh that's possible

After running jcache stage add-many then jcache execute, I assumed I needed to run jcache cache myntbk.ipynb, but then it gave an error that the cells weren't run...so now I'm confused. I assumed that jcache cache would figure out that I had run the notebook.

No. As I mentioned to @AakashGfude above. Successfully executed notebooks are automatically cached. I will add a message(s) in jcache execute to make this clear.

The staged notebook records are just references to the source files: jcache execute does not modify them! Therefore, the content of, myntbk.ipynb will be no different before/after jcache cache (i.e. there will still be no outputs in it).

why would a user want to run jcache cache

(a) To inspect what is in the cache, and related execution statistics
(b) To directly add notebooks to the cache, that have been executed outside jcache execute

Once stuff is cached...what do users do next?

As mentioned above, they can inspect the cache for execution statistics etc.
But I guess the main use is to create 'merged' notebooks; containing 'up-to-date' text cells from the 'source' file (text or notebook based) and up 'up-to-date' code cells (+ execution specific metadata) from the cached notebook. Currently this is only available in the API (see merge_match_into_file), but could also me added to the CLI.

OK now after having looked at the docs a bit more

I can't be accountable for people not reading the docs properly 😜

Now we have two sets of notebooks - the ones in the cache, and the ones in the stage that we just executed.

Again, It's important to note that there is no physical store of 'staged' notebooks, they are just a reference to the source file URIs.

Does this sound more correct?

Generally yes though 👍

@mmcky
Copy link
Member

mmcky commented Mar 5, 2020

my 2c. I think the confusion with stage as a name is that to me it indicates it is staged for organising commits to the cache but rather it effectively is a stage manager to collect items for the executor (with automatic entry in the cache). Is that right?

if we adopted a git style design -- we could stage notebooks. Then the stage could determine if they need executing based on checking notebooks for output or some configuration settings. If a notebook executed correctly it would be added to the cache and notebooks that returned issues would remain in the stage phase?

I think this is what @AakashGfude had in mind with removing direct entry into the cache and requiring everything to go through the stage pipeline.

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 5, 2020

I think the confusion with stage as a name is that to me it indicates it is staged for organising commits to the cache but rather it effectively is a stage manager to collect items for the executor (with automatic entry in the cache). Is that right?

Yep, it does now specifically state this in the CLI:

$ jcache -h
...
    stage    Commands for staging notebooks to be executed.

if we adopted a git style design -- we could stage notebooks. Then the stage could determine if they need executing based on checking notebooks for output or some configuration settings. If a notebook executed correctly it would be added to the cache and notebooks that returned issues would remain in the stage phase?

Well yes, that's basically the current design? A subtle difference I note though is that you are implying that notebooks that executed correctly should be removed from the stage phase.
I don't agree with this, because it means that you have to re-add everything to the staging phase every time you want to re-execute.

I think this is what @AakashGfude had in mind with removing direct entry into the cache and requiring everything to go through the stage pipeline.

Yeh possibly, although equally you shouldn't be forced to go via jcache execute to add a notebook to the cache.

@choldgraf
Copy link
Member Author

Could that confusion be improved by my suggestion to put the execute command inside the stage group? I think that'd make it clearer that execute only makes sense in the context of stage and not cache

@choldgraf
Copy link
Member Author

choldgraf commented Mar 8, 2020

@chrisjsewell giving another shot at figuring out the merging etc behavior. Thanks again for putting this package together, I think it's pretty handy! I think I have enough context to play around w/ the jupyter-book implementation.

Here are a few questions/feedback/etc from me:

I've managed to get multiple copies of a file in my cache:

  ID  URI                                  Created           Accessed
----  -----------------------------------  ----------------  ----------------
   3  jupyter-cache/docs/viz.ipynb         2020-03-04 23:06  2020-03-04 23:06
   2  jupyter-cache/docs/viz.ipynb         2020-03-04 22:58  2020-03-04 22:58
   1  jupyter-cache/docs/controller.ipynb  2020-03-04 22:52  2020-03-04 22:52

Is this expected behavior, because this is how we keep track of multiple copies of a file over time? From a UI perspective I found it a bit confusing, I was expecting one entry per file.

Now using the Python API:

  • I assume the main usage point is from jupyter_cache.cache.main import JupyterCacheBase?

  • For JupyterCacheBase, I was assuming the input path was to the folder where the cache was located, as opposed to the actual cache. Maybe this could have a quick is_dir check and, if so, check if a .jupyter_cache exists in the folder and load that if so.

  • For showing the staged/cached notebooks, it'd be helpful if there were some kind of "pretty print" option that spits out an output similar to the CLI. Maybe part of the repr for the class?

  • Merging in the outputs smooth! It was just:

    db = JupyterCacheBase('./.jupyter_cache/')
    pk, ntbk = db.merge_match_into_notebook(ntbk)
    

    right?

  • If I try to merge in to a notebook that's not in the cache, I get KeyError: <hash>. This should probably be a more user-friendly error like "The notebook you gave doesn't exist in the cache"

I hope these comments help guide API etc design, and let me know if you'd like me to open issues for some specific ones. In general this is looking great, many thanks for your hard work in getting this ready for testing.

@chrisjsewell
Copy link
Member

Is this expected behavior, because this is how we keep track of multiple copies of a file over time?

Yes, the URI is only a record of the original URI that the notebook was read from. After the read notebooks are only distinguished by their hash; the URI is just there to give some context. Perhaps if the table header was something like "Origin URI", that would make that clearer?

I assume the main usage point is from jupyter_cache.cache.main import JupyterCacheBase?

👍 (I should probably make this less deep into the package, but I recall there may be an issue to overcome with cyclic dependency)

For JupyterCacheBase, I was assuming the input path was to the folder where the cache was located, as opposed to the actual cache. Maybe this could have a quick is_dir check and, if so, check if a .jupyter_cache exists in the folder and load that if so.

I don't really see what this improves? .jupyter_cache is a directory so is_dir wouldn't work, also there's no hard dependency on the cache folder being called .jupyter_cache (in fact in the CLI you can set the path to it via the environmental variable JUPYTERCACHE). Why would you want to only specify the parent folder, rather than just directly the actual cache?

For showing the staged/cached notebooks, it'd be helpful if there were some kind of "pretty print" option that spits out an output similar to the CLI. Maybe part of the repr for the class?

Yep, can just upstream the CLI function into an API method 👍

Merging in the outputs smooth! It was just: right?

👍

If I try to merge in to a notebook that's not in the cache, I get KeyError: . This should probably be a more user-friendly error like "The notebook you gave doesn't exist in the cache"

👌

I hope these comments help guide API etc design, and let me know if you'd like me to open issues for some specific ones. In general this is looking great, many thanks for your hard work in getting this ready for testing.

👍 👌 😄 🎉

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 8, 2020

@choldgraf
Copy link
Member Author

None others than I can think of! I will try building this in to myst-nb tomorrow so maybe that'll uncover some other thoughts.

I think the question about "should it be a direct path to the cache or to the parent folder?" is something I feel 50/50 on. As I mentioned above I was mostly just trying to document what I intuitively assumed vs. what the reality turned out to be. I agree that we should assume documentation can clear these things up, but we also might as well make the API match intuition as much as possible. Perhaps @mmcky or others could weigh in on that particular design decision, I think it's worth just having more people try it out.

@AakashGfude
Copy link
Member

None others than I can think of! I will try building this in to myst-nb tomorrow so maybe that'll uncover some other thoughts.

Hey @choldgraf , here is a minimal implementation of this being build tn to myst-nb :- executablebooks/MyST-NB#55 . Sorry, for the late PR.Got caught up in university life 😅.

@chrisjsewell , regarding #29 , you already have an easy util function to access it :- https://github.com/ExecutableBookProject/jupyter-cache/blob/master/jupyter_cache/cli/utils.py#L12 . So, the issue might not be necessary?

@chrisjsewell
Copy link
Member

See 14f7a90 for fixes to some of the issues brought up

@chrisjsewell
Copy link
Member

See https://jupyter-cache.readthedocs.io/en/latest/using/api.html for a full rundown of the API, and let me know any feedback 😄

@chrisjsewell
Copy link
Member

Anyone free to propose a new logo (I just quickly found that from google), also @choldgraf the left sidebar seems to act poorly on a number of the pages; is this an issue to be raised with the pandas-sphinx-theme?

image

@choldgraf
Copy link
Member Author

Nice - I'll try giving another run-through on the API docs soon!

let's think on a logo, copypasta from google is fine in the meantime :-)

And re: the sidebar thing, I think that is pydata/pydata-sphinx-theme#118 - I'm hoping somebody in the pandas world with more bootstrap-fu than I do can figure it out...

@jstac
Copy link
Member

jstac commented Mar 13, 2020

@DrDrij, do you have time to work on a logo for this project? You can get a sense of what it's about from the docs.

@DrDrij
Copy link

DrDrij commented Mar 13, 2020

@DrDrij, do you have time to work on a logo for this project? You can get a sense of what it's about from the docs.

Sure thing :) If there are any preferences in terms of colour pallet, shapes, or logos anyone really likes that can provide as a guide - please speak up :D

@chrisjsewell
Copy link
Member

chrisjsewell commented Mar 13, 2020

Sure thing :) If there are any preferences in terms of colour pallet, shapes, or logos anyone really likes that can provide as a guide - please speak up :D

Well I guess what we maybe want is new logos for all the EBP packages 😁 with a consistent theme, colour pallet, etc:

@jstac
Copy link
Member

jstac commented Mar 13, 2020

Would love your help with this @DrDrij --- funded by NumFOCUS of course.

For context, these projects form the meat and potatoes of EBP, shiny new version of jupinx.

@choldgraf
Copy link
Member Author

I've found the jupyter brand guide to be a helpful way of thinking about design in the jupyter ecosystem https://github.com/jupyter/design/tree/master/brandguide

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants