Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stricter lark grammar for notebook documents #143

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

tonyfast
Copy link
Member

this pull request modifies the json parser to use explicit properties from the notebook format to constrain the parser.

the top-level notebook keys and cell keys are known ahead of time because of the schema that notebooks conform to. these changes hard code the notebook and cell properties into the parser.

this change will NOT work for arbitrary data formats, only those conforming to subsets of the specified schema. otherwise the parser throws errors.

this change streamlines the transformation logic.

@tonyfast
Copy link
Member Author

tonyfast commented Nov 8, 2023

i'm worried about merging this one when it does land. i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.

@tonyfast
Copy link
Member Author

tonyfast commented Nov 8, 2023

this parser is super strict at the notebook and cell levels. after that the parser is free parse, but it will only pick up the source parts at the moment.

i'd like to harden it up with more tests. some questions i have are:

  • are there more tests i could add to check more cases or provide assurance?
  • does introducing hypothesis change the tune when considering parallelism in the tests?

@tonyfast tonyfast requested a review from bollwyvl November 8, 2023 07:05
@bollwyvl
Copy link
Contributor

bollwyvl commented Nov 8, 2023

i'm worried about merging this one when it does land.

If not architecturally impossible, a common pattern when making something stricter would be to build, test, and ship both parsers, and initially make the strict one opt-in, then the default, then, maybe, deprecate the old one.

The parsers could be registered with entry_points, which would also allow a downstream to ship their own parser, prebuilt or otherwise. Indeed, having lark in the loop would be very interesting indeed for the interactive case.

In addition to clear docs on how that works in e.g. README.md (and probably a section in CHANGELOG.md), the API might then look like:

with Notebook():
    import a_somehow_bad_notebook
    # UserWarning: 
    #    a_somehow_bad_notebook failed strict parsing, on line {line}, column {line}
    #
    #        {lark error}
    # 
    #    The loose parser was used, but {features} will not be available. Silence this warning with:
    #    
    #        with Notebook(parser='loose'):
    #
    #    see https://github.com/deathbeds/importnb/issues/{number}

Then offer:

with Notebook(parser="loose"):
    import a_somehow_bad_notebook

And allow opting-in to hard fails:

with Notebook(parser="strict"):
    import a_somehow_bad_notebook
    # ImportnbParseError: a_somehow_bad_notebook.ipynb failed to parse on line {line}, column {line}
    # {lark error}

Where ImportNbParseError is a subclass of ImportError (or something).

i think i'd feel better about the support if i ran the parser through some hypothesis schema jawn.

hypothesis is always a win.

does introducing hypothesis change the tune when considering parallelism in the tests?

Yes, hypothesis-based suites can benefit from parallelism with pytest-xdist, if run together with others suites.
In CI, I'd probably treat it as a separate set of tests, either in a different folder or markers, which wouldn't even run if the fast tests fail, and likely just as part of a the fastest job (e.g. only on linux/3.12), and also enable a --cov-fail-under threshold.

But the above is irrelevant until the existing tests all run cleanly, ideally without doing anything to disk, sys.path tricks, etc.

@tonyfast
Copy link
Member Author

tonyfast commented Nov 9, 2023

yea i love this idea. i think i can manage it in this PR. we'll have to build two different standalone parsers to ship for a few months then sunset the more permissive one.

ultimately, i think this should work out cause both the notebook and cell levels define additionalProperties: False. as long as all the properties are captured in the grammar we should be good. there is a case where folks synthetically generate notebook like structure or introduce future keys like $schema or @graph @context. in these cases, we'd need a permissive flag turned on if an only if this synthetic data is stored on disk. in theory, the stricter grammar could host both a strict and permissive version. in the permissive version we all for additionalProperties by supporting generic keys.

@bollwyvl
Copy link
Contributor

bollwyvl commented Nov 9, 2023

two different standalone parsers to ship

Yeah, this doesn't seem like a huge hit to avoid breaking things, and getting new features.

months then sunset the more permissive one.

Right, as long as there is clear messaging from the as-installed software that this is going to happen, and a timeline/trigger (e.g. first release that supports 3.13) for when. The ad-hoc CalVer scheme can make it hard to plan for these things in any way other than after this day... but... can't go back now without a package rename or creating huge problems downstream.

generate notebook like structure or

Sure, as long as not losing features for the 99% use case of nbformat-compliant structures. And further: formalizing and shipping the .g files, it would be possible to make extensible grammars in interesting ways.

So if some use case requires going wild, it could subclass the machinery (even interactively), by overloading the right level of the grammar/decoder/parser/transformer/whatever (these need some .mmd to explain them, perhaps) and passing that instead of an entry_point name, without having to mess about with entry_points or the standalone tooling generator.

If the anticipated needs are very predictable, having a more constrained API Notebook(additional_properties={"$schema": "object"}), etc. could make this feasible, provided importnb[lark] is installed, providing for a few simple cases.

A more formalized system could package and register it by normal means, which should be kept if added.

introduce future key

Changes like that happen slowly, but not worth losing features. We could add an nbformat canary build excursion that installs from upstream's main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants