Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The road to a custom parser #1363

Open
2 of 3 tasks
jacqueswww opened this issue Mar 20, 2019 · 23 comments
Open
2 of 3 tasks

The road to a custom parser #1363

jacqueswww opened this issue Mar 20, 2019 · 23 comments
Labels
enhancement work in progress Work on this PR or issue is not yet complete but reviewers are free to add their input for guidance.

Comments

@jacqueswww
Copy link
Contributor

jacqueswww commented Mar 20, 2019

The road to a custom parser

As we have discussed this at lengths as parts of weekly calls, this is the proposed road map to swap out the current python standard library. Also lots of discussion on github (see #563).

  • Step 1
    Create own custom AST node classes to represent our AST:

  • Step 2
    After having our own AST classes (and improving readability by seeing where nodes originate from).

    • create serializer of our AST nodes, preferably JSON (both encode & decode)
    • add type annotation to each class (where appropriate)
    • create a type checking step straight on the VyperAST nodes
    • add any other needed AST functionality on classes directly
  • Step 3
    Implement other parses that produce or translate to VyperAST classes.

    • This is quite open but allows us to choose any type of parsing library or parser generator at this point, without making vyper code generator dependant on the library.

@jacqueswww
Copy link
Contributor Author

I plan on starting with Step1 some time next week ;) let me know if I should add anythin.

@jacqueswww jacqueswww added enhancement work in progress Work on this PR or issue is not yet complete but reviewers are free to add their input for guidance. labels Mar 20, 2019
@pipermerriam
Copy link
Contributor

Some naming suggestions to closer conform to what I think are standard-ish names (feel free to ignore any of these an use your best judgement).

  • GtE -> Ge
  • Compare -> RelOp
  • Mult -> Mul
  • LtE -> Le
  • NotEq -> Ne

And maybe missing

  • Else
  • ElseIf

Forgive any lack of context on my part for these suggestions. Very much like the direction of this.

@jacqueswww
Copy link
Contributor Author

jacqueswww commented Mar 20, 2019

@pipermerriam definitely keen to renaming the classes to be more sensible (and less abbreviated) ! Will add link to this on Step 1 (or maybe will make it part of Step 2, to make the PR smaller - will see).

Side Note: Else is quite interesting, it's all part of If currently using ast.If.orelse .

@davesque
Copy link
Contributor

@pipermerriam Yeah, I would expect else to be a property on an If ast object; rather like body is a list of statements on a function definition object.

@fubuloubu
Copy link
Member

I think it is noted, but exposing an API to get the AST for a given source program would be quite handy for Step 1.

For Step 3, my wishlist might include being able to "extend" the Vyper module by inserting tools "in between" stages of the compiler, such as advanced static analysis on the AST that may be able to reduce the AST nodes through some custom rewrite rules, then inserting the optimized AST back into the next stage of the compiler (code translation to IR). Similar idea for the IR as well (for example, symbolic analysis-based optimizations that exhaust all possible paths and remove assertion checks as impossible)

@davesque
Copy link
Contributor

@pipermerriam Also, not sure if you noticed but I think @jacqueswww chose those names largely based on the naming scheme in the python standard ast library.

@pipermerriam
Copy link
Contributor

👍 on ignoring my name suggestions, or only using the ones you like. I don't have strong feelings on any of them.

@davesque
Copy link
Contributor

davesque commented Mar 21, 2019

I've begun some experimental work on a parser using lark. I've got it hosted here for the time being. It might also be wise to create a CST parser using TatSu since that seems like a very good potential alternative to lark. However, I'm using lark for now since it was easier to get up and going with it.

I've currently got a working parser that parses code files into a CST. I've begun adding some custom AST classes so that I have something to convert the CST into. After that's done, some routines will need to be written to convert the CST into an AST represented with AST class instances. This file in the cpython implementation should act as a good guide for that: https://github.com/python/cpython/blob/master/Python/ast.c .

@davesque
Copy link
Contributor

Also, a note about my repo that I linked to. It's really just meant to act as a place to store code that I'm hacking on. So there are no tests and no real structure to the repo. So don't freak out :).

@davesque
Copy link
Contributor

davesque commented Mar 23, 2019

Custom Vyper AST classes in the vyper_parser repo are done (see here). Currently, they attempt to mirror the naming scheme and structure of the ASDL-derived Python AST classes as precisely as possible.

As a quick aside, Python's ASDL definitions are found here. CPython uses a specific grammar to define its collection of AST classes. The C implementation parses the ASDL grammar and uses it to create all the AST classes using the C API.

Next, I'll work on defining some routines to convert the lark parse tree into an AST composed of our custom AST classes. Currently, I'm targeting the entire Python AST. I think this will be useful in verifying the correctness of the implementation. Our test routines can just compare the results of Python's native ast.parse function to our custom parser for all python files in the standard library. After this is done, we can cut back the definition to just what we need.

This was referenced Mar 25, 2019
@davesque
Copy link
Contributor

I went ahead and migrated my code into a proper project (found here). I don't necessarily intend to have this work exist in a separate package, but I wanted to take advantage of our project template test rigging. I added a test that automatically generates fixtures from python files in the standard library.

@davesque
Copy link
Contributor

Here's a fun development from today. For the string literal parsing portion of our custom parser, I'm going to just defer to python's parsing facilities. Parsing string literals is pretty complicated and I think we can safely just use python's parser and convert the result to a tree of our custom AST classes. In order to do this, I needed to write some code to convert python AST trees into trees of our Vyper-specific AST classes. Here it is: https://github.com/davesque/vyper-parser/blob/master/vyper_parser/ast.py#L59

It passes tests for all python files in my standard lib directory: https://github.com/davesque/vyper-parser/blob/master/tests/test_ast.py#L21

This is something we could potentially begin using today if we're interested in converting Vyper to using custom AST classes. Although I'd probably recommend holding off a little longer in case I want to change some things before committing to the API that I've come up with.

@davesque
Copy link
Contributor

The VyperAST.from_python_ast method I just mentioned has one potential issue at this point. It automatically infers the proper VyperAST subclass to use for an instance of python's standard ast.AST subclasses. It does this by automatically collecting all subclasses of VyperAST. This means that someone could potentially create one of their own subclasses of VyperAST and get weird behavior when it happens to shadow an existing class name. This can be fixed by just explicitly creating a mapping from python AST classes to vyper AST classes. Just wanted to make a note of that here.

@jacqueswww
Copy link
Contributor Author

@davesque very cool! It's coming together nicely, I like the layout of the VyperASTs - I suggest we adopt those, but we start with empty the _fields on each classes init. Then we only handle the fields we need :)

@rocky
Copy link

rocky commented Apr 1, 2019

Added a comment to https://gist.github.com/jacqueswww/aa24d22d95a578429e0f25f5f2de0b36 which I won't duplicate here. This gist is to follow solc's AST where it makes sense (e.g. in how to specify a location in a general, flexible and compact way), that maybe some thought should be given to how this is specified when writing to JSON so as to be compatible again with solc's JSON AST.

@jacqueswww
Copy link
Contributor Author

@rocky this id field is unique per node? Or can nodes share ID's, if not I am not sure how would one would make the deterministic - unless there is a global auto increment counter on each file compiled (global context in our code base).

@davesque
Copy link
Contributor

davesque commented Apr 1, 2019

@jacqueswww I think all of that is clarified in the link that @rocky posted in his gist comment: https://solidity.readthedocs.io/en/develop/miscellaneous.html#source-mappings

@fubuloubu
Copy link
Member

Really cool extra for hypothesis that can generate strategies for a lark grammar: https://hypothesis.readthedocs.io/en/latest/extras.html#hypothesis-lark

@fubuloubu fubuloubu mentioned this issue Jul 15, 2020
3 tasks
@fubuloubu
Copy link
Member

Thanks to @iamdefinitelyahuman for helping us achieve Step 2 here!

@jacqueswww
Copy link
Contributor Author

Thanks to @iamdefinitelyahuman for helping us achieve Step 2 here!

excited

@0xalpharush
Copy link

I think this is a worthwhile investment. Many of the issues I've opened recently have to do with trying to use the Python AST while supporting syntax in Vyper that is semantically different than the node produced by the Python parser. I think it would simplify the compiler by making AST nodes single purpose, removing the need to interoperate with changes to the Python AST. It would also mitigate issues like #3475 since a custom parser would require types and not backfill the AST of a dynamically type language with type annotations. I would imagine there's also the added benefit of no longer having to worry about anything related to Python when discussing features and syntax.

@fubuloubu
Copy link
Member

Definitely something we want to do, we have a Lark grammar that's been tested and a part of the core now for over a year. We haven't had issues with it for a while.

The biggest problem with custom grammar is that we would get a ton of parsing issues grieving the users, the python ast solves that by being extremely well used and debugged. Hopefully could work on switching over to it as a part of a major revision

What's still missing is that even though the parser parses reliably, we have not yet done the work of having it produces Vyper AST nodes or checked for equivalency with the current parser

That would be next steps before we can swap it in
(Taking any contributors)

@charles-cooper
Copy link
Member

I would imagine there's also the added benefit of no longer having to worry about anything related to Python when discussing features and syntax.

This is a double edged sword, really. Actually the main reason I haven't gotten rid of the existing parser flow so far is not technical, it's that in some sense it "keeps us honest" -- we can't really break away from python syntax too easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement work in progress Work on this PR or issue is not yet complete but reviewers are free to add their input for guidance.
Projects
None yet
Development

No branches or pull requests

7 participants