Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AST infra rewrite #1462

Merged
merged 1 commit into from
Mar 4, 2024
Merged

AST infra rewrite #1462

merged 1 commit into from
Mar 4, 2024

Conversation

rsmmr
Copy link
Member

@rsmmr rsmmr commented Jun 16, 2023

I'm marking this as ready for review. I'll profile and polish a bit more, but generally the code should be stable at this point. The main missing piece is porting the Zeek plugin over to the new API, which I'll work on next.

This reworks the AST infrastructure to become much simpler in terms of
implementation and usage. At a high level, we get rid of (1) the
value-based storage of nodes through type erasure, and (2) all the
template magic implementing type erasure and visitors. We replace that
with a traditional "old-style" class hierarchy representing AST node
relationships, with memory management through smart pointers. ASTs are
now mutable as well.

See 1st comment below for some more detailed notes on changes
to the AST code.

See 2nd comment below for some preliminary benchmarks.

Note that when compiling existing analyzers, the new compiler is a bit more strict in what it accepts. While there aren't any language changes, it turns out that the old compiler ended up accepting some things that it shouldn't have. Specific notes for porting analyzers:

  • Identifiers from the hilt:: namespace are no longer available; usually you can just scope them with spicy:: instead and it'll work.
  • The old compiler didn't always enforce constness as it should have. In particular, some parameters could be mutable even though there weren't declared as inout. The inout is now required for all mutable operations on parameters.

@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from 1cd0ddb to 86312a0 Compare June 27, 2023 07:38
@bbannier bbannier mentioned this pull request Jul 6, 2023
1 task
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from 52a6458 to ef54a3d Compare July 10, 2023 07:21
@rsmmr rsmmr changed the title Proof-of-concept AST infra rewrite [WIP] Proof-of-concept AST infra rewrite Jul 17, 2023
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from d7f6af6 to 52e6db5 Compare October 19, 2023 09:44
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from 52e6db5 to beb8fe3 Compare November 6, 2023 17:36
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch 3 times, most recently from 87a29f3 to 4b4f889 Compare November 20, 2023 12:33
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from 4b4f889 to fe6caa9 Compare December 11, 2023 10:17
@rsmmr rsmmr changed the title [WIP] Proof-of-concept AST infra rewrite [WIP] AST infra rewrite Jan 10, 2024
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch 2 times, most recently from 8d23267 to dcc6e13 Compare February 9, 2024 12:37
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from dcc6e13 to 950444a Compare February 20, 2024 09:35
@rsmmr
Copy link
Member Author

rsmmr commented Feb 20, 2024

Main changes:

  • Switched to traditional class hierarchy for AST nodes, with
    Node at the top.

  • Modeling type constness with a new node QualifiedType, instead
    of coding it into all Type instances. Type has been renamed
    into UnqualifiedType for easier distinguishing and so that the
    compiler will catch existing usage of the old Type that hasn't
    been updated. We also use QualifiedType to track LHS/RHS
    semantics.

  • IDs are no longer nodes themselves, just node attributes like
    other atomic values.

  • We now instantiate all nodes through static factory methods
    called create() that take an additional global ASTContext
    instance, which maintains AST-wide state. To facilitate this
    approach, all node constructors are private now; the caller must
    go through the create() methods.

  • The AST context holds all modules inside a single, global AST
    (previously, we have one AST per module). The root of this
    single AST is a node ASTRoot, and the immediate children of
    that node are the modules (declaration::Module). This
    structure significantly simplifies cross-module processing.
    Visitors now traverse the full global tree, seeing all modules
    at once.

  • To make usage of the create() methods easier, there's a new
    NodeFactory class that provides forwarding methods. The
    factory stores the relevant ASTContext internally and
    automatically adds it during forwarding. This factory also
    replaces the old builder()->* functions for creating AST nodes.

    The forwarding methods are automatically generated through
    scripts/autogen-builder-api (which, in turn, uses a
    libclang-based tool bin/autogen-builder-api).

  • Nodes now keep pointers to their parents. That replaces the
    old position_t stuff.

  • No singleton nodes anymore, they don't play well with that
    parent pointering.

  • No node::None anymore, using nullptr instead. Likewise, no
    optional<SomeNode> anymore, using nullptr as well.

  • For visitors, we're using traditional double-dispatch with
    virtual methods. No more complex template magic, and no more
    result values from visitors either. There are just two visitor
    variants now for pre-order and post-order walking (and those two
    are still implemented through a joined template). Plus, there
    are HILTI and Spicy version of each visitor. The Spicy-version
    adds additional virtual methods for Spicy-specific nodes.

  • Unresolved types: Use Auto if the type is presumably still
    going to be resolved, and Unknown if the type has been
    determined to be finally unresolvable.

  • To render AST nodes in a way that's user-presentable, use
    print() methods. In contrast, dump() is limited to dump out
    debug information, in particular the AST

  • Resolver/coercer/normalizer have all merged into a new, single
    pass called "resolver". The previous separation was confusing
    and hard to maintain.

  • The passes that were formerly in compiler/visitors/*.cc have
    moved up a level and gained their own header files to declare
    their entry points. No more global.h either.

  • No cycles are allowed inside the AST. That means a child cannot
    point back to a node that's already elsewhere in the AST. That
    also means, each node does have a unique parent. When adding a
    node to an AST that already has a parent, the node is
    deep-copied automatically (see next bullet). In case something
    goes wrong, debug builds run a cycle detector to catch problems.

  • When adding a node to an AST (i.e., as the child of another
    Node), there are two cases:

    1. It's a new node instance that does not have a parent yet.
      In that case, we directly add a pointer to that Node as the
      child.

    2. It's a node instance that already has a different parent.
      In that case, we deep-copy the node first and add a pointer
      to the copy as the child.

    Note that when wanting to move a node from one location
    inside the AST to another, you can avoid the deep-copy by first
    removing it from its old parent (which will clear the parent
    pointer) and then adding it to the new parent.

    Conceptually, there are two main places where this logic
    happens: 1. when manipulating the child of a Node through the
    corresponding Node methods; and 2. when instantiating new Node
    and given them their initial children. Generally, this all happens
    automatically when going through the corresponding methods,
    which deep-copy on-demand as needed.

  • If we semantically do need a cycle, the solution is to store a
    reference (see next bullet) to the target inside an explicit
    node attribute that's not part of the normal parent/child
    pointering. Examples of nodes doing this are expression::Name,
    type::Name, and QualifiedType. The latter has the notion of
    "external" types where the wrapped UnqualifiedType is not a
    child, but stored somewhere else inside the AST; the
    QualifiedType just stores a reference to it, which it
    transparently unwraps when the UnqualifiedType is requested
    (through type()).

  • The "references" mentioned above are implemented through
    mappings inside the ASTContext. Currently, this is supported for
    declarations and types. First, one registers the
    declaration/type with the AST context. That assigns it a unique,
    stable index ID. That index acts as the reference that nodes can
    store as attributes. To dereference, one can later ask the context to
    provide the declaration/type that corresponds to the stored index.

  • Non-abstract classes derived from Node

    • Public create() factory
    • Protected constructors, called from create()
    • Protected copy/move constructors/assignment that perform shallow copies.
  • We do type comparison by computing a serialized version of each
    type, call "type unification". We then comparing those
    unifications as strings. An empty unification always compares
    false against anything else.

  • When comparing types, we distinguish between types that are
    compared structurally (most) and by name ("name types"; e.g.,
    struct and enums). A virtual type method isNameType()
    indicates of which kind a type is.

@rsmmr
Copy link
Member Author

rsmmr commented Feb 20, 2024

Some preliminary benchmarks (all with release builds):

Code generation (from .spicy to C++):

tests/spicy/types/unit/basic.spicy

master  0.18s real    0.18s user    0.00s sys
branch  0.05s real    0.04s user    0.00s sys
spicy-http/analyzer/analyzer.spicy

master  2.33s real    2.32s user    0.00s sys
branch  0.56s real    0.54s user    0.01s sys
spicy-dns/analyzer/analyzer.spicy

master  1.82s real    1.80s user    0.00s sys
branch  0.51s real    0.49s user    0.01s sys
<large internal analyzer>

master  78.23 real    76.49 user    1.61 sys
branch  36.53 real    29.36 user    5.38 sys

Build times for the toolchain:

master    1m55.72s real    14m5.54s user    35.75s sys
branch    1m24.47s real    12m6.85s user    37.35s sys

Max RSS while building the toolchain:

master    2407M
branch    1234M

Binary sizes:

          libhilti.dylib    libspicy.dylib
master    20.0M             9.8M
branch     9.1M             4.0M

@rsmmr rsmmr changed the title [WIP] AST infra rewrite AST infra rewrite Feb 20, 2024
@rsmmr rsmmr marked this pull request as ready for review February 20, 2024 10:52
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from dc47214 to 58dfb9f Compare February 23, 2024 09:19
Copy link
Member

@bbannier bbannier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a pass, but likely missed a lot. Overall this looks all much more natural and also much simpler to me, so really looking forward to using it. I am really a fan of the more accessible global information. Thanks for all the work.

Couple global points:

  • I believe the new factory functions could all take their args by value
  • Unless we have a good reason not to we should use std::make_shared in factory functions
  • I suspect there might be opportunities to have getters on nodes to return raw ptrs instead of *Ptr types since most often callers want to only inspect, but not share ownership (e.g., in the base classes for Expression::type, but also in many derived classes which call child which would need an another child function returning a non-owning ptr). I haven't checked in detail though.
  • We didn't heavily use final before and I am not sure we care about sealing interfaces. I don't have a strong opinion though.
  • This PR leaves the copyright headers in inconsistent state, could you make a pass updating them to e.g., 2020-2024?

scripts/autogen-builder-api Outdated Show resolved Hide resolved
scripts/autogen-operators Outdated Show resolved Hide resolved
scripts/autogen-builder-api Outdated Show resolved Hide resolved
scripts/autogen-builder-api Outdated Show resolved Hide resolved
scripts/autogen-operators Outdated Show resolved Hide resolved
hilti/toolchain/include/ast/type.h Outdated Show resolved Hide resolved
hilti/toolchain/include/ast/type.h Outdated Show resolved Hide resolved
hilti/toolchain/include/ast/type.h Show resolved Hide resolved
hilti/toolchain/include/base/cache.h Outdated Show resolved Hide resolved
hilti/toolchain/include/base/uniquer.h Outdated Show resolved Hide resolved
@rsmmr
Copy link
Member Author

rsmmr commented Feb 26, 2024

  • I believe the new factory functions could all take their args by value

I often followed clang-tidy recommendations, though honestly not always understanding them. Either way, I'm still planning to move away from shared_ptr to raw pointers, and most of these then become mood at that point, so I would prefer to leave as is for now.

  • Unless we have a good reason not to we should use std::make_shared in factory functions

make_shared cannot use the private constructors. But this will also go away when moving to raw pointers.

  • I suspect there might be opportunities to have getters on nodes to return raw ptrs instead of *Ptr types since most often callers want to only inspect, but not share ownership

Probably, though I'd hope the compiler can optimize much of that away. But once again same argument: we'll soon have only raw pointers anyways.

  • We didn't heavily use final before and I am not sure we care about sealing interfaces. I don't have a strong opinion though.

Yeah, I was considering that and ended up thinking it doesn't really matter. If external parties started depending on this , it would be different, but I don't see that right now, and would rather take the optimization potential.

  • This PR leaves the copyright headers in inconsistent state, could you make a pass updating them to e.g., 2020-2024?

Will do.

@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch 4 times, most recently from 70087ab to 7fc5fc0 Compare February 27, 2024 17:03
@rsmmr
Copy link
Member Author

rsmmr commented Feb 29, 2024

@bbannier ready for another review (except that it seems we might have still GCC problems on Debian 10).

@rsmmr
Copy link
Member Author

rsmmr commented Feb 29, 2024

Added one more commit to fix that Debian 10 issue.

bbannier
bbannier previously approved these changes Feb 29, 2024
hilti/toolchain/include/ast/node.h Outdated Show resolved Hide resolved
hilti/toolchain/src/ast/ast-context.cc Outdated Show resolved Hide resolved
hilti/toolchain/src/compiler/driver.cc Outdated Show resolved Hide resolved
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch 2 times, most recently from 46fb77f to 7fe5b35 Compare March 1, 2024 10:31
@rsmmr
Copy link
Member Author

rsmmr commented Mar 1, 2024

Squashed and rebased to main.

@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from 7fe5b35 to 414149a Compare March 1, 2024 16:42
This is a large revamp of compiler internals, cleaning up and speeding
up lots of the AST pipeline. From a user perspective, nothing changes,
except that the new compiler is a tiny bit more strict: turns out that
in rare cases the old compiler ended up accepting some ill-formed
Spicy code that is now (rightfully) rejected. Specifically, two
instances of this are known where existing Spicy code may need
tweaking now:

- Identifiers from the (internal) `hilti::` namespace are no longer
  accessibly. Usually you can just scope them with `spicy::` instead,
  and it'll work.

- The old compiler didn't always enforce constness as it should have.
  In particular, function parameters could end up being mutable even
  when they weren't declared as `inout`. Now `inout` is required for
  supporting any mutable operations on a parameter, so make sure to
  add it where needed.

For the record, the following is a summary of the main internal
changes between new and old archictures:

- Switched to traditional class hierarchy for AST nodes, with `Node` at the
  top.

- Modeling type constness with a new node `QualifiedType`, instead of coding it
  into all `Type` instances. `Type` has been renamed into `UnqualifiedType` for
  easier distinguishing and so that the compiler will catch existing usage of
  the old `Type` that hasn't been updated. We also use `QualifiedType` to track
  LHS/RHS semantics.

- IDs are no longer nodes themselves, just node attributes like other atomic
  values.

- We now instantiate all nodes through static factory methods called `create()`
  that take an additional global `ASTContext` instance, which maintains
  AST-wide state. To facilitate this approach, all node constructors are
  private now; the caller must go through the `create()` methods.

- The AST context holds all modules inside a single, global AST (previously, we
  have one AST per module). The root of this single AST is a node `ASTRoot`,
  and the immediate children of that node are the modules
  (`declaration::Module`). This structure significantly simplifies cross-module
  processing. Visitors now traverse the full global tree, seeing all modules at
  once.

- To make usage of the `create()` methods easier, there's a new `NodeFactory`
  class that provides forwarding methods. The factory stores the relevant
  `ASTContext` internally and automatically adds it during forwarding. This
  factory also replaces the old `builder()->*` functions for creating AST
  nodes.

  The forwarding methods are automatically generated through
  `scripts/autogen-builder-api` (which, in turn, uses a `libclang`-based tool
  `bin/autogen-builder-api`).

- Nodes now keep pointers to their parents. That replaces the old `position_t`
  stuff.

- No singleton nodes anymore, they don't play well with that parent pointering.

- No `node::None` anymore, using `nullptr` instead. Likewise, no
  `optional<SomeNode>` anymore, using `nullptr` as well.

- For visitors, we're using traditional double-dispatch with virtual methods.
  No more complex template magic, and no more result values from visitors
  either. There are just two visitor variants now for pre-order and post-order
  walking (and those two are still implemented through a joined template).
  Plus, there are HILTI and Spicy version of each visitor. The Spicy-version
  adds additional virtual methods for Spicy-specific nodes.

- Unresolved types: Use `Auto` if the type is presumably still going to be
  resolved, and `Unknown` if the type has been determined to be finally
  unresolvable.

- To render AST nodes in a way that's user-presentable, use `print()` methods.
  In contrast, `dump()` is limited to dump out debug information, in particular
  the AST

- Resolver/coercer/normalizer have all merged into a new, single pass called
  "resolver". The previous separation was confusing and hard to maintain.

- The passes that were formerly in `compiler/visitors/*.cc` have moved up a
  level and gained their own header files to declare their entry points. No
  more `global.h` either.

- No cycles are allowed inside the AST. That means a child cannot point back to
  a node that's already elsewhere in the AST. That also means, each node does
  have a unique parent. When adding a node to an AST that already has a parent,
  the node is deep-copied automatically (see next bullet). In case something
  goes wrong, debug builds run a cycle detector to catch problems.

- When adding a node to an AST (i.e., as the child of another Node), there are
  two cases:

    1. It's a new node instance that does not have a parent yet. In that case,
    we directly add a pointer to that Node as the child.

    2. It's a node instance that already has a different parent. In that case,
    we deep-copy the node first and add a pointer to the copy as the child.

   Note that when wanting to *move* a node from one location inside the AST to
   another, you can avoid the deep-copy by first removing it from its old
   parent (which will clear the parent pointer) and then adding it to the new
   parent.

   Conceptually, there are two main places where this logic happens: 1. when
   manipulating the child of a Node through the corresponding Node methods; and
   2. when instantiating new Node and given them their initial children.
   Generally, this all happens automatically when going through the
   corresponding methods, which deep-copy on-demand as needed.

- If we *semantically* do need a cycle, the solution is to store a reference
  (see next bullet) to the target inside an explicit node attribute that's not
  part of the normal parent/child pointering. Examples of nodes doing this are
  `expression::Name`, `type::Name`, and `QualifiedType`. The latter has the
  notion of "external" types where the wrapped `UnqualifiedType` is not a
  child, but stored somewhere else inside the AST; the `QualifiedType` just
  stores a reference to it, which it transparently unwraps when the
  `UnqualifiedType` is requested (through `type()`).

- The "references" mentioned above are implemented through mappings inside the
  ASTContext. Currently, this is supported for declarations and types. First,
  one registers the declaration/type with the AST context. That assigns it a
  unique, stable index ID. That index acts as the reference that nodes can
  store as attributes. To dereference, one can later ask the context to provide
  the declaration/type that corresponds to the stored index.

- Non-abstract classes derived from Node:
    - Public `create()` factory
    - Protected constructors, called from `create()`
    - Protected copy/move constructors/assignment that perform shallow copies.

- We do type comparison by computing a serialized version of each type, call
  "type unification". We  then comparing those unifications as strings. An
  empty unification always compares false against anything else.

- When comparing types, we distinguish between types that are compared
  structurally (most) and by name ("name types"; e.g., struct and enums). A
  virtual type method `isNameType()` indicates of which kind a type is.
@rsmmr rsmmr force-pushed the topic/robin/node-rewrite branch from dd7e633 to 2843729 Compare March 4, 2024 09:01
@rsmmr rsmmr merged commit b1c3c4e into main Mar 4, 2024
10 of 19 checks passed
@rsmmr rsmmr deleted the topic/robin/node-rewrite branch March 4, 2024 09:12
bbannier added a commit that referenced this pull request Jul 16, 2024
We previously would evaluate whether a unit field or method was required
by an optional feature before we had finished collecting all feature
requirements. This behavior was fine when we were visiting individual
modules' AST on by one, but breaks with #1462 were we changed using a
single AST to hold all modules.

This patch defers transformations until all feature requirements have
been collected.
bbannier added a commit that referenced this pull request Jul 16, 2024
We previously would evaluate whether a unit field or method was required
by an optional feature before we had finished collecting all feature
requirements. This behavior was fine when we were visiting individual
modules' AST one by one, but breaks with #1462 were we changed using a
single AST to hold all modules.

This patch defers transformations until all feature requirements have
been collected.
bbannier added a commit that referenced this pull request Jul 16, 2024
We previously would evaluate whether a unit field or method was required
by an optional feature before we had finished collecting all feature
requirements. This behavior was fine when we were visiting individual
modules' AST one by one, but breaks with #1462 were we changed using a
single AST to hold all modules.

This patch defers transformations until all feature requirements have
been collected.
bbannier added a commit that referenced this pull request Jul 16, 2024
We previously would evaluate whether a unit field or method was required
by an optional feature before we had finished collecting all feature
requirements. This behavior was fine when we were visiting individual
modules' AST one by one, but breaks with #1462 were we changed using a
single AST to hold all modules.

This patch defers transformations until all feature requirements have
been collected.
bbannier added a commit that referenced this pull request Jul 30, 2024
We did this previously but stopped doing it with #1462.
bbannier added a commit that referenced this pull request Jul 31, 2024
We did this previously but stopped doing it with #1462.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants