Data Oriented Design for the AST #139

saem · 2021-12-29T06:59:25Z

saem
Dec 29, 2021
Maintainer

Discussion for roadmapping out the move of the compiler internals to a data oriented design approach.

(this is an evolving summary of the discussion below)

Big Idea

A data oriented design approach to compiler internals, starting first and foremost with the AST, would be a significant step towards cleaning up the code base, speeding up compilation, unlocking IC, fixing bugs, making it more approachable for others, and much much more. But that's a lot of scope for what needs to be humble beginnings.

Where to start

Currently the compiler uses PNode, PSym, PType, and a bunch of other refs to data, which are passed around, forgotten (garbage collected), copied, mutated, etc. as part of semantic analysis and backend code generation. We want to move the memory layout of the AST towards a DOD approach, which would mean a single data structure, with storage packaged/optimized for a Nim project's AST needs. Starting with PNode this would look something like this:

type
  NodeId = distinct int

  PNode* = object
    astState: AstStateRef
    id: NodeId

  AstState = object
    nodes: seq[NodeData]
    # types, syms, and all the various attributes like info in dense storage

  NodeData = object
    kind: NodeEnum             # maybe reuse NimNodeKind, not recommended
    left, right: NodeLeftRight # interpreted based on the kind

  NodeLeftRight = distinct int # meaning depends on the node kind

A few key notes:

- existing API compatibility would be kept via accessors
- copying nodes should be tracked as we can detect "forks"
- updating nodes ("production" in compiler speak) implies a few things:
  - note an id of the old stuff so we can go back (immutability)
  - where a copy is replacing a node we know this to be a join:
    - which might be generic instantiation, template/macro eval, etc
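
A minimal sketch of the "API compatibility via accessors" note, assuming a cut-down kind enum and a hypothetical `addNode` helper (neither is from the compiler). It shows a handle that carries only a state ref and an id, while `n.kind` keeps working for callers:

```nim
type
  NodeId = distinct int

  NodeKind = enum          # hypothetical stand-in for the real kind enum
    nkEmpty, nkIntLit, nkCall

  NodeData = object
    kind: NodeKind

  AstState = object
    nodes: seq[NodeData]   # dense storage, indexed by NodeId
  AstStateRef = ref AstState

  PNode = object           # small handle: which state, which node
    astState: AstStateRef
    id: NodeId

proc addNode(s: AstStateRef, kind: NodeKind): PNode =
  ## Hypothetical helper: append a node and hand back a handle to it.
  s.nodes.add NodeData(kind: kind)
  PNode(astState: s, id: NodeId(s.nodes.high))

proc kind(n: PNode): NodeKind {.inline.} =
  ## Accessor: callers keep writing `n.kind`.
  n.astState.nodes[int(n.id)].kind

proc `kind=`(n: PNode, k: NodeKind) {.inline.} =
  ## Setter: `n.kind = k` writes through to the shared storage.
  n.astState.nodes[int(n.id)].kind = k

let state = AstStateRef()
let a = state.addNode(nkIntLit)
let b = a                  # copying the handle does not copy the node data
b.kind = nkCall
doAssert a.kind == nkCall
```

Because a copy of the handle aliases the same storage, a real "copy node" has to be an explicit operation on the state, which is the point where forks could be tracked.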

Why carry around a ref to the "global" ast state instead of just keeping it as a global? This is because we have metaprogramming, which is basically compilation within compilation. The expectation is that the precise data that an ID has along with it will evolve over time and instead of pointing to "everything" it will point to the particular compilation it's a part of. This will make it easier to control environments, roll forward and backwards symbol tables, and many other useful features.

What is Data Oriented Design (DOD)?

It's not quite a formalism, so it's hard to pin down. Here is a definition by analogy, followed by some relevant references.

tl;dr: Data Oriented Design is treating your data layout and storage in memory like a database, with a slight bias towards column oriented vs row oriented storage (structs of arrays). An additional consideration: instead of refs or pointers to memory, favour opaque identifiers (array/sequence offsets or keys) that act much like primary and foreign keys would in a database. Following this approach a module should know all of its data and hand out the aforementioned ids to data with the necessary accessors for various work -- put another way a module encapsulates data and memory management.

Zig's recent work in redesigning their AST internals
Intro to DOD video for the so inclined, it's not Mike Acton
it's inspired heavily from the book by the same name
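
To make the database analogy concrete, here is a small sketch of column oriented storage with opaque ids. All names (`SymTable`, `SymId`, `addSym`) are invented for illustration and are not compiler types:

```nim
type
  SymId = distinct int       # hypothetical opaque id, acts like a primary key

  SymTable = object
    # one "column" per attribute; row i of every column describes SymId(i)
    names: seq[string]
    lines: seq[int]
    owners: seq[SymId]       # acts like a foreign key into the same table

proc addSym(t: var SymTable, name: string, line: int, owner: SymId): SymId =
  ## The module owns the storage and only hands out ids.
  t.names.add name
  t.lines.add line
  t.owners.add owner
  SymId(t.names.high)

proc name(t: SymTable, id: SymId): string = t.names[int(id)]
proc owner(t: SymTable, id: SymId): SymId = t.owners[int(id)]

var syms: SymTable
let m = syms.addSym("main", 1, SymId(0))   # the root owns itself in this sketch
let x = syms.addSym("x", 2, m)

doAssert syms.name(syms.owner(x)) == "main"

# a query that needs one attribute only walks that one column
var maxLine = 0
for line in syms.lines:
  maxLine = max(maxLine, line)
doAssert maxLine == 2
```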

haxscramper · 2021-12-29T07:27:11Z

haxscramper
Dec 29, 2021
Maintainer

why exactly reuse of the Nim node kind is not recommended?
you want to factor out subnodes into left and right parts - why not keep the old approach of sons: seq[NodeId] (possibly with small vector optimization if we find it necessary)? This would keep current (very intuitive) structure of the AST and avoid very awkward translation of the macro API.
or do you want to use the left/right nodes to denote ranges of the subnodes in the node sequence?

4 replies

haxscramper Dec 29, 2021
Maintainer

type
  TNodeData = object
    case kind*: TNodeKind:
      of nkLiteralKinds:
        token*: TokenId # Id of the single token

      of nkKindsWithSubnodes:
        start*, finish*: NodeId

saem Dec 30, 2021
Maintainer Author

why exactly reuse of the Nim node kind is not recommended?

Because NimNodeKind contains many many many kinds that have nothing to do with the parsed AST output and we might well want more precise kinds. That was a lesson from a previous attempt at a DOD approach of mine, but I'm less sure it entirely applies here.

For example, here is a list of node kinds produced by parse. Now when I went to create a data oriented AST representation before, I found having slightly different nodes was helpful, see this example of mapping block, discard, continue, break, return, and raise and how extra precision helps -- the ankXYZ enum values are the new enum kinds. These more precise nodes reduced the amount of branching and logic required later on when making decisions for queries or traversals of the AST.

This property might not matter as much today if we're aiming more to inch towards data oriented design, but it is definitely useful as we update the APIs and approach of sem and friends.

you want to factor out subnodes into left and right parts - why not keep the old approach of sons: seq[NodeId] (possibly with small vector optimization if we find it necessary)? This would keep current (very intuitive) structure of the AST and avoid very awkward translation of the macro API.

API wise I don't want people to change how they access things, so if we have code that does n.sons, then behind the scenes we'd grab all children from left to right and walk over them (left is an index into a lookaside array and right is how many), or if it's a unary or binary node then it's an easy fetch of ids from left and right. There is an example of mapping some of the current node structure into a lookaside structure here.

It breaks down to a few cases:

- no children: left and right are wasted, but this is exceptionally rare
- small literals: can just be stored in left and/or right directly (bools, chars, enums, floats, ints, even tiny strings...)
- single child: store left and ignore right
- two children: left is the first and right is the second node id
- three or more: left is the starting index into the lookaside data and right is the offset (how many); that lookaside range stores the indices of the immediate children
- large literals: left and right could be string start/stop; or left is an instance information index and right is an offset into a byte buffer; etc...

An additional optimization, if we can preserve it, is that since the first child always follows the current node, we only need to store the second through last children in the lookaside. Though I think that might be difficult to ensure in the beginning.
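
The case breakdown above can be sketched roughly as follows. This is an illustration under assumptions: a hypothetical `Shape` enum stands in for "interpreted based on the kind", and the first-child optimization is left out:

```nim
type
  NodeId = distinct int

  Shape = enum               # hypothetical: how to read left/right
    shLeaf, shUnary, shBinary, shMany

  NodeData = object
    shape: Shape
    left, right: int

  Tree = object
    nodes: seq[NodeData]
    lookaside: seq[NodeId]   # child ids for nodes with three or more children

proc children(t: Tree, id: NodeId): seq[NodeId] =
  ## Rebuild the `sons` view of a node from its left/right slots.
  let n = t.nodes[int(id)]
  case n.shape
  of shLeaf: discard                       # literal payload lives in left/right
  of shUnary: result = @[NodeId(n.left)]
  of shBinary: result = @[NodeId(n.left), NodeId(n.right)]
  of shMany:
    # left = start index in the lookaside, right = number of children
    for i in n.left ..< n.left + n.right:
      result.add t.lookaside[i]

var t: Tree
t.nodes.add NodeData(shape: shMany, left: 0, right: 3)    # node 0: three children
t.nodes.add NodeData(shape: shLeaf, left: 10)             # node 1: literal 10
t.nodes.add NodeData(shape: shLeaf, left: 20)             # node 2: literal 20
t.nodes.add NodeData(shape: shLeaf, left: 30)             # node 3: literal 30
t.nodes.add NodeData(shape: shBinary, left: 1, right: 2)  # node 4: two children
t.lookaside = @[NodeId(1), NodeId(2), NodeId(3)]

doAssert t.children(NodeId(0)).len == 3
doAssert int(t.children(NodeId(4))[1]) == 2
doAssert t.children(NodeId(1)).len == 0
```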

With all that said, looking through the old code and thinking about the excellent questions you raised, I believe we could live with an OrderedTable[NodeId, seq[NodeId]] in order to track sons for now, even though it's not "optimal", and update it later. The left/right approach will help once baseline speed matters and certain operations (snapshot/rewind/explain, some queries, etc.) need enough performance to be feasible.

or do you want to use the left/right nodes to denote ranges of the subnodes in the node sequence?

Yes that's the idea, see the case breakdown above. How much change we introduce at once will of course trade off against complexity, and the more I think about it, the more I realize the number of unknowns. 😅

saem Dec 30, 2021
Maintainer Author

alternate approach based on the above questions:

# starting at line 764 of ast.nim
  #----------------------------------------------------------------------------
  # DOD AST Draft - Start
  #----------------------------------------------------------------------------

  PNodeNew* = object
    ## initial conversion point of AST to a data oriented design; rename to
    ## `PNode` and swap out the old `PNode` for this
    id: NodeId
    stateRef: StateRef

  NodeData = object
    ## bare minimum data we need to know about every node
    kind: TNodeKind ## presently the same as the nim node
    infoId: InfoId  ## so we can share info between nodes, symbols, and types

  AstTree = OrderedTable[NodeId, seq[NodeId]]
    ## store the tree structure for nodes, this is effectively `sons`
    ## xxx: in here or separately we'll need to account for modules & fragments

  State = object
    astData: AstTree           ## actual tree structure for the various AST

    nodeList: seq[NodeData]    ## each node, NodeId is their sequence index
    nodeFlags: seq[TNodeFlags] ## flags for each node, rarely accessed but bloats

    nodeSym: SparseNodeTbl[PSym]         ## symbols for some nodes
    nodeIdt: SparseNodeTbl[PIdent]       ## identifiers for some nodes
    nodeTyp: SparseNodeTbl[PType]        ## types for some nodes
    nodeInt: SparseNodeTbl[BiggestInt]   ## int literals
    nodeFlt: SparseNodeTbl[BiggestFloat] ## float literals
    nodeStr: SparseNodeTbl[string]       ## string literals

    infoList: seq[TLineInfo]   ## shared info between nodes, symbols, and types
  StateRef = ref State

  # Actually implementing, would go something like:
  # 1. repeat this pattern for P/TSym and P/TType
  # 2. drop old P/TNode, P/TSym, and P/TType
  # 3. implement inline accessors to replace fields
  #    a. bash compiler until it compiles
  # 4. discover zany uses of the P/TNode and friends...
  #    a. bash compiler until it compiles
  # 5. run all the tests
  #    a. bash compiler until tests pass

  # IDs
  NodeId = distinct int        ## the minimum amount of data identifying a node
                               ## an index into the node sequence
  InfoId = distinct int        ## used by Nodes & Symbols for line info
                               ## an index into the info sequence

  # Convenience
  SparseNodeTbl[L] = OrderedTable[NodeId, L] ## convenience for sparse node data
                                            ## where not all nodes require it

  #----------------------------------------------------------------------------
  # DOD AST Draft - End
  #----------------------------------------------------------------------------
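
As a rough illustration of how accessors could sit on top of the draft above, here is a self-contained sketch. It assumes a cut-down `State` (kinds, tree, and int literals only) and a hypothetical `newNode` helper; the `hash` and `==` definitions are needed because a distinct `NodeId` cannot be used as a table key without them:

```nim
import std/[hashes, tables]

type
  NodeId = distinct int
  NodeKind = enum nkIntLit, nkInfix     # hypothetical stand-in for TNodeKind

  State = object
    nodeList: seq[NodeKind]                      # NodeId is the index
    astData: OrderedTable[NodeId, seq[NodeId]]   # `sons`, only for parents
    nodeInt: OrderedTable[NodeId, BiggestInt]    # sparse: int literals only
  StateRef = ref State

  PNodeNew = object
    id: NodeId
    stateRef: StateRef

proc `==`(a, b: NodeId): bool {.borrow.}
proc hash(x: NodeId): Hash = hash(int(x))

proc newNode(s: StateRef, kind: NodeKind, kids: seq[NodeId] = @[]): PNodeNew =
  ## Hypothetical helper: append a node, record children only if it has any.
  s.nodeList.add kind
  result = PNodeNew(id: NodeId(s.nodeList.high), stateRef: s)
  if kids.len > 0:
    s.astData[result.id] = kids

proc sons(n: PNodeNew): seq[PNodeNew] =
  ## Accessor replacing the old `sons` field.
  for id in n.stateRef.astData.getOrDefault(n.id):
    result.add PNodeNew(id: id, stateRef: n.stateRef)

proc intVal(n: PNodeNew): BiggestInt =
  ## Accessor replacing the old `intVal` field; 0 when the node has none.
  n.stateRef.nodeInt.getOrDefault(n.id)

let s = StateRef()
let one = s.newNode(nkIntLit)
s.nodeInt[one.id] = 1
let two = s.newNode(nkIntLit)
s.nodeInt[two.id] = 2
let sum = s.newNode(nkInfix, @[one.id, two.id])

doAssert sum.sons.len == 2
doAssert sum.sons[1].intVal == 2
doAssert one.sons.len == 0
```

Note that `sons` here returns a fresh seq of handles, so code that mutates `n.sons` in place would need a different accessor; that is one of the places where "bash compiler until it compiles" is likely to bite.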

saem Dec 31, 2021
Maintainer Author

Here is a PR with a partial implementation and conversion, the issue right now is the VM and taking addresses and dereferencing nodes: #144

haxscramper · 2021-12-29T07:34:14Z

haxscramper
Dec 29, 2021
Maintainer

Related discussion - #113 - lexer also needs to store all the tokens in order for nimpretty to work correctly, and this is best done via DOD

0 replies

saem · 2021-12-30T02:09:08Z

saem
Dec 30, 2021
Maintainer Author

Currently an issue with all these approaches is we don't know how to collect memory for the purposes of nimsuggest.

1 reply

saem Dec 30, 2021
Maintainer Author

I think the answer might be:

1. mark modules dirty (existing)
2. build a map of old ids to new ids
   a. where new ids are a re-sequenced version to remove the gaps created
3. copy into a fresh State while using the id map for conversion
4. then hand over this AST to the rest of the compiler

We can take the map building time to build up capacity hints so allocations/sequence resizing is eliminated/kept to a minimum.
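
The steps above could look something like the following sketch. Assumptions are loud here: `Node`, `State`, and `compact` are illustrative names, dirtiness is tracked per node rather than per module, and edges pointing at dirty nodes are simply dropped (in practice those nodes would be rebuilt by re-analysing the dirty modules):

```nim
type
  NodeId = distinct int

  Node = object
    payload: string          # stand-in for kind, info, etc.
    sons: seq[NodeId]

  State = object
    nodes: seq[Node]

proc compact(old: State, dirty: seq[bool]): State =
  ## Copy every clean node into a fresh State, re-sequencing ids.
  # pass 1: build the old -> new id map and count survivors (the capacity hint)
  var idMap = newSeq[int](old.nodes.len)
  var live = 0
  for i in 0 ..< old.nodes.len:
    if dirty[i]:
      idMap[i] = -1          # no new id, this slot becomes a gap
    else:
      idMap[i] = live
      inc live
  # pass 2: copy, translating child ids through the map
  result.nodes = newSeqOfCap[Node](live)
  for i, n in old.nodes:
    if dirty[i]: continue
    var moved = Node(payload: n.payload)
    for s in n.sons:
      let mapped = idMap[int(s)]
      if mapped >= 0:
        moved.sons.add NodeId(mapped)
    result.nodes.add moved

let old = State(nodes: @[
  Node(payload: "a", sons: @[NodeId(1), NodeId(2)]),
  Node(payload: "b"),
  Node(payload: "c")])
let compacted = compact(old, @[false, true, false])

doAssert compacted.nodes.len == 2
doAssert compacted.nodes[1].payload == "c"
doAssert compacted.nodes[0].sons.len == 1
doAssert int(compacted.nodes[0].sons[0]) == 1   # old id 2 became new id 1
```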

I believe this is acceptable because:

- we throw a poorly structured heap at refc today doing much the same
- access patterns are in our favour:
  - compilation discovers and processes modules by import dependencies
  - the dirty file's module is marked dirty, then all of its incoming dependencies
  - chances are the gaps will be a few contiguous runs instead of lots of fragments
- our memory representation is more compact: less total memory and memory bandwidth required

Longer term thinking as to why this might be "right", this approach generalizes to:

1. mark dirty regions
2. build old -> new id reference map
3. copy+map and compact into fresh data structure

Yes this is a form of GC, it's also a form of incremental compilation, see the overlaps with caching? None of these are a coincidence in my opinion. If we improve the compiler to increase precision of invalidation/marking things dirty, then we reduce the amount that needs to be mapped. If we know the amount to be small we can simply take a hit and defer the compaction (out of band). Whether we're doing a suggest query or instantiating generics or whatever, a flavour of this very problem shows up.

saem · 2022-05-24T02:17:45Z

saem
May 24, 2022
Maintainer Author

The sketch PR has been rebased and so far things seem to be mostly working, see: #144

0 replies
