docs: Misc formatting and text fixes, add some internal links
docs(Theory): Fix NFA state number errors, spelling, and add alt-texts to images

docs(Regex): fix "See also" links, minor punctuation and grammar fixes

docs(Home): minor grammar fix

docs(Validator): "See also" refs in `_validator` docstrings, improve readability

docs(Tokenizers): mainly formatting fixes, a couple grammar fixes too

docs: link to the simplified FASTA example when it's referred to

docs(Parsing buffers): clarify `onfinal!`, minor grammar fixes

docs(io,custom): minor grammar fixes, linkify reference to other section
digital-carver authored and jakobnissen committed Nov 11, 2023
1 parent 01f608e commit 2d30fac
Showing 13 changed files with 66 additions and 56 deletions.
8 changes: 4 additions & 4 deletions docs/src/custom.md
Original file line number Diff line number Diff line change
@@ -8,11 +8,11 @@ end

# Customizing Automa's code generation
Automa offers a few ways of customising the created code.
Note that the precise code generated by automa is considered an implementation detail,
Note that the precise code generated by Automa is considered an implementation detail,
and as such is subject to change without warning.
Only the overall behavior, i.e. the "DFA simulation" can be considered stable.

Nonetheless, it is instructive to look at the code generated for the machine in the "parsing from a buffer" section.
Nonetheless, it is instructive to look at the code generated for the machine in the "parsing from a buffer" [section](@ref "Creating our parser").
I present it here cleaned up and with comments for human inspection.

```julia
@@ -142,7 +142,7 @@ end
```

The first improvement is to the algorithm itself: Instead of of parsing to a vector of `Seq`,
The first improvement is to the algorithm itself: Instead of parsing to a vector of `Seq`,
I'm simply going to index the input data, filling up an existing vector of:

```jldoctest custom1; output = false
@@ -205,4 +205,4 @@ Now the code parses the same 45 MB FASTA file in 11.14 milliseconds, parsing at a
```@docs
Automa.CodeGenContext
Automa.Variables
```
```
2 changes: 1 addition & 1 deletion docs/src/index.md
@@ -25,7 +25,7 @@ They're structured like a tutorial, beginning from the simplest use of Automa an
If you'd like to dive straight in, you might want to start by reading the examples below, then go through the examples in the `examples/` directory in the Automa repository.

## Examples
### Validate some text only is composed of ASCII alphanumeric characters
### Validate some text is composed only of ASCII alphanumeric characters
```jldoctest; output = false
generate_buffer_validator(:validate_alphanumeric, re"[a-zA-Z0-9]*") |> eval
2 changes: 1 addition & 1 deletion docs/src/io.md
@@ -112,7 +112,7 @@ false

!!! danger
The following code is only for demonstration purposes.
It has several one important flaw, which will be adressed in a later section, so do not copy-paste it for serious work.
It has one important flaw, which will be addressed in a later section, so do not copy-paste it for serious work.

There are a few more subtleties related to the `generate_reader` function.
Suppose we instead want to create a function that reads a single FASTA record from an IO.
12 changes: 6 additions & 6 deletions docs/src/parser.md
@@ -7,12 +7,12 @@ end
```

# Parsing from a buffer
Automa can leverage metaprogramming to combine regex and julia code to create parsers.
Automa can leverage metaprogramming to combine regex and Julia code to create parsers.
This is significantly more difficult than simply using validators or tokenizers, but still simpler than parsing from an IO.
Currently, Automa loads data through pointers, and therefore needs data backed by `Array{UInt8}` or `String` or similar - it does not work with types such as `UnitRange{UInt8}`.
Furthermore, be careful about passing strided views to Automa - while Automa can extract a pointer from a strided view, it will always advance the pointer one byte at a time, disregarding the view's stride.

As an example, let's use the simplified FASTA format intoduced in the regex section, with the following format: `re"(>[a-z]+\n([ACGT]+\n)+)*"`.
As an example, let's use the simplified FASTA format [introduced in the regex section](@ref fasta_example), with the following format: `re"(>[a-z]+\n([ACGT]+\n)+)*"`.
We want to parse it into a `Vector{Seq}`, where `Seq` is defined as:

```jldoctest parse1
@@ -32,7 +32,7 @@ Currently, actions can be added in the following places in a regex:
* With `onenter!`, meaning it will be executed when reading the first byte of the regex
* With `onfinal!`, where it will be executed when reading the last byte of the regex.
Note that it's not possible to determine the final byte for some regex like `re"X+"`, since
the machine reads only 1 byte at a time and cannot look ahead.
the machine reads only 1 byte at a time and cannot look ahead. In such cases, an error is raised.
* With `onexit!`, meaning it will be executed on reading the first byte AFTER the regex,
or when exiting the regex by encountering the end of inputs (only for a regex match, not an unexpected end of input)
* With `onall!`, where it will be executed when reading every byte that is part of the regex.
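As a minimal sketch of how these attachment points are used (the action names below are placeholders chosen for illustration; the regex is the simplified FASTA record from this page):

```julia
using Automa

# Attach placeholder actions to the simplified FASTA record regex.
record = re">[a-z]+\n([ACGT]+\n)+"
onenter!(record, :mark_record)  # runs when the first byte (the '>') is read
onall!(record, :count_byte)     # runs on every byte that is part of the record
onexit!(record, :emit_record)   # runs on the first byte after the record, or at EOF
```

Note that `onfinal!` on this regex would presumably raise an error, since the last byte of a `+`-repeat cannot be determined one byte at a time.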
@@ -56,10 +56,10 @@ julia> my_regex isa RE
true
```

When the the following regex' actions are visualized in its corresponding [DFA](theory.md#deterministic-finite-automata):
When the following regex's actions are visualized in its corresponding [DFA](theory.md#deterministic-finite-automata):

```julia
regex =
regex = let
ab = re"ab*"
onenter!(ab, :enter_ab)
onexit!(ab, :exit_ab)
@@ -215,7 +215,7 @@ ERROR: Ambiguous NFA.
```

Why does this error? Well, remember that Automa processes one byte at a time, and at each byte, makes a decision on what actions to execute.
Hence, if it sees the input `>a\nA\n`, it does not know what to do when encountering the second `\n`. If the next byte e,g. `A`, then it would need to execute the `:seqline` action. If the byte is `>`, it would need to execute first `:seqline`, then `:record`.
Hence, if it sees the input `>a\nA\n`, it does not know what to do when encountering the second `\n`. If the next byte is e.g. `A`, then it would need to execute the `:seqline` action. If the byte is `>`, it would need to execute first `:seqline`, then `:record`.
Automa can't read ahead, so, the regex is ambiguous and the true behaviour when reading the inputs `>a\nA\n` is undefined.
Therefore, Automa refuses to compile it.

4 changes: 2 additions & 2 deletions docs/src/reader.md
@@ -17,7 +17,7 @@ Hence, while a B record can appear at any time, once you've seen a B record, the
When reading records from the file, you must be able to store whether you've seen a B record.

We address this by creating a `Reader` type which wraps the IO being parsed, and which stores any state we want to preserve between records.
Let's stick to our simplified FASTA format parsing sequences into `Seq` objects:
Let's stick to our [simplified FASTA format](@ref fasta_example) parsing sequences into `Seq` objects:

```jldoctest reader1; output = false
struct Seq
@@ -120,4 +120,4 @@ Seq("tag", "GAGATATA")
julia> read_record(reader)
ERROR: EOFError: read end of file
```
```
14 changes: 7 additions & 7 deletions docs/src/regex.md
@@ -13,15 +13,15 @@ They are made using the `@re_str` macro, like this: `re"ABC[DEF]"`.
Automa regex matches individual bytes, not characters. Hence, `re"Æ"` (with the UTF-8 encoding `[0xc3, 0x86]`) is equivalent to `re"\xc3\x86"`, and is considered the concatenation of two independent input bytes.
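This byte-level view is easy to verify in plain Julia:

```julia
# 'Æ' encodes to two bytes in UTF-8, which is why Automa treats re"Æ"
# as the concatenation of the two byte-matches \xc3 and \x86.
bytes = collect(codeunits("Æ"))
bytes == UInt8[0xc3, 0x86]  # true
```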

The `@re_str` macro supports the following content:
* Literal symbols, such as `re"ABC"`, `re"\xfe\xa2"` or `re"Ø"`
* Literal symbols, such as `re"ABC"`, `re"\xfe\xa2"` or `re"Ø"`.
* `|` for alternation, as in `re"A|B"`, meaning "`A` or `B`".
* Byte sets with `[]`, like `re"[ABC]"`.
This means any of the bytes in the brackets, e.g. `re"[ABC]"` is equivalent to `re"A|B|C"`.
* Inverted byte sets, e.g. `re"[^ABC]"`, meaning any byte except those in `re"[ABC]"`.
* Repetition, with `X*` meaning zero or more repetitions of X
* `+`, where `X+` means `XX*`, i.e. 1 or more repetitions of X
* `?`, where `X?` means `X | ""`, i.e. 0 or 1 occurrences of X. It applies to the last element of the regex
* Parentheses to group expressions, like in `A(B|C)?`
* Repetition, with `X*` meaning zero or more repetitions of X.
* `+`, where `X+` means `XX*`, i.e. 1 or more repetitions of X.
* `?`, where `X?` means `X | ""`, i.e. 0 or 1 occurrences of X. It applies to the last element of the regex.
* Parentheses to group expressions, like in `re"A(B|C)?"`.

You can combine regex with the following operations:
* `*` for concatenation, with `re"A" * re"B"` being the same as `re"AB"`.
@@ -33,9 +33,9 @@ You can combine regex with the following operations:
* `!` for inversion, such that `!re"[A-Z]"` matches all other strings than those which match `re"[A-Z]"`.
Note that `!re"a"` also matches e.g. `"aa"`, since this does not match `re"a"`.

Finally, the funtions `opt`, `rep` and `rep1` is equivalent to the operators `?`, `*` and `+`, so i.e. `opt(re"a" * rep(re"b") * re"c")` is equivalent to `re"(ab*c)?"`.
Finally, the functions `opt`, `rep` and `rep1` are equivalent to the operators `?`, `*` and `+`, so e.g. `opt(re"a" * rep(re"b") * re"c")` is equivalent to `re"(ab*c)?"`.
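This equivalence can be spot-checked with `generate_buffer_validator`, introduced on the index page; per its docstring, the generated function returns `nothing` on a match:

```julia
using Automa

# Validator for opt(re"a" * rep(re"b") * re"c"), i.e. the language of re"(ab*c)?".
generate_buffer_validator(:validate_abc, opt(re"a" * rep(re"b") * re"c")) |> eval

validate_abc("abbbc") === nothing  # matches
validate_abc("") === nothing       # the empty string matches, because of `opt`
validate_abc("ab")                 # missing the trailing 'c', so no match
```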

## Example
## [Example](@id fasta_example)
Suppose we want to create a regex that matches a simplified version of the FASTA format.
This "simple FASTA" format is defined like so:

40 changes: 20 additions & 20 deletions docs/src/theory.md
@@ -43,32 +43,32 @@ or labeled with one or more input symbols, in which the machine may traverse the

To illustrate, let's look at one of the simplest regex: `re"a"`, matching the letter `a`:

![](figure/simple.png)
![State diagram showing state 1, edge transition consuming input 'a', leading to "accept state" 2](figure/simple.png)

You begin at the small dot on the right, then immediately go to state 1, the cirle marked by a `1`.
You begin at the small dot on the right, then immediately go to state 1, the circle marked by a `1`.
By moving to the next state, state 2, you consume the next symbol from the input string, which must be the symbol marked on the edge from state 1 to state 2 (in this case, an `a`).
Some states are "accept states", illustrated by a double cirle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.
Some states are "accept states", illustrated by a double circle. If you are at an accept state when you've consumed all symbols of the input string, the string matches the regex.

Each of the operaitons that combine regex can also combine NFAs.
Each of the operations that combine regex can also combine NFAs.
For example, given the two regex `a` and `b`, which correspond to the NFAs `A` and `B`, the regex `a * b` can be expressed with the following NFA:

![](figure/cat.png)
![State diagram showing ϵ transition from state A to accept state B](figure/cat.png)

Note the `ϵ` symbol on the edge - this signifies an "epsilon transition", meaning you move directly from `A` to `B` without consuming any symbols.

Similarly, `a | b` corresponds to this NFA structure...

![](figure/alt.png)
![State diagram of the NFA for `a | b`](figure/alt.png)

...and `a*` to this:

![](figure/kleenestar.png)
![State diagram of the NFA for `a*`](figure/kleenestar.png)

For a larger example, `re"(\+|-)?(0|1)*"` combines alternation, concatenation and repetition and so looks like this:

![](figure/larger.png)
![State diagram of the NFA for `re"(\+|-)?(0|1)*"`](figure/larger.png)

ϵ-transitions means that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 12.
ϵ-transitions means that there are states from which there are multiple possible next states, e.g. in the larger example above, state 1 can lead to state 2 or state 8.
That's what makes NFAs nondeterministic.

In order to match a regex to a string then, the movement through the NFA must be emulated.
@@ -78,12 +78,12 @@ If an ϵ-edge is encountered from state `A` that leads to states `B` and `C`, th

For example, if the regex `re"(\+|-)?(0|1)*"` visualized above is matched to the string `-11`, this is what happens:
* NFA starts in state 1
* NFA immediately moves to all states reachable via ϵ transition. It is now in state {3, 5, 7, 9, 10}.
* NFA sees input `-`. States {5, 7, 9, 10} do not have an edge with `-` leading out, so these states die.
* NFA immediately moves to all states reachable via ϵ transition. It is now in state {2, 3, 5, 7, 8, 9, 10}.
* NFA sees input `-`. States {2, 3, 4, 5, 7, 8, 10} do not have an edge with `-` leading out, so these states die.
Therefore, the machine is in state 9, consumes the input, and moves to state 2.
* NFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 5, 7}
* NFA sees input `1`, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 5, 7}
* The above point repeats, NFA is still in state {3, 5, 7}
* NFA immediately moves to all states reachable from state 2 via ϵ transitions, so goes to {3, 4, 5, 7}
* NFA sees input `1`, must be in state 5, moves to state 6, then through ϵ transitions to state {3, 4, 5, 7}
* The above point repeats, NFA is still in state {3, 4, 5, 7}
* Input ends. Since state 3 is an accept state, the string matches.
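The set-of-live-states bookkeeping above can be sketched in plain Julia. This toy NFA is hand-written and independent of Automa, with a deliberately smaller state numbering than the figure:

```julia
# A toy NFA for (+|-)?(0|1)*: state 1 is the start, state 2 accepts.
# `eps` holds ϵ-transitions, `step` holds symbol-consuming transitions.
eps  = Dict(1 => [2])
step = Dict((1, '+') => [2], (1, '-') => [2],
            (2, '0') => [2], (2, '1') => [2])
accept = Set([2])

# ϵ-closure: expand a state set with everything reachable via ϵ-edges.
function eclose(states)
    stack, closed = collect(states), Set(states)
    while !isempty(stack)
        for t in get(eps, pop!(stack), Int[])
            t in closed || (push!(closed, t); push!(stack, t))
        end
    end
    closed
end

# Simulate: track all live states; a state with no matching edge dies.
function nfa_matches(input)
    states = eclose(Set([1]))
    for c in input
        next = Set{Int}()
        for s in states
            union!(next, get(step, (s, c), Int[]))
        end
        isempty(next) && return false  # every live state died
        states = eclose(next)
    end
    !isdisjoint(states, accept)
end

nfa_matches("-11")  # true
nfa_matches("+-")   # false: no state has a '-' edge once the sign is consumed
```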

Using only a regex-to-NFA converter, you could create a simple regex engine simply by emulating the NFA as above.
@@ -97,19 +97,19 @@ In other words, every input symbol _must_ trigger one unambiguous state transition

Let's visualize the DFA equivalent to the larger NFA above:

![](figure/large_dfa.png)
![State diagram of the DFA for `re"(\+|-)?(0|1)*"`](figure/large_dfa.png)

It might not be obvious, but the DFA above accepts exactly the same inputs as the previous NFA.
DFAs are way simpler to simulate in code than NFAs, precisely because at every state, for every input, there is exactly one action.
DFAs can be simulated either using a lookup table, of possible state transitions,
DFAs can be simulated either using a lookup table of possible state transitions,
or by hardcoding GOTO-statements from node to node when the correct input is matched.
Code simulating DFAs can be ridiculously fast, with each state transition taking less than 1 nanosecond, if implemented well.
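The lookup-table flavour can be sketched in a few lines of plain Julia (again a hand-written toy, not Automa's generated code):

```julia
# Table-driven DFA for (+|-)?(0|1)*: state 1 = start, state 2 = after a sign
# or digit. A missing table entry means the dead state: no possible transition.
const DFA_TABLE = Dict(
    (1, '+') => 2, (1, '-') => 2,
    (1, '0') => 2, (1, '1') => 2,
    (2, '0') => 2, (2, '1') => 2,
)
const DEAD = 0

function dfa_matches(input)
    state = 1
    for c in input
        # Exactly one unambiguous transition per input symbol - this is
        # what makes a DFA so cheap to simulate.
        state = get(DFA_TABLE, (state, c), DEAD)
        state == DEAD && return false
    end
    state == 1 || state == 2  # both accept: "" and a lone sign also match
end

dfa_matches("-101")  # true
dfa_matches("1+")    # false
```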

Furthermore, DFAs can be optimised.
Two edges between the same nodes with labels `A` and `B` can be collapsed to a single edge with labels `[AB]`, and redundant nodes can be collapsed.
The optimised DFA equivalent to the one above is simply:

![](figure/large_machine.png)
![State diagram of the simpler DFA for `re"(\+|-)?(0|1)*"`](figure/large_machine.png)

Unfortunately, as the name "powerset construction" hints, converting an NFA with N nodes may result in a DFA with up to 2^N nodes.
This inconvenient fact drives important design decisions in regex implementations.
@@ -130,13 +130,13 @@ If the cache is flushed too often, it falls back to simulating the NFA directly.
Such an approach is necessary for `ripgrep`, because the regex -> NFA -> DFA compilation happens at runtime and must be near-instantaneous, unlike Automa, where it happens during package precompilation and can afford to be slow.

## Automa in a nutshell
Automa simulates the DFA by having the DFA create a Julia Expr, which is then used to generate a Julia function using metaprogramming.
Automa simulates the DFA by having the DFA create a Julia `Expr`, which is then used to generate a Julia function using metaprogramming.
Like all other Julia code, this function is then optimized by Julia and then LLVM, making the DFA simulations very fast.

Because Automa just constructs Julia functions, we can do extra tricks that ordinary regex engines cannot:
We can splice arbitrary Julia code into the DFA simulation.
Currently, Automa supports two such kinds of code: _actions_, and _preconditions_.

Actions are Julia code that is executed during certain state transitions.
Preconditions are Julia code, that evaluates to a `Bool` value, and which is checked before a state transition.
If it evaluates to `false`, the transition is not taken.
Preconditions are Julia code, that evaluates to a `Bool` value, and which are checked before a state transition.
If a precondition evaluates to `false`, the transition is not taken.
4 changes: 2 additions & 2 deletions docs/src/tokenizer.md
@@ -28,7 +28,7 @@ Breaking the text down to its tokens is called tokenization or lexing.
Note that lexing in itself is not sufficient to parse the format:
Lexing is _context unaware_ and doesn't understand syntax, so e.g. the text `"((A` can be perfectly well tokenized to `quote lparens lparens A`, even if it's invalid syntax.

The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier to, for example, to search for letters in the input.
The purpose of tokenization is to make subsequent parsing easier, because each part of the text has been classified. That makes it easier to, for example, search for letters in the input.
Instead of having to muck around with regex to find the letters, you use regex once to classify all text.

## Making and using a tokenizer
Expand Down Expand Up @@ -87,7 +87,7 @@ julia> collect(tokenize(UInt32, "XY!!)"))
(5, 1, 0x00000002)
```

Both `tokenize` and `make_tokenizer` takes an optional argument `version`, which is `1` by default.
Both `tokenize` and `make_tokenizer` take an optional argument `version`, which is `1` by default.
This sets the last parameter of the `Tokenizer` struct - for example, `make_tokenizer(tokens::Vector{RE}; version=5)`
defines `Base.iterate` for `Tokenizer{UInt32, D, 5}`.
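A sketch of how that might look in practice (the token regexes here are illustrative, not from the manual):

```julia
using Automa

# Define a second, independent tokenizer in the same session by bumping
# `version`, so its generated `Base.iterate` method does not clash with
# the default version-1 tokenizer.
make_tokenizer([re"[a-z]+", re"[0-9]+"]; version=2) |> eval
```

Per the paragraph above, `tokenize` then takes the same `version` argument to select which generated method to use.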

2 changes: 1 addition & 1 deletion docs/src/validators.md
@@ -11,7 +11,7 @@ The simplest use of Automa is to simply match a regex.
It's unlikely you are going to want to use Automa for this instead of Julia's built-in regex engine PCRE, unless you need the extra performance that Automa brings over PCRE.
Nonetheless, it serves as a good starting point to introduce Automa.

Suppose we have the FASTA regex from the regex page:
Suppose we have the FASTA regex [from the regex page](@ref fasta_example):

```jldoctest val1
julia> fasta_regex = let
5 changes: 4 additions & 1 deletion src/codegen.jl
@@ -120,8 +120,11 @@ Generate code that, when evaluated, defines a function named `name`, which takes
single argument `data`, interpreted as a sequence of bytes.
The function returns `nothing` if `data` matches `Machine`, else the index of the first
invalid byte. If the machine reached unexpected EOF, returns `0`.
If `goto`, the function uses the faster but more complicated `:goto` code.
If `goto`, the function uses the faster but more complicated `:goto` code.\\
If `docstring`, automatically create a docstring for the generated function.
See also: [`generate_io_validator`](@ref)
"""
function generate_buffer_validator(
name::Symbol,
2 changes: 1 addition & 1 deletion src/re.jl
@@ -38,7 +38,7 @@ julia> compile(regex) isa Automa.Machine
true
```
See also: `[@re_str](@ref)`, `[@compile](@ref)`
See also: [`@re_str`](@ref), [`compile`](@ref Main.Automa.compile)
"""
mutable struct RE
head::Symbol
5 changes: 4 additions & 1 deletion src/stream.jl
@@ -108,10 +108,13 @@ to the regex, without executing any actions.
If the input conforms, return `nothing`.
Else, return `(byte, (line, col))`, where `byte` is the first invalid byte,
and `(line, col)` the 1-indexed position of that byte.
If the invalid byte is a `\n` byte, `col` is 0 and the line number is incremented.
If the invalid byte is a `\\n` byte, `col` is 0 and the line number is incremented.
If the input errors due to unexpected EOF, `byte` is `nothing`, and the line and column
given is the last byte in the file.
If `goto`, the function uses the faster but more complicated `:goto` code.
See also: [`generate_buffer_validator`](@ref)
"""
function generate_io_validator(
funcname::Symbol,