[WIP] Replace EarleyParser with lexeme-based rust implementation #951
Conversation
```python
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
):
    expected_prompt_tokens = [engine.tokenizer.bos_token_id] + expected_prompt_tokens
    expected_prompt_tokens = engine.tokenizer.recode(expected_prompt_tokens)
```
@riedgar-ms would love your input on this test, especially regarding recode.
I wouldn't expect `recode()` to have any effect here. Surely a bos_token should always recode to a bos_token, no matter what follows it? If not, I would expect there to be some spectacular jailbreaks.

Shouldn't you always have `expected_prompt_tokens == prompt_tokens_1 == prompt_tokens_2`? Or at most just have to worry about slicing the bos_token off the front of the actual tokens, based on whether or not the engine has a bos_token?
Try running the test without the recode here -- some llamacpp models seem to need the recode to fix the token(s) immediately following the BOS token. It may have something to do with whitespace preceding the first bit of text.

And then we have to allow for the tokens that get passed to the model to have one or more tokens sliced off the end because of token healing.
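Putting the thread together, a hedged sketch of the expectation being tested (names follow the diff above; `actual_prompt_tokens` is hypothetical shorthand for the tokens the engine actually received):

```python
# Build the "expected" prompt tokens the same way the engine would.
expected_prompt_tokens = engine.tokenizer.encode(prompt.encode())
if (
    engine.tokenizer.bos_token is not None
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
):
    expected_prompt_tokens = [engine.tokenizer.bos_token_id] + expected_prompt_tokens
    # Some llamacpp tokenizers merge/split the token(s) right after BOS;
    # recode() re-canonicalizes the sequence to account for that.
    expected_prompt_tokens = engine.tokenizer.recode(expected_prompt_tokens)

# Token healing may slice one or more tokens off the end of what actually
# reaches the model, so only a prefix comparison is safe.
assert actual_prompt_tokens == expected_prompt_tokens[: len(actual_prompt_tokens)]
```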
```python
expected_prompt_tokens = engine.tokenizer.encode(prompt.encode())
if (
    engine.tokenizer.bos_token is not None
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
```
Is there some reason this can't be `expected_prompt_tokens[0] != engine.tokenizer.bos_token_id`?
Mmm, not here, you are right. This was a copy-and-paste from elsewhere, where it DOES matter (where the `prompt` string could have been empty).
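For completeness, the two forms only differ when the token list can be empty:

```python
bos_token_id = 1  # placeholder value

tokens: list[int] = []
assert tokens[:1] != [bos_token_id]  # slicing an empty list safely yields []
try:
    tokens[0] != bos_token_id
except IndexError:
    pass  # indexing an empty list raises
```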
If we're switching to JSON serialisation, do we have a schema defined for that?
It would be good to have some pydantic schemas defined somewhere. Currently the source of truth is in the rust code here:
Looks good, great work!! :) Only one main question:

> `regex(r"\d*") + "7"`

This seems like it would be very surprising, since the intention of the first line is clearly to match numbers that end in 7. Seems like an issue arising from the lexeme boundaries. I wonder if this could go away by greedily consuming things after a lexeme as long as the expression remains regular? ...not a blocking issue though, excited to see a merge :)
Another note:

> it can be terminated by an `EOS` token, which stops with lazy semantics

Yes, that is needed for variable length patterns, since you don't want to force the model to stop early.
Merged -- thanks @hudson-ai, @mmoskal and everyone who took time to review this (@nking-1 @nopdive @riedgar-ms @slundberg) :)
To preface this all, I want to note that the overall user experience of `guidance` should change minimally, if at all. This is primarily a behind-the-scenes change that offers `guidance`:
- performance improvements via the Rust-based `llguidance` library
- a factoring-out of the core pieces of `guidance` (mainly the implementation of the parser and its interactions with the tokenizer) so that they can be reused in the server-side implementation of the `AzureGuidance` endpoints (ensuring that behavior between these remote models and local `guidance`-controlled models is as consistent as possible)
Parser
- Removed the `EarleyParser` class that operated at the byte level.
- Added a `TokenParser` class that operates at the token level, wrapping the `LLInterpreter` class from `llguidance`.
- Added a `ByteParser` class which wraps a `TokenParser` and a `ByteTokenizer` (a tokenizer with tokens directly corresponding to individual bytes plus a `BOS`/`EOS` token); this is used to support the `match` method (and the `Mock` model class?). See the sketch after this section.

Engine
- Simplified the `Engine` class in favor of directly using the mask from the `TokenParser`.
- This PR keeps this behavior, but note THIS LIKELY DOES NOT ALIGN WITH AZURE SERVER-SIDE IMPLEMENTATION. To ensure that this behavior is maintained (maybe worth a discussion), it should be moved into llguidance (@mmoskal)
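For reference, a hedged sketch of the `match` path mentioned in the Parser notes above (assuming grammar objects expose a `match` method roughly as in current guidance, returning a match object on success and `None` on failure):

```python
from guidance import gen

grammar = gen(regex=r"[0-9]+")

# Under the hood this goes through ByteParser + ByteTokenizer,
# feeding the string to the parser byte by byte.
assert grammar.match("123") is not None
assert grammar.match("12a") is None
```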
Serialization
- The `LLInterpreter` underlying the `TokenParser` expects JSON-serialized grammars, as do the `AzureGuidance` endpoints; `RemoteEngine` now expects this serialization format (no more protobuf).
- `LLInterpreter` returns JSON-serialized data (not exactly the same as what's returned by the `AzureGuidance` endpoints, but there is a lot of shared structure).
- `_schema.py` contains `pydantic` schemas used to validate/parse these response structures.
- `EngineCallResponse` is now JSON serialized/validated with `pydantic` (no more protobuf).
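As a rough illustration only (a hypothetical sketch, not the actual contents of `_schema.py`), the pydantic-based validation looks something like:

```python
from pydantic import BaseModel


class EngineCallResponse(BaseModel):
    # Illustrative fields only; see _schema.py for the real schema.
    new_bytes: bytes
    is_generated: bool
    new_token_count: int


# JSON from the engine is parsed and validated in one step:
response = EngineCallResponse.model_validate_json(
    '{"new_bytes": "hi", "is_generated": true, "new_token_count": 1}'
)
assert response.new_bytes == b"hi"
```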
New primitives:
- `Gen`
- `Lexeme`
- `Subgrammar`
- `RegularGrammar`
To understand the new primitives, we need to understand how the new parser differs from the old one. While both are Earley parsers that support general context-free grammars, the smallest "atoms" the new parser works with are more coarse-grained than the old one's: the new parser works with lexemes, while the old one worked with bytes.
Roughly, lexemes correspond to regular expressions (and string literals). These are larger chunks of text, making the parser more efficient in a lot of cases. Because lexemes are regular, the lexer (the lexeme sub-parser) can run much more quickly than the outer Earley parser.
While the Earley parser is able to handle ambiguities (e.g. `one_or_more("a") + one_or_more(select("a", "b"))` -- which expression is responsible for the second "a" in "aab"?), the lexer can't. We need a deterministic set of rules that tells us how any given string should be lexed (which lexemes are responsible for which parts of the text).

Lexemes can be lazy or greedy. A lazy `r"a+"` will only ever generate a single "a", and a lazy `r"a*"` will only ever generate the empty string. A greedy `r"a+"` can produce as many "a"s as it wants before moving on to the next lexeme.

Gen
`gen` is composed of two sub-expressions: the "body" regex and the (optional) "stop" regex. If no body regex is passed, it defaults to `r"(?s:.*)"`, i.e. `.*` that additionally matches the newline character.

When the stop regex is provided, `gen` behaves as a lazy lexeme. As soon as the full body+stop regex matches the generated text, we exit the `gen` (discarding the "stop" text) and move on to the next lexeme. This ensures that `gen` actually stops when the stop expression is produced.

When no `stop` regex is provided, it behaves as a greedy lexeme (with one caveat: it can be terminated by an `EOS` token, which stops with lazy semantics). Note that `regex` is now just an alias for `gen` with no stop expression.

Examples:
- `gen(regex=r"[0-9]+") + "xyz"`
- `gen() + "xyz"` -- the body regex `r"(?s:.*)"` has not yet failed to match, so the parser cannot force "xyz". Note that current guidance ALSO won't force "yz", as the parser will be in a "superposition" that doesn't know whether or not the `gen` has completed.
- `gen(regex=r"[0-9]+") + "123"`
- `gen(regex=r"[0-9]+", stop="123")`
The subtle changes around EOS should be a fairly small detail.
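A hedged sketch of the two modes in use (backend and model path are illustrative):

```python
from guidance import gen, models

lm = models.LlamaCpp("model.gguf")  # any local backend; illustrative

# Greedy: digits keep coming until the model emits a non-digit or EOS,
# after which the forced literal "xyz" follows.
greedy = lm + gen(name="num", regex=r"[0-9]+") + "xyz"

# Lazy: generation ends the moment body+stop matches; the stop text
# "END" is discarded rather than kept in the output.
lazy = lm + gen(name="text", stop="END")
```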
Lexeme
Not (yet) part of the "public" API (available via `guidance._grammar.Lexeme` or `guidance.library._subgrammar.lexeme`). Should only really be used when writing `Subgrammar`s / translating EBNF.
- TODO: allow `lexeme` to support a `contextual` flag (more on that later)

Subgrammar
Not (yet) part of the "public" API (available via `guidance._grammar.Subgrammar` or `guidance.library._subgrammar.subgrammar`). Mostly exists to better support generating programming languages.
- A subgrammar acts as a single opaque unit that can be terminated by whatever follows it: e.g. if `json` is a subgrammar, json() + "```" will terminate if a backtick is generated after some valid JSON (see the sketch after this list).
- The `"ignore_regex"` kwarg specifies a regular expression that will be "ignored" between lexemes. This can be used to allow flexible whitespace when generating JSON or code, for example.
- `json` has been reimplemented as a `Subgrammar`.
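A hedged sketch of the intended use (the `body` and `ignore_regex` parameter names follow this description; the exact signature may differ):

```python
from guidance.library._subgrammar import lexeme, subgrammar

# A toy "assignment" language: identifier = number, with flexible
# whitespace ignored between lexemes via ignore_regex.
assign = subgrammar(
    body=lexeme(r"[A-Za-z_]\w*") + "=" + lexeme(r"[0-9]+"),
    ignore_regex=r"[ \t]*",
)

# From the outside the subgrammar acts as one unit, so appending a
# terminator behaves like the json() + "```" example above.
grammar = assign + ";"
```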
RegularGrammar
Not (yet) part of the "public" API (available via `guidance._grammar.RegularGrammar` or `guidance.library._grammar.as_regular_grammar`).

NOTE: "manually" building regex-esque grammars should now be discouraged. For example, `select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9"))])` should be rewritten as `regex(r"0|(?:[1-9][0-9]*)")`. This is because the lexemes here are individual characters, requiring the expensive Earley parser to run. Rewriting as a `regex` makes the entire grammar into a single lexeme, allowing the cheap lexer to do all the work.

If directly writing the regex is not possible, `as_regular_grammar` (name subject to change) can wrap a grammar like the `select` above and (try to) convert it into a regex lexeme. Grammars that are not regular will fail this construction. In the future, it would be nice to automatically wrap grammars when we can, preventing users from having to think about this.
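Concretely, a hedged sketch (the `char_range`/`zero_or_more` import paths are assumptions):

```python
from guidance import regex, select
from guidance.library import char_range, zero_or_more
from guidance.library._grammar import as_regular_grammar

# Character-level lexemes: forces the expensive Earley parser to work.
slow = select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9"))])

# One regex lexeme: the cheap lexer does all the work.
fast = regex(r"0|(?:[1-9][0-9]*)")

# Or convert automatically; fails if the grammar is not regular.
also_fast = as_regular_grammar(slow)
```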
Deprecations:
- `commit_point` (raises `NotImplementedError`; may reimplement in the future?). Commit points were used in `gen` to support the current "stop" mechanics and in tool calling; for now, `gen` does not support tool calling (working on this, hopefully will have something working before this PR goes through).

Biggest gotchas / changes from current guidance:
- `regex(r"\d*") + "7"` behaves surprisingly: the intention is clearly "numbers that end in 7", but the lexeme boundary between the greedy `r"\d*"` and the literal "7" breaks that reading (see the review discussion above and the sketch below)
- `r"\d"` now matches unicode digits

TODOs:
- `Gen`s or `Subgrammar`s
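To make the first gotcha concrete, a hedged sketch:

```python
from guidance import regex

# Intended reading: "any digits, ending in 7". With lexeme-based parsing,
# the greedy r"\d*" lexeme wants to consume the trailing 7 itself, so the
# two-part grammar no longer behaves like the single regex below.
surprising = regex(r"\d*") + "7"

# Safer: put the whole pattern in one lexeme.
intended = regex(r"\d*7")
```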