[WIP] Replace EarleyParser with lexeme-based rust implementation #951
Conversation
```python
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
):
    expected_prompt_tokens = [engine.tokenizer.bos_token_id] + expected_prompt_tokens
    expected_prompt_tokens = engine.tokenizer.recode(expected_prompt_tokens)
```
@riedgar-ms would love your input on this test, especially regarding recode.
I wouldn't expect `recode()` to have any effect here. Surely a bos_token should always recode to a bos_token, no matter what follows it? If not, I would expect there to be some spectacular jailbreaks.

Shouldn't you always have `expected_prompt_tokens == prompt_tokens_1 == prompt_tokens_2`? Or at most just have to worry about slicing the bos_token off the front of the actual tokens, based on whether or not the engine has a bos_token?
Try running the test without the recode here -- some llamacpp models seem to need the recode to fix the token(s) immediately following the BOS token. It may have something to do with whitespace preceding the first bit of text.

And then we have to allow for the tokens that get passed to the model to have one or more tokens sliced off the end because of token healing.
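Putting the thread together, a hedged sketch of the expectation being tested (names follow the diff above; `actual_prompt_tokens` is hypothetical shorthand for the tokens the engine actually received):

```python
# Build the "expected" prompt tokens the same way the engine would.
expected_prompt_tokens = engine.tokenizer.encode(prompt.encode())
if (
    engine.tokenizer.bos_token is not None
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
):
    expected_prompt_tokens = [engine.tokenizer.bos_token_id] + expected_prompt_tokens
    # Some llamacpp tokenizers merge/split the token(s) right after BOS;
    # recode() re-canonicalizes the sequence to account for that.
    expected_prompt_tokens = engine.tokenizer.recode(expected_prompt_tokens)

# Token healing may slice one or more tokens off the end of what actually
# reaches the model, so only a prefix comparison is safe.
assert actual_prompt_tokens == expected_prompt_tokens[: len(actual_prompt_tokens)]
```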
```python
expected_prompt_tokens = engine.tokenizer.encode(prompt.encode())
if (
    engine.tokenizer.bos_token is not None
    and expected_prompt_tokens[:1] != [engine.tokenizer.bos_token_id]
```
Is there some reason this can't be `expected_prompt_tokens[0] != engine.tokenizer.bos_token_id`?
Mmm, not here, you are right. This was a copy-and-paste from elsewhere, where it DOES matter (where the `prompt` string could have been empty).
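For completeness, the two forms only differ when the token list can be empty:

```python
bos_token_id = 1  # placeholder value

tokens: list[int] = []
assert tokens[:1] != [bos_token_id]  # slicing an empty list safely yields []
try:
    tokens[0] != bos_token_id
except IndexError:
    pass  # indexing an empty list raises
```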
If we're switching to JSON serialisation, do we have a schema defined for that?
It would be good to have some pydantic schemas defined somewhere. Currently the source of truth is in the rust code here:
Looks good, great work!! :) Only one main question:

> `regex(r"\d*") + "7"`

This seems like it would be very surprising, since the intention of the first line is clearly to match numbers that end in 7. Seems like an issue arising from the lexeme boundaries. I wonder if this could go away by greedily consuming things after a lexeme as long as the expression remains regular? ...not a blocking issue though, excited to see a merge :)
Another note:

> it can be terminated by an `EOS` token, which stops with lazy semantics

Yes, that is needed for variable length patterns, since you don't want to force the model to stop early.
Merged -- thanks @hudson-ai, @mmoskal and everyone who took time to review this (@nking-1 @nopdive @riedgar-ms @slundberg) :)
To preface this all, I want to note that the overall user experience of `guidance` should change minimally, if at all. This is primarily a behind-the-scenes change that offers `guidance`:
- performance improvements via the Rust-based `llguidance` library
- a factoring-out of the core pieces of `guidance` (mainly the implementation of the parser and its interactions with the tokenizer) so that they can be reused in the server-side implementation of the `AzureGuidance` endpoints (ensuring that behavior between these remote models and local `guidance`-controlled models is as consistent as possible)
Parser
- Removed the `EarleyParser` class that operated at the byte level.
- Added a `TokenParser` class that operates at the token level, wrapping the `LLInterpreter` class from `llguidance`.
- Added a `ByteParser` class which wraps a `TokenParser` and a `ByteTokenizer` (a tokenizer with tokens directly corresponding to individual bytes plus a `BOS`/`EOS` token); this is used to support the `match` method (and the `Mock` model class?). See the sketch after this section.

Engine
- Simplified the `Engine` class in favor of directly using the mask from the `TokenParser`.
- This PR keeps this behavior, but note THIS LIKELY DOES NOT ALIGN WITH AZURE SERVER-SIDE IMPLEMENTATION. To ensure that this behavior is maintained (maybe worth a discussion), it should be moved into llguidance (@mmoskal)
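For reference, a hedged sketch of the `match` path mentioned in the Parser notes above (assuming grammar objects expose a `match` method roughly as in current guidance, returning a match object on success and `None` on failure):

```python
from guidance import gen

grammar = gen(regex=r"[0-9]+")

# Under the hood this goes through ByteParser + ByteTokenizer,
# feeding the string to the parser byte by byte.
assert grammar.match("123") is not None
assert grammar.match("12a") is None
```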
Serialization
- The `LLInterpreter` underlying the `TokenParser` expects JSON-serialized grammars, as do the `AzureGuidance` endpoints; `RemoteEngine` now expects this serialization format (no more protobuf).
- `LLInterpreter` returns JSON-serialized data (not exactly the same as what's returned by the `AzureGuidance` endpoints, but there is a lot of shared structure).
- `_schema.py` contains `pydantic` schemas used to validate/parse these response structures.
- `EngineCallResponse` is now JSON serialized/validated with `pydantic` (no more protobuf).
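As a rough illustration only (a hypothetical sketch, not the actual contents of `_schema.py`), the pydantic-based validation looks something like:

```python
from pydantic import BaseModel


class EngineCallResponse(BaseModel):
    # Illustrative fields only; see _schema.py for the real schema.
    new_bytes: bytes
    is_generated: bool
    new_token_count: int


# JSON from the engine is parsed and validated in one step:
response = EngineCallResponse.model_validate_json(
    '{"new_bytes": "hi", "is_generated": true, "new_token_count": 1}'
)
assert response.new_bytes == b"hi"
```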
New primitives:
- `Gen`
- `Lexeme`
- `Subgrammar`
- `RegularGrammar`
To understand the new primitives, we need to understand how the new parser differs from the old one. While both are Earley parsers that support general context-free grammars, the smallest "atoms" the new parser works with are more coarse-grained than the old one's: the new parser works with lexemes, while the old one worked with bytes.
Roughly, lexemes correspond to regular expressions (and string literals). These are larger chunks of text, making the parser more efficient in a lot of cases. Because lexemes are regular, the lexer (the lexeme sub-parser) can run much more quickly than the outer Earley parser.
While the Earley parser is able to handle ambiguities (e.g. `one_or_more("a") + one_or_more(select("a", "b"))` -- which expression is responsible for the second "a" in "aab"?), the lexer can't. We need a deterministic set of rules that tells us how any given string should be lexed (which lexemes are responsible for which parts of the text).

Lexemes can be lazy or greedy. A lazy `r"a+"` will only ever generate a single "a", and a lazy `r"a*"` will only ever generate the empty string. A greedy `r"a+"` can produce as many "a"s as it wants before moving on to the next lexeme.

Gen
`gen` is composed of two sub-expressions: the "body" regex and the (optional) "stop" regex. If no body regex is passed, it defaults to `r"(?s:.*)"`, i.e. `.*` that additionally matches the newline character.

When the stop regex is provided, `gen` behaves as a lazy lexeme. As soon as the full body+stop regex matches the generated text, we exit the `gen` (discarding the "stop" text) and move on to the next lexeme. This ensures that `gen` actually stops when the stop expression is produced.

When no `stop` regex is provided, it behaves as a greedy lexeme (with one caveat: it can be terminated by an `EOS` token, which stops with lazy semantics). Note that `regex` is now just an alias for `gen` with no stop expression.

Examples:
- `gen(regex=r"[0-9]+") + "xyz"`
- `gen() + "xyz"` -- the body regex `r"(?s:.*)"` has not yet failed to match, so the parser cannot force "xyz". Note that current guidance ALSO won't force "yz", as the parser will be in a "superposition" that doesn't know whether or not the `gen` has completed.
- `gen(regex=r"[0-9]+") + "123"`
- `gen(regex=r"[0-9]+", stop="123")`
The subtle changes around EOS should be a fairly small detail.
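A hedged sketch of the two modes in use (backend and model path are illustrative):

```python
from guidance import gen, models

lm = models.LlamaCpp("model.gguf")  # any local backend; illustrative

# Greedy: digits keep coming until the model emits a non-digit or EOS,
# after which the forced literal "xyz" follows.
greedy = lm + gen(name="num", regex=r"[0-9]+") + "xyz"

# Lazy: generation ends the moment body+stop matches; the stop text
# "END" is discarded rather than kept in the output.
lazy = lm + gen(name="text", stop="END")
```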
Lexeme
Not (yet) part of the "public" API (available via `guidance._grammar.Lexeme` or `guidance.library._subgrammar.lexeme`). Should only really be used when writing `Subgrammar`s / translating EBNF.
- TODO: allow `lexeme` to support a `contextual` flag (more on that later)

Subgrammar
Not (yet) part of the "public" API (available via `guidance._grammar.Subgrammar` or `guidance.library._subgrammar.subgrammar`). Mostly exists to better support generating programming languages.
- A subgrammar acts as a single opaque unit that can be terminated by whatever follows it: e.g. if `json` is a subgrammar, json() + "```" will terminate if a backtick is generated after some valid JSON (see the sketch after this list).
- The `"ignore_regex"` kwarg specifies a regular expression that will be "ignored" between lexemes. This can be used to allow flexible whitespace when generating JSON or code, for example.
- `json` has been reimplemented as a `Subgrammar`.
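A hedged sketch of the intended use (the `body` and `ignore_regex` parameter names follow this description; the exact signature may differ):

```python
from guidance.library._subgrammar import lexeme, subgrammar

# A toy "assignment" language: identifier = number, with flexible
# whitespace ignored between lexemes via ignore_regex.
assign = subgrammar(
    body=lexeme(r"[A-Za-z_]\w*") + "=" + lexeme(r"[0-9]+"),
    ignore_regex=r"[ \t]*",
)

# From the outside the subgrammar acts as one unit, so appending a
# terminator behaves like the json() + "```" example above.
grammar = assign + ";"
```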
RegularGrammar
Not (yet) part of the "public" API (available via `guidance._grammar.RegularGrammar` or `guidance.library._grammar.as_regular_grammar`).

NOTE: "manually" building regex-esque grammars should now be discouraged. For example, `select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9"))])` should be rewritten as `regex(r"0|(?:[1-9][0-9]*)")`. This is because the lexemes here are individual characters, requiring the expensive Earley parser to run. Rewriting as a `regex` makes the entire grammar into a single lexeme, allowing the cheap lexer to do all the work.

If directly writing the regex is not possible, `as_regular_grammar` (name subject to change) can wrap a grammar like the `select` above and (try to) convert it into a regex lexeme. Grammars that are not regular will fail this construction. In the future, it would be nice to automatically wrap grammars when we can, preventing users from having to think about this.
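Concretely, a hedged sketch (the `char_range`/`zero_or_more` import paths are assumptions):

```python
from guidance import regex, select
from guidance.library import char_range, zero_or_more
from guidance.library._grammar import as_regular_grammar

# Character-level lexemes: forces the expensive Earley parser to work.
slow = select(["0", char_range("1", "9") + zero_or_more(char_range("0", "9"))])

# One regex lexeme: the cheap lexer does all the work.
fast = regex(r"0|(?:[1-9][0-9]*)")

# Or convert automatically; fails if the grammar is not regular.
also_fast = as_regular_grammar(slow)
```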
Deprecations:
- `commit_point` (raises `NotImplementedError`; may reimplement in the future?). Commit points were used in `gen` to support the current "stop" mechanics and in tool calling; for now, `gen` does not support tool calling (working on this, hopefully will have something working before this PR goes through).

Biggest gotchas / changes from current guidance:
- `regex(r"\d*") + "7"` behaves surprisingly: the intention is clearly "numbers that end in 7", but the lexeme boundary between the greedy `r"\d*"` and the literal "7" breaks that reading (see the review discussion above and the sketch below)
- `r"\d"` now matches unicode digits

TODOs:
- `Gen`s or `Subgrammar`s
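To make the first gotcha concrete, a hedged sketch:

```python
from guidance import regex

# Intended reading: "any digits, ending in 7". With lexeme-based parsing,
# the greedy r"\d*" lexeme wants to consume the trailing 7 itself, so the
# two-part grammar no longer behaves like the single regex below.
surprising = regex(r"\d*") + "7"

# Safer: put the whole pattern in one lexeme.
intended = regex(r"\d*7")
```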