Can backtracking be avoided? #281
-
I am working on a grammar where there is ambiguity between identifiers and keywords, and some keywords consist of several identifiers separated by spaces (to make the DSL more user friendly and readable in English). So, for instance I have a keyword that is "required option", which could be parsed as either two identifiers, or two keywords which are later combined with an if_must<Key1, Space+, Key2, ...> - and I have some trouble with spurious actions triggering and need to backtrack to pop the state and remove the falsely matched identifiers. I have read https://github.com/taocpp/PEGTL/blob/main/doc/Actions-and-States.md#backtracking and the proposed solutions are not quite satisfying, but the section "Looking Ahead" made me think: why can't we parse the input twice ? Once just to figure out what everything is, and the second time to just trigger the actions, with perfect foreknowledge? Yes, that will require the parser to allocate some memory to keep this state, but it should be fairly compact, probably an int or so for each terminal and nonterminal. In a way, this would create an AST - and I read the documentation on the PEGTL's AST implementation but I'm not satisfied with the way it forces string comparison for node types. It would be much preferred to have a visit similar to variant - which is what the Action interface is all about. So maybe this is a riff on #248 ? Alternatively, if one wants to allow 'require' and 'option' as identifiers in some contexts, and keywords in others, is there an idiomatic way to express that? I want to avoid defining identifiers as: [a-z][a-z_0-9]* except 'require', 'option', ... (like the lua53.cpp example does) |
Beta Was this translation helpful? Give feedback.
Replies: 6 comments 1 reply
-
You should look at the "at<>" rule. This rule does exactly what you want. It attempts to parse, (but not consume), input. If the match is successful, no actions are called. You can leverage this with the if_then, or if_must rules like this: struct require : keyword<'r', 'e', 'q', 'u', 'i', 'r', 'e'> { };
struct option : keyword<'o', 'p', 't', 'i', 'o', 'n'> { };
struct require_option :
if_must< at< require >,
seq< require, option >
> { }; Obviously, you'll have to add appropriate padding for white space, but I wanted a bit of clarity here. This does parse the input twice, but the second time doesn't have "perfect foreknowledge". It's possible to do what you want with a bit of work. You'd have to write your own "at"-like rule which would have to keep track of state information and cache the knowledge. Then if the parse is successful, consume the input, and trigger all of the required actions in the appropriate order. I'd say, depending on how complex your grammar is, that it's kind of a toss-up as to whether or not it just makes more sense to parse the input the second time as is, however. |
Beta Was this translation helpful? Give feedback.
-
An easy way of visiting the nodes in an idiomatic way is to define your own node, inheriting from parse_tree::node, and adding a variant field. Then simply use std::visit and the visitor pattern.
This will automatically dispatch to the appropriate handler based on what is in the variant. For this to work, you must have unique types in the variant. (e.g. you cannot have two or more std::string entities in the variant) |
Beta Was this translation helpful? Give feedback.
-
A way of defining "require" as a keyword in some contexts and as an identifier in others is to use an action rule that returns a bool, See the documentation here: https://github.com/taocpp/PEGTL/blob/main/doc/Actions-and-States.md#boolean-return and also here: https://github.com/taocpp/PEGTL/blob/main/doc/Actions-and-States.md#apply This way it is possible to match it as a keyword or as an identifier based on parser states which are automatically passed to the action. |
Beta Was this translation helpful? Give feedback.
-
A way to reasonably distinguish between identifiers and keywordsUsing this technique, you can use a grammar rule to specify a keyword where it must be that keyword, and still allow it to be used as an identifier in other contexts. Consider a grammar where "require" is sometimes a keyword, but can be used as an identifier in other places, for example:
This could allow:
Yes, it's a fabricated example, but I think it makes my point. The detailstemplate< Rule >
struct my_action : tao::pegtl::nothing<Rule> { };
template<>
struct my_action<identifier>
{
template <typename ActionInput, typename... States>
static bool apply(const ActionInput& in, States&&... states)
{
static const std::set<std::string> keywords{"require", "option"};
auto found = keywords.find(in.string());
if (found == keywords.end()) {
// not a keyword, must be an identifier.
return true;
}
return false; // must be a keyword.
}
}; In this case the parse states aren't used, but you can use them to determine context, (passed in/changed by other actions), in order to alter the logic, (i.e. "require" is a keyword in context A, but an identifier in context B. The down-side is that you must keep the set of keywords in two places, but this can be ameliorated by use of #include and #define like this: /// keywords.def
xx(require)
xx(option)
xx(declare)
xx(next_keyword)
xx(and_so_on)
...
// end of file
/// grammar.h
// define grammar rules for the individual keywords.
#define xx(a) struct a : public TAO_PEGTL_KEYWORD( #a ) { };
#include "keywords.def"
#undef xx
// Define the set of keywords for later comparison in context.
static const set<std::string> keywords{
#define xx(a) #a,
#include "keywords.def"
#undef xx
}; |
Beta Was this translation helpful? Give feedback.
-
Wow, thank you @skyrich62 - that is a lot of input, it will take me a bit to digest and see what works for me. Thanks again! |
Beta Was this translation helpful? Give feedback.
-
I currently don't have much time, so I just want to mention two things:
|
Beta Was this translation helpful? Give feedback.
A way to reasonably distinguish between identifiers and keywords
Using this technique, you can use a grammar rule to specify a keyword where it must be that keyword, and still allow it to be used as an identifier in other contexts.
Consider a grammar where "require" is sometimes a keyword, but can be used as an identifier in other places, for example:
This could allow: