-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial lexing support for integer literals following #143. #269
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exciting!
lexer/tokenized_buffer.cpp
Outdated
// For decimal and hexadecimal digit sequences, digit separators must form | ||
// groups of 3 or 4 digits (4 or 5 characters), respectively. | ||
if (radix != 2) { | ||
// Check for digit separators in the expected positions. | ||
unsigned stride = (radix == 10 ? 4 : 5); | ||
for (auto pos = text.end(); pos - text.begin() >= stride; /*in loop*/) { | ||
pos -= stride; | ||
if (*pos != '_') { | ||
emitter.EmitError<IrregularDigitSeparators>( | ||
[&](IrregularDigitSeparators::Substitutions &subst) { | ||
subst.radix = radix; | ||
}); | ||
buffer.has_errors = true; | ||
digit_separators = 0; | ||
break; | ||
} | ||
--digit_separators; | ||
} | ||
|
||
// Check there weren't any other digit separators. | ||
if (digit_separators) { | ||
emitter.EmitError<IrregularDigitSeparators>( | ||
[&](IrregularDigitSeparators::Substitutions &subst) { | ||
subst.radix = radix; | ||
}); | ||
buffer.has_errors = true; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extract this to a helper function? Should also make the conditions a bit simpler:
if (radix != 2 & digit_separators)
CheckDigitSeparatorSequences(...)`
return {.ok = true, .has_digit_separators = digit_separators}`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I used a lambda rather than a separate function because this code has invariants that the caller sets up (specifically that it's given the number of digit separators found in the string).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I don't think this helps the readability as much as extracting the function would. It somewhat still forces the reader to work through the long function body.
I'm just suggesting a file-local helper function so I don't think the invariants are too complex? The code even seems to already check them with asserts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done. The code didn't already check its invariants with asserts, except in one corner case; now that it's (in principle) callable from elsewhere in the file, I've made it do so.
lexer/tokenized_buffer.cpp
Outdated
if ((c >= '0' && c <= max_decimal) || | ||
(radix == 16 && c >= 'A' && c <= 'Z')) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it'd be easier to read to use a std::bitset<256>
here? Setting aside any performance concerns, above it'd be a bit more code but somewhat obvious code setting up the set. And here it'd just be if (valid_digits.test(static_cast<unsigned_char>(c)) {
which at least for me is easier to understand than this logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I've checked and it looks like we generate good enough code for this: https://godbolt.org/z/o79rsz
I mean, I would have liked https://godbolt.org/z/W76Gq6 more, but I don't suppose I can nerd-snipe anyone into getting the optimizer to produce that... :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG, why-oh-why did you have to show me how bad these are? I am so sad now.
Anyways, nothing to do here. I think the more data-oriented implementation is a bit easier to read anyways, and we can replace the abstraction if/when desired or meaningful.
lexer/tokenized_buffer.cpp
Outdated
pos -= stride; | ||
if (*pos != '_') { | ||
emitter.EmitError<IrregularDigitSeparators>( | ||
[&](IrregularDigitSeparators::Substitutions &subst) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idle speculation: [&](auto& subt)
might be a nice idiom for this since the type is already stated earlier and the replication is not worth a ton. Alternately, I wonder if there is a way to infer the template from the lambda's parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to restructure how EmitError
works in general, though as a separate patch -- I think we should be returning the substitutions by value rather than mutating an uninitialized object. But that would remove our ability to use auto
. I think it'd also make sense to have an overload that just takes the substitutions directly, for the case where there is no overhead in computing them. (Which is always, when emitting an error, because -- I hope! -- errors are always emitted.)
OK if I defer doing things here to a follow-on patch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolutely
lexer/tokenized_buffer.cpp
Outdated
continue; | ||
} | ||
|
||
if (c == '_') { | ||
// A digit separator cannot appear at the start of a digit sequence, | ||
// next to another digit separator, or at the end. | ||
if (it == text.begin() || it[-1] == '_' || it + 1 == text.end()) { | ||
if (i == 0 || text[i-1] == '_' || i + 1 == n) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting nit:
if (i == 0 || text[i-1] == '_' || i + 1 == n) { | |
if (i == 0 || text[i - 1] == '_' || i + 1 == n) { |
Does clang-format
not fix this? Wondering if we're missing a setting on it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clang-format
does fix it. Applied.
lexer/tokenized_buffer.cpp
Outdated
if ((c >= '0' && c <= max_decimal) || | ||
(radix == 16 && c >= 'A' && c <= 'Z')) { | ||
continue; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OMG, why-oh-why did you have to show me how bad these are? I am so sad now.
Anyways, nothing to do here. I think the more data-oriented implementation is a bit easier to read anyways, and we can replace the abstraction if/when desired or meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a formatting nit fix and fully extracting the helper function.
If extracting the helper function really isn't resulting in a good solution for you, still LGTM but I'd like to revisit what would work better here -- trying to help the folks who find a distinct advantage from shorter function bodies for readability.
No description provided.