Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial lexing support for integer literals following #143. #269

Merged
merged 6 commits into from
Feb 19, 2021
Merged

Initial lexing support for integer literals following #143. #269

merged 6 commits into from
Feb 19, 2021

Conversation

zygoloid
Copy link
Contributor

No description provided.

@google-cla google-cla bot added the cla: yes PR meets CLA requirements according to bot. label Feb 12, 2021
Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exciting!

lexer/tokenized_buffer.cpp Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
Comment on lines 289 to 316
// For decimal and hexadecimal digit sequences, digit separators must form
// groups of 3 or 4 digits (4 or 5 characters), respectively.
if (radix != 2) {
// Check for digit separators in the expected positions.
unsigned stride = (radix == 10 ? 4 : 5);
for (auto pos = text.end(); pos - text.begin() >= stride; /*in loop*/) {
pos -= stride;
if (*pos != '_') {
emitter.EmitError<IrregularDigitSeparators>(
[&](IrregularDigitSeparators::Substitutions &subst) {
subst.radix = radix;
});
buffer.has_errors = true;
digit_separators = 0;
break;
}
--digit_separators;
}

// Check there weren't any other digit separators.
if (digit_separators) {
emitter.EmitError<IrregularDigitSeparators>(
[&](IrregularDigitSeparators::Substitutions &subst) {
subst.radix = radix;
});
buffer.has_errors = true;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extract this to a helper function? Should also make the conditions a bit simpler:

if (radix != 2 & digit_separators)
  CheckDigitSeparatorSequences(...)`

return {.ok = true, .has_digit_separators = digit_separators}`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I used a lambda rather than a separate function because this code has invariants that the caller sets up (specifically that it's given the number of digit separators found in the string).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I don't think this helps the readability as much as extracting the function would. It somewhat still forces the reader to work through the long function body.

I'm just suggesting a file-local helper function so I don't think the invariants are too complex? The code even seems to already check them with asserts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done. The code didn't already check its invariants with asserts, except in one corner case; now that it's (in principle) callable from elsewhere in the file, I've made it do so.

Comment on lines 261 to 264
if ((c >= '0' && c <= max_decimal) ||
(radix == 16 && c >= 'A' && c <= 'Z')) {
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it'd be easier to read to use a std::bitset<256> here? Setting aside any performance concerns, above it'd be a bit more code but somewhat obvious code setting up the set. And here it'd just be if (valid_digits.test(static_cast<unsigned_char>(c)) { which at least for me is easier to understand than this logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I've checked and it looks like we generate good enough code for this: https://godbolt.org/z/o79rsz

I mean, I would have liked https://godbolt.org/z/W76Gq6 more, but I don't suppose I can nerd-snipe anyone into getting the optimizer to produce that... :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG, why-oh-why did you have to show me how bad these are? I am so sad now.

Anyways, nothing to do here. I think the more data-oriented implementation is a bit easier to read anyways, and we can replace the abstraction if/when desired or meaningful.

lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer_test.cpp Outdated Show resolved Hide resolved
lexer/tokenized_buffer.cpp Outdated Show resolved Hide resolved
pos -= stride;
if (*pos != '_') {
emitter.EmitError<IrregularDigitSeparators>(
[&](IrregularDigitSeparators::Substitutions &subst) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idle speculation: [&](auto& subt) might be a nice idiom for this since the type is already stated earlier and the replication is not worth a ton. Alternately, I wonder if there is a way to infer the template from the lambda's parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to restructure how EmitError works in general, though as a separate patch -- I think we should be returning the substitutions by value rather than mutating an uninitialized object. But that would remove our ability to use auto. I think it'd also make sense to have an overload that just takes the substitutions directly, for the case where there is no overhead in computing them. (Which is always, when emitting an error, because -- I hope! -- errors are always emitted.)

OK if I defer doing things here to a follow-on patch?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

absolutely

continue;
}

if (c == '_') {
// A digit separator cannot appear at the start of a digit sequence,
// next to another digit separator, or at the end.
if (it == text.begin() || it[-1] == '_' || it + 1 == text.end()) {
if (i == 0 || text[i-1] == '_' || i + 1 == n) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting nit:

Suggested change
if (i == 0 || text[i-1] == '_' || i + 1 == n) {
if (i == 0 || text[i - 1] == '_' || i + 1 == n) {

Does clang-format not fix this? Wondering if we're missing a setting on it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-format does fix it. Applied.

Comment on lines 261 to 264
if ((c >= '0' && c <= max_decimal) ||
(radix == 16 && c >= 'A' && c <= 'Z')) {
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG, why-oh-why did you have to show me how bad these are? I am so sad now.

Anyways, nothing to do here. I think the more data-oriented implementation is a bit easier to read anyways, and we can replace the abstraction if/when desired or meaningful.

Copy link
Contributor

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a formatting nit fix and fully extracting the helper function.

If extracting the helper function really isn't resulting in a good solution for you, still LGTM but I'd like to revisit what would work better here -- trying to help the folks who find a distinct advantage from shorter function bodies for readability.

@zygoloid zygoloid merged commit 8ec213a into carbon-language:trunk Feb 19, 2021
@zygoloid zygoloid deleted the lexer branch February 19, 2021 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes PR meets CLA requirements according to bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants