Initial lexing support for integer literals following #143. #269

zygoloid · 2021-02-12T23:56:59Z

No description provided.

chandlerc

Exciting!

chandlerc · 2021-02-13T02:03:55Z

lexer/tokenized_buffer.cpp

+    // For decimal and hexadecimal digit sequences, digit separators must form
+    // groups of 3 or 4 digits (4 or 5 characters), respectively.
+    if (radix != 2) {
+      // Check for digit separators in the expected positions.
+      unsigned stride = (radix == 10 ? 4 : 5);
+      for (auto pos = text.end(); pos - text.begin() >= stride; /*in loop*/) {
+        pos -= stride;
+        if (*pos != '_') {
+          emitter.EmitError<IrregularDigitSeparators>(
+              [&](IrregularDigitSeparators::Substitutions &subst) {
+                subst.radix = radix;
+              });
+          buffer.has_errors = true;
+          digit_separators = 0;
+          break;
+        }
+        --digit_separators;
+      }
+
+      // Check there weren't any other digit separators.
+      if (digit_separators) {
+        emitter.EmitError<IrregularDigitSeparators>(
+            [&](IrregularDigitSeparators::Substitutions &subst) {
+              subst.radix = radix;
+            });
+        buffer.has_errors = true;
+      }
+    }


Extract this to a helper function? Should also make the conditions a bit simpler:

    if (radix != 2 && digit_separators) CheckDigitSeparatorSequences(...);
    return {.ok = true, .has_digit_separators = digit_separators};

Done. I used a lambda rather than a separate function because this code has invariants that the caller sets up (specifically that it's given the number of digit separators found in the string).

FWIW, I don't think this helps the readability as much as extracting the function would. It somewhat still forces the reader to work through the long function body.

I'm just suggesting a file-local helper function so I don't think the invariants are too complex? The code even seems to already check them with asserts?

OK, done. The code didn't already check its invariants with asserts, except in one corner case; now that it's (in principle) callable from elsewhere in the file, I've made it do so.
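
For readers following the thread, here is a minimal sketch of what such a file-local helper could look like. It is not the code that landed in the patch: the name `HasRegularDigitSeparators`, its signature, and the boolean return value are assumptions made for illustration, and the sketch returns a result instead of emitting a diagnostic. It keeps the stride logic from the quoted diff and asserts the invariants the caller is expected to set up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical file-local helper (name and signature are assumptions).
// Returns true if every '_' in `text` sits on a regular group boundary
// counted from the end: every 3 digits for radix 10, every 4 for radix 16.
// `num_digit_separators` is the number of '_' the caller already counted.
static bool HasRegularDigitSeparators(std::string_view text, int radix,
                                      int num_digit_separators) {
  // Invariants that the enclosing function used to guarantee implicitly.
  assert((radix == 10 || radix == 16) && "binary literals are not checked");
  assert(num_digit_separators ==
             static_cast<int>(std::count(text.begin(), text.end(), '_')) &&
         "caller must pass the exact separator count");

  // With no separators there is nothing to check.
  if (num_digit_separators == 0) {
    return true;
  }

  // 3 or 4 digits plus one separator character per group.
  const std::size_t stride = (radix == 10 ? 4 : 5);
  int remaining = num_digit_separators;
  for (std::size_t pos = text.size(); pos >= stride;) {
    pos -= stride;
    if (text[pos] != '_') {
      return false;  // A separator is missing at an expected position.
    }
    --remaining;
  }

  // Any separator left over was in an unexpected position.
  return remaining == 0;
}
```

In this shape the caller would emit `IrregularDigitSeparators` and set `buffer.has_errors` once, when the helper returns false, rather than in two places.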

chandlerc · 2021-02-13T02:15:00Z

lexer/tokenized_buffer.cpp

+      if ((c >= '0' && c <= max_decimal) ||
+          (radix == 16 && c >= 'A' && c <= 'Z')) {
+        continue;
+      }


I wonder if it'd be easier to read to use a std::bitset<256> here? Setting aside any performance concerns, it'd take a bit more code above, but it would be fairly obvious code that sets up the set. And here it'd just be if (valid_digits.test(static_cast<unsigned char>(c))) { which at least for me is easier to understand than this logic.

Done. I've checked and it looks like we generate good enough code for this: https://godbolt.org/z/o79rsz

I mean, I would have liked https://godbolt.org/z/W76Gq6 more, but I don't suppose I can nerd-snipe anyone into getting the optimizer to produce that... :)

OMG, why-oh-why did you have to show me how bad these are? I am so sad now.

Anyways, nothing to do here. I think the more data-oriented implementation is a bit easier to read anyways, and we can replace the abstraction if/when desired or meaningful.
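
A minimal sketch of the data-oriented approach discussed in this thread, for illustration only. The helper names `MakeValidDigitSet` and `IsValidDigit` are hypothetical, and the sketch assumes that hexadecimal digits are the uppercase letters A through F, which is narrower than the 'A' to 'Z' range in the quoted condition.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string_view>

// Builds the set of characters accepted as digits for `radix`.
// Hypothetical helper; assumes only uppercase A-F count as hex digits.
static std::bitset<256> MakeValidDigitSet(int radix) {
  assert(radix == 2 || radix == 10 || radix == 16);
  const std::string_view digits = "0123456789ABCDEF";
  std::bitset<256> valid_digits;
  // The first `radix` characters of `digits` are exactly the valid digits.
  for (char c : digits.substr(0, static_cast<std::size_t>(radix))) {
    valid_digits.set(static_cast<unsigned char>(c));
  }
  return valid_digits;
}

// The membership test suggested in the review comment. The cast keeps the
// index in [0, 255] even when `char` is signed.
static bool IsValidDigit(const std::bitset<256>& valid_digits, char c) {
  return valid_digits.test(static_cast<unsigned char>(c));
}
```

The set is built once per literal, so the per-character check in the lexing loop reduces to a single `test` call.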

fowles · 2021-02-13T02:55:35Z

lexer/tokenized_buffer.cpp

+        pos -= stride;
+        if (*pos != '_') {
+          emitter.EmitError<IrregularDigitSeparators>(
+              [&](IrregularDigitSeparators::Substitutions &subst) {


Idle speculation: [&](auto& subst) might be a nice idiom for this since the type is already stated earlier and the replication is not worth a ton. Alternately, I wonder if there is a way to infer the template from the lambda's parameters.

I'd like to restructure how EmitError works in general, though as a separate patch -- I think we should be returning the substitutions by value rather than mutating an uninitialized object. But that would remove our ability to use auto. I think it'd also make sense to have an overload that just takes the substitutions directly, for the case where there is no overhead in computing them. (Which is always, when emitting an error, because -- I hope! -- errors are always emitted.)

OK if I defer doing things here to a follow-on patch?
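
To make the two ideas in this thread concrete, here is a small sketch under stated assumptions. The `Emitter` class, its `messages()` accessor, and the `Format` function are invented stand-ins, not the project's diagnostic API. The sketch shows the `[&](auto& subst)` idiom working with a callback-based `EmitError`, next to a hypothetical overload that takes the substitutions directly.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Stand-in for the diagnostic in the patch; `Format` is invented here.
struct IrregularDigitSeparators {
  struct Substitutions {
    int radix;
  };
  static std::string Format(const Substitutions& subst) {
    return "irregular digit separators in radix " +
           std::to_string(subst.radix) + " literal";
  }
};

// Hypothetical emitter that records formatted messages.
class Emitter {
 public:
  // Callback form: the diagnostic type is named explicitly, so the lambda
  // parameter can be `auto&`. Enabled only for callables, which keeps it
  // from competing with the by-value overload below.
  template <typename Diagnostic, typename Callback,
            typename = std::enable_if_t<std::is_invocable_v<
                Callback&, typename Diagnostic::Substitutions&>>>
  void EmitError(Callback fill) {
    typename Diagnostic::Substitutions subst{};
    fill(subst);
    messages_.push_back(Diagnostic::Format(subst));
  }

  // Direct form: the caller passes the substitutions by value.
  template <typename Diagnostic>
  void EmitError(typename Diagnostic::Substitutions subst) {
    messages_.push_back(Diagnostic::Format(subst));
  }

  const std::vector<std::string>& messages() const { return messages_; }

 private:
  std::vector<std::string> messages_;
};
```

With these definitions, `emitter.EmitError<IrregularDigitSeparators>([&](auto& subst) { subst.radix = radix; });` and `emitter.EmitError<IrregularDigitSeparators>(IrregularDigitSeparators::Substitutions{radix});` would record the same message.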

chandlerc · 2021-02-19T10:18:17Z

lexer/tokenized_buffer.cpp

        continue;
      }

      if (c == '_') {
        // A digit separator cannot appear at the start of a digit sequence,
        // next to another digit separator, or at the end.
-        if (it == text.begin() || it[-1] == '_' || it + 1 == text.end()) {
+        if (i == 0 || text[i-1] == '_' || i + 1 == n) {


Formatting nit:

Suggested change:

-        if (i == 0 || text[i-1] == '_' || i + 1 == n) {
+        if (i == 0 || text[i - 1] == '_' || i + 1 == n) {

Does clang-format not fix this? Wondering if we're missing a setting on it...

clang-format does fix it. Applied.

chandlerc

LGTM with a formatting nit fix and fully extracting the helper function.

If extracting the helper function really isn't resulting in a good solution for you, still LGTM but I'd like to revisit what would work better here -- trying to help the folks who find a distinct advantage from shorter function bodies for readability.

Initial lexing support for integer literals following #143.

87506d4

zygoloid requested a review from chandlerc February 12, 2021 23:56

google-cla bot added the cla: yes PR meets CLA requirements according to bot. label Feb 12, 2021

chandlerc requested changes Feb 13, 2021

fowles reviewed Feb 13, 2021

zygoloid added 2 commits February 16, 2021 12:31

Address review feedback.

fcd30d6

Switch to using a bitset to determine digit validity.

94b3e72

zygoloid mentioned this pull request Feb 18, 2021

Initial lexing support for real literals following #143. #273

Merged

chandlerc reviewed Feb 19, 2021

chandlerc approved these changes Feb 19, 2021

zygoloid added 3 commits February 19, 2021 13:26

Address review comments.

57a7406

Remove redundant assert.

9ff89cc

Switch over two comments so the more detailed comment is closer to the relevant detail.

396f487

zygoloid merged commit 8ec213a into carbon-language:trunk Feb 19, 2021

zygoloid deleted the lexer branch February 19, 2021 22:02

chandlerc pushed a commit that referenced this pull request Jun 28, 2022

Initial lexing support for integer literals following #143. (#269)

7670b63
