Add support for the new f-string tokens per PEP 701 (#6659)

## Summary

This PR adds support in the lexer for the newly added f-string tokens as per PEP 701. The following new tokens are added:

* `FStringStart`: Token value for the start of an f-string. This includes the `f`/`F`/`fr` prefix and the opening quote(s).
* `FStringMiddle`: Token value that includes the portion of text inside the f-string that's not part of the expression part and isn't an opening or closing brace.
* `FStringEnd`: Token value for the end of an f-string. This includes the closing quote.

Additionally, a new `Exclamation` token is added for conversion (`f"{foo!s}"`) as that's part of an expression.

## Test Plan

New test cases are added for various possibilities using snapshot testing. The output has been verified using python/cpython@f2cc00527e.

## Benchmarks

_I've put the number of f-strings for each of the following files after the file name_

```
lexer/large/dataset.py (1)       1.05   612.6±91.60µs    66.4 MB/sec    1.00   584.7±33.72µs    69.6 MB/sec
lexer/numpy/ctypeslib.py (0)     1.01   131.8±3.31µs    126.3 MB/sec    1.00   130.9±5.37µs    127.2 MB/sec
lexer/numpy/globals.py (1)       1.02    13.2±0.43µs    222.7 MB/sec    1.00    13.0±0.41µs    226.8 MB/sec
lexer/pydantic/types.py (8)      1.13   285.0±11.72µs    89.5 MB/sec    1.00   252.9±10.13µs   100.8 MB/sec
lexer/unicode/pypinyin.py (0)    1.03    32.9±1.92µs    127.5 MB/sec    1.00    31.8±1.25µs    132.0 MB/sec
```

It seems that overall the lexer has regressed. I profiled every file mentioned above and found one improvement, which is made in 098ee5d. Otherwise, I don't see anything else. A few notes from isolating the f-string part in the profile:

* As we're adding new tokens and functionality to emit them, I expect the lexer to take more time because of more code.
* The `lex_fstring_middle_or_end` function takes the most time, followed by the `current_mut` line when lexing the `:` token. The latter is to check whether we're at the start of a format spec or not.
* In an f-string-heavy file such as https://github.com/python/cpython/blob/main/Lib/test/test_fstring.py [^1] (293), most of the time in `lex_fstring_middle_or_end` is accounted for by string allocation for the string literal part of the `FStringMiddle` token (https://share.firefox.dev/3ErEa1W)

I don't see anything out of the ordinary for the `pydantic/types` profile (https://share.firefox.dev/45XcLRq)

fixes: #7042

[^1]: We could add this in lexer and parser benchmark
astral-sh · Sep 18, 2023 · 470afc0
1 parent c2bd8af
commit 470afc0
Showing 24 changed files with 2,317 additions and 11 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/crates/ruff_python_parser/Cargo.toml b/crates/ruff_python_parser/Cargo.toml
@@ -18,6 +18,7 @@ ruff_python_ast = { path = "../ruff_python_ast" }
 ruff_text_size = { path = "../ruff_text_size" }
 
 anyhow = { workspace = true }
+bitflags = { workspace = true }
 is-macro = { workspace = true }
 itertools = { workspace = true }
 lalrpop-util = { version = "0.20.0", default-features = false }

diff --git a/crates/ruff_python_parser/src/lexer.rs b/crates/ruff_python_parser/src/lexer.rs
diff --git a/crates/ruff_python_parser/src/lexer/cursor.rs b/crates/ruff_python_parser/src/lexer/cursor.rs
@@ -96,6 +96,18 @@ impl<'a> Cursor<'a> {
         }
     }
 
+    pub(super) fn eat_char3(&mut self, c1: char, c2: char, c3: char) -> bool {
+        let mut chars = self.chars.clone();
+        if chars.next() == Some(c1) && chars.next() == Some(c2) && chars.next() == Some(c3) {
+            self.bump();
+            self.bump();
+            self.bump();
+            true
+        } else {
+            false
+        }
+    }
+
     pub(super) fn eat_if<F>(&mut self, mut predicate: F) -> Option<char>
     where
         F: FnMut(char) -> bool,
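
The all-or-nothing behaviour of `eat_char3` can be tried outside the lexer with a minimal sketch. The `Cursor` below is a hypothetical stand-in built on `std::str::Chars`; its `new`, `bump`, and `rest` helpers are assumptions for illustration and are not the crate's actual cursor.

```rust
use std::str::Chars;

/// Hypothetical stand-in for the lexer's `Cursor`; only what this example needs.
struct Cursor<'a> {
    chars: Chars<'a>,
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Self { chars: source.chars() }
    }

    /// Consumes the next character, if any.
    fn bump(&mut self) -> Option<char> {
        self.chars.next()
    }

    /// Consumes three characters only if all three match; otherwise consumes nothing.
    fn eat_char3(&mut self, c1: char, c2: char, c3: char) -> bool {
        // Look ahead on a clone so a mismatch leaves the real iterator untouched.
        let mut chars = self.chars.clone();
        if chars.next() == Some(c1) && chars.next() == Some(c2) && chars.next() == Some(c3) {
            self.bump();
            self.bump();
            self.bump();
            true
        } else {
            false
        }
    }

    /// Returns the text that has not been consumed yet.
    fn rest(&self) -> &'a str {
        self.chars.as_str()
    }
}

fn main() {
    // Three quotes are present: all of them are consumed.
    let mut cursor = Cursor::new(r#""""body"#);
    assert!(cursor.eat_char3('"', '"', '"'));
    assert_eq!(cursor.rest(), "body");

    // Only two quotes are present: nothing is consumed.
    let mut cursor = Cursor::new(r#"""x"#);
    assert!(!cursor.eat_char3('"', '"', '"'));
    assert_eq!(cursor.rest(), r#"""x"#);
}
```

Cloning a `Chars` iterator is cheap (it copies a pair of pointers), which is presumably why the lookahead is written this way rather than with a separate peek buffer.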

diff --git a/crates/ruff_python_parser/src/lexer/fstring.rs b/crates/ruff_python_parser/src/lexer/fstring.rs
@@ -0,0 +1,158 @@
+use bitflags::bitflags;
+
+use ruff_text_size::TextSize;
+
+bitflags! {
+    #[derive(Debug)]
+    pub(crate) struct FStringContextFlags: u8 {
+        /// The current f-string is a triple-quoted f-string i.e., the number of
+        /// opening quotes is 3. If this flag is not set, the number of opening
+        /// quotes is 1.
+        const TRIPLE = 1 << 0;
+
+        /// The current f-string is a double-quoted f-string. If this flag is not
+        /// set, the current f-string is a single-quoted f-string.
+        const DOUBLE = 1 << 1;
+
+        /// The current f-string is a raw f-string i.e., prefixed with `r`/`R`.
+        /// If this flag is not set, the current f-string is a normal f-string.
+        const RAW = 1 << 2;
+    }
+}
+
+/// The context representing the current f-string that the lexer is in.
+#[derive(Debug)]
+pub(crate) struct FStringContext {
+    flags: FStringContextFlags,
+
+    /// The level of nesting for the lexer when it entered the current f-string.
+    /// The nesting level includes all kinds of parentheses i.e., round, square,
+    /// and curly.
+    nesting: u32,
+
+    /// The current depth of format spec for the current f-string. This is because
+    /// there can be multiple format specs nested for the same f-string.
+    /// For example, `{a:{b:{c}}}` has 3 format specs.
+    format_spec_depth: u32,
+}
+
+impl FStringContext {
+    pub(crate) const fn new(flags: FStringContextFlags, nesting: u32) -> Self {
+        Self {
+            flags,
+            nesting,
+            format_spec_depth: 0,
+        }
+    }
+
+    pub(crate) const fn nesting(&self) -> u32 {
+        self.nesting
+    }
+
+    /// Returns the quote character for the current f-string.
+    pub(crate) const fn quote_char(&self) -> char {
+        if self.flags.contains(FStringContextFlags::DOUBLE) {
+            '"'
+        } else {
+            '\''
+        }
+    }
+
+    /// Returns the number of quotes for the current f-string.
+    pub(crate) const fn quote_size(&self) -> TextSize {
+        if self.is_triple_quoted() {
+            TextSize::new(3)
+        } else {
+            TextSize::new(1)
+        }
+    }
+
+    /// Returns the triple quotes for the current f-string if it is a triple-quoted
+    /// f-string, `None` otherwise.
+    pub(crate) const fn triple_quotes(&self) -> Option<&'static str> {
+        if self.is_triple_quoted() {
+            if self.flags.contains(FStringContextFlags::DOUBLE) {
+                Some(r#"""""#)
+            } else {
+                Some("'''")
+            }
+        } else {
+            None
+        }
+    }
+
+    /// Returns `true` if the current f-string is a raw f-string.
+    pub(crate) const fn is_raw_string(&self) -> bool {
+        self.flags.contains(FStringContextFlags::RAW)
+    }
+
+    /// Returns `true` if the current f-string is a triple-quoted f-string.
+    pub(crate) const fn is_triple_quoted(&self) -> bool {
+        self.flags.contains(FStringContextFlags::TRIPLE)
+    }
+
+    /// Calculates the number of open parentheses for the current f-string
+    /// based on the current level of nesting for the lexer.
+    const fn open_parentheses_count(&self, current_nesting: u32) -> u32 {
+        current_nesting.saturating_sub(self.nesting)
+    }
+
+    /// Returns `true` if the lexer is in a f-string expression i.e., between
+    /// two curly braces.
+    pub(crate) const fn is_in_expression(&self, current_nesting: u32) -> bool {
+        self.open_parentheses_count(current_nesting) > self.format_spec_depth
+    }
+
+    /// Returns `true` if the lexer is in a f-string format spec i.e., after a colon.
+    pub(crate) const fn is_in_format_spec(&self, current_nesting: u32) -> bool {
+        self.format_spec_depth > 0 && !self.is_in_expression(current_nesting)
+    }
+
+    /// Returns `true` if the context is in a valid position to start format spec
+    /// i.e., at the same level of nesting as the opening parentheses token.
+    /// Increments the format spec depth if it is.
+    ///
+    /// This assumes that the current character for the lexer is a colon (`:`).
+    pub(crate) fn try_start_format_spec(&mut self, current_nesting: u32) -> bool {
+        if self
+            .open_parentheses_count(current_nesting)
+            .saturating_sub(self.format_spec_depth)
+            == 1
+        {
+            self.format_spec_depth += 1;
+            true
+        } else {
+            false
+        }
+    }
+
+    /// Decrements the format spec depth unconditionally.
+    pub(crate) fn end_format_spec(&mut self) {
+        self.format_spec_depth = self.format_spec_depth.saturating_sub(1);
+    }
+}
+
+/// The f-strings stack is used to keep track of all the f-strings that the
+/// lexer encounters. This is necessary because f-strings can be nested.
+#[derive(Debug, Default)]
+pub(crate) struct FStrings {
+    stack: Vec<FStringContext>,
+}
+
+impl FStrings {
+    pub(crate) fn push(&mut self, context: FStringContext) {
+        self.stack.push(context);
+    }
+
+    pub(crate) fn pop(&mut self) -> Option<FStringContext> {
+        self.stack.pop()
+    }
+
+    pub(crate) fn current(&self) -> Option<&FStringContext> {
+        self.stack.last()
+    }
+
+    pub(crate) fn current_mut(&mut self) -> Option<&mut FStringContext> {
+        self.stack.last_mut()
+    }
+}
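
To see how `nesting` and `format_spec_depth` interact, here is a rough, dependency-free sketch. The struct copies the depth-related methods from the diff above with the flags left out. The `classify_colons` driver is hypothetical: it assumes the lexer bumps its nesting on every opening bracket, calls `try_start_format_spec` on `:`, and calls `end_format_spec` on a `}` reached while in a format spec. It also ignores string literals inside expressions, so it is not how the real lexer walks the input.

```rust
/// Trimmed copy of `FStringContext` from the diff, without the flags.
struct FStringContext {
    nesting: u32,
    format_spec_depth: u32,
}

impl FStringContext {
    fn new(nesting: u32) -> Self {
        Self { nesting, format_spec_depth: 0 }
    }

    fn open_parentheses_count(&self, current_nesting: u32) -> u32 {
        current_nesting.saturating_sub(self.nesting)
    }

    fn is_in_expression(&self, current_nesting: u32) -> bool {
        self.open_parentheses_count(current_nesting) > self.format_spec_depth
    }

    fn is_in_format_spec(&self, current_nesting: u32) -> bool {
        self.format_spec_depth > 0 && !self.is_in_expression(current_nesting)
    }

    fn try_start_format_spec(&mut self, current_nesting: u32) -> bool {
        let open = self.open_parentheses_count(current_nesting);
        if open.saturating_sub(self.format_spec_depth) == 1 {
            self.format_spec_depth += 1;
            true
        } else {
            false
        }
    }

    fn end_format_spec(&mut self) {
        self.format_spec_depth = self.format_spec_depth.saturating_sub(1);
    }
}

/// Hypothetical driver: for every `:` in the body of an f-string, reports
/// whether it would start a format spec.
fn classify_colons(body: &str) -> Vec<bool> {
    let mut nesting = 0u32;
    let mut context = FStringContext::new(nesting);
    let mut result = Vec::new();
    for c in body.chars() {
        match c {
            '{' | '[' | '(' => nesting += 1,
            '}' | ']' | ')' => {
                // Assumption: a `}` seen while in a format spec closes that spec.
                if c == '}' && context.is_in_format_spec(nesting) {
                    context.end_format_spec();
                }
                nesting = nesting.saturating_sub(1);
            }
            ':' => result.push(context.try_start_format_spec(nesting)),
            _ => {}
        }
    }
    result
}

fn main() {
    // The slice colon sits one bracket deeper, so only the second colon counts.
    assert_eq!(classify_colons("{x[a:b]:>{w}}"), vec![false, true]);
    // A nested replacement field inside the format spec does not add a colon.
    assert_eq!(classify_colons("{value:{width}}"), vec![true]);
    // A colon inside parentheses belongs to the expression.
    assert_eq!(classify_colons("{(lambda x:1)}"), vec![false]);
}
```

The check `open - format_spec_depth == 1` is what separates a format-spec colon from a slice or lambda colon: the colon must sit directly inside the replacement field's own brace, with no other bracket open.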
diff --git a/...es/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__empty_fstrings.snap b/...es/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__empty_fstrings.snap
@@ -0,0 +1,66 @@
+---
+source: crates/ruff_python_parser/src/lexer.rs
+expression: lex_source(source)
+---
+[
+    (
+        FStringStart,
+        0..2,
+    ),
+    (
+        FStringEnd,
+        2..3,
+    ),
+    (
+        String {
+            value: "",
+            kind: String,
+            triple_quoted: false,
+        },
+        4..6,
+    ),
+    (
+        FStringStart,
+        7..9,
+    ),
+    (
+        FStringEnd,
+        9..10,
+    ),
+    (
+        FStringStart,
+        11..13,
+    ),
+    (
+        FStringEnd,
+        13..14,
+    ),
+    (
+        String {
+            value: "",
+            kind: String,
+            triple_quoted: false,
+        },
+        15..17,
+    ),
+    (
+        FStringStart,
+        18..22,
+    ),
+    (
+        FStringEnd,
+        22..25,
+    ),
+    (
+        FStringStart,
+        26..30,
+    ),
+    (
+        FStringEnd,
+        30..33,
+    ),
+    (
+        Newline,
+        33..33,
+    ),
+]
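
The token widths in this snapshot line up with the context flags: an `FStringStart` covers the prefix plus one or three quotes, and the matching `FStringEnd` is one or three characters wide. The sketch below mirrors that relationship with plain `u8` constants instead of the `bitflags` type. `flags_from_start` is a hypothetical helper, not part of the PR, and the start-token texts used in `main` are illustrative since the snapshot does not include the test source.

```rust
// Plain-constant stand-ins for the `FStringContextFlags` bits.
const TRIPLE: u8 = 1 << 0;
const DOUBLE: u8 = 1 << 1;
const RAW: u8 = 1 << 2;

/// Hypothetical helper: derives the flag bits from the text of an
/// `FStringStart` token such as `f"` or `rf'''`.
fn flags_from_start(start: &str) -> u8 {
    // Everything after the alphabetic prefix is the opening quote(s).
    let quotes = start.trim_start_matches(|c: char| c.is_ascii_alphabetic());
    let prefix = &start[..start.len() - quotes.len()];

    let mut flags = 0;
    if quotes.len() == 3 {
        flags |= TRIPLE;
    }
    if quotes.starts_with('"') {
        flags |= DOUBLE;
    }
    if prefix.chars().any(|c| matches!(c, 'r' | 'R')) {
        flags |= RAW;
    }
    flags
}

/// Width of the matching `FStringEnd` token, mirroring `quote_size`.
fn quote_size(flags: u8) -> usize {
    if flags & TRIPLE != 0 {
        3
    } else {
        1
    }
}

fn main() {
    // A two-character start token pairs with a one-character end token.
    let flags = flags_from_start("f\"");
    assert_eq!(flags, DOUBLE);
    assert_eq!(quote_size(flags), 1);

    // A four-character start token pairs with a three-character end token.
    let flags = flags_from_start("f'''");
    assert_eq!(flags, TRIPLE);
    assert_eq!(quote_size(flags), 3);
}
```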
diff --git a/crates/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring.snap b/crates/ruff_python_parser/src/snapshots/ruff_python_parser__lexer__tests__fstring.snap
@@ -0,0 +1,88 @@
+---
+source: crates/ruff_python_parser/src/lexer.rs
+expression: lex_source(source)
+---
+[
+    (
+        FStringStart,
+        0..2,
+    ),
+    (
+        FStringMiddle {
+            value: "normal ",
+            is_raw: false,
+        },
+        2..9,
+    ),
+    (
+        Lbrace,
+        9..10,
+    ),
+    (
+        Name {
+            name: "foo",
+        },
+        10..13,
+    ),
+    (
+        Rbrace,
+        13..14,
+    ),
+    (
+        FStringMiddle {
+            value: " {another} ",
+            is_raw: false,
+        },
+        14..27,
+    ),
+    (
+        Lbrace,
+        27..28,
+    ),
+    (
+        Name {
+            name: "bar",
+        },
+        28..31,
+    ),
+    (
+        Rbrace,
+        31..32,
+    ),
+    (
+        FStringMiddle {
+            value: " {",
+            is_raw: false,
+        },
+        32..35,
+    ),
+    (
+        Lbrace,
+        35..36,
+    ),
+    (
+        Name {
+            name: "three",
+        },
+        36..41,
+    ),
+    (
+        Rbrace,
+        41..42,
+    ),
+    (
+        FStringMiddle {
+            value: "}",
+            is_raw: false,
+        },
+        42..44,
+    ),
+    (
+        FStringEnd,
+        44..45,
+    ),
+    (
+        Newline,
+        45..45,
+    ),
+]
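
In this snapshot the `FStringMiddle` ranges are wider than their values wherever the literal text contains `{{` or `}}`: the range covers the escaped source bytes, while the value holds the unescaped text. The sketch below checks that arithmetic. The source string is reconstructed from the token ranges and values, so treat it as an assumption, and `unescape_braces` is a hypothetical helper rather than the lexer's own code.

```rust
/// Hypothetical helper: collapses doubled braces in a literal f-string segment.
fn unescape_braces(raw: &str) -> String {
    raw.replace("{{", "{").replace("}}", "}")
}

fn main() {
    // Reconstructed from the snapshot's ranges; not copied from the test file.
    let source = r#"f"normal {foo} {{another}} {bar} {{{three}}}""#;
    assert_eq!(source.len(), 45);

    // (range start, range end, expected `FStringMiddle` value)
    let middles = [
        (2, 9, "normal "),
        (14, 27, " {another} "),
        (32, 35, " {"),
        (42, 44, "}"),
    ];

    for (start, end, value) in middles {
        let raw = &source[start..end];
        assert_eq!(unescape_braces(raw), value);
    }

    // The second middle token spans 13 bytes but holds an 11-byte value.
    assert_eq!(&source[14..27], " {{another}} ");
}
```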