
[eradicate] ignore # language= in commented-out-code rule (ERA001) #14069

Merged · 1 commit · Nov 3, 2024
19 changes: 18 additions & 1 deletion crates/ruff_linter/src/rules/eradicate/detection.rs
@@ -16,7 +16,7 @@ static CODE_INDICATORS: LazyLock<AhoCorasick> = LazyLock::new(|| {

static ALLOWLIST_REGEX: LazyLock<Regex> = LazyLock::new(|| {
    Regex::new(
-        r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))",
+        r"^(?i)(?:pylint|pyright|noqa|nosec|region|endregion|type:\s*ignore|fmt:\s*(on|off)|isort:\s*(on|off|skip|skip_file|split|dont-add-imports(:\s*\[.*?])?)|mypy:|SPDX-License-Identifier:|language=[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+(?:\s+prefix=\S+)?(?:\s+suffix=\S+)?|(?:en)?coding[:=][ \t]*([-_.a-zA-Z0-9]+))",
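As a quick sanity check (a sketch, not part of the PR), the new language= alternation can be exercised on its own with Python's re module. The inline (?i) flag becomes re.IGNORECASE, since Python requires global flags at the start of a pattern; the real pattern runs under Rust's regex crate.

```python
import re

# The `language=` alternation added in this diff, transcribed for Python's
# `re` module (sketch only).
LANGUAGE_INJECTION = re.compile(
    r"^language=[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+"
    r"(?:\s+prefix=\S+)?(?:\s+suffix=\S+)?",
    re.IGNORECASE,
)

# Allowlisted: looks like an injection comment, so ERA001 stays quiet.
assert LANGUAGE_INJECTION.match("language=xml")
assert LANGUAGE_INJECTION.match("language=HTML prefix=<body> suffix=</body>")
assert LANGUAGE_INJECTION.match("language=ecma script level 4")

# Not allowlisted: quoted or numeric values still look like code.
assert not LANGUAGE_INJECTION.match("language=123")
assert not LANGUAGE_INJECTION.match("language='en'")
```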
Contributor:
This nested quantifier doesn't look so nice: [a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+. The optional space at the start of the group leads to multiple ways of matching the same text.

I know Rust's regex engine guarantees linear time, but this is nevertheless a bad pattern. Prefer [a-zA-Z][-_.a-zA-Z0-9]*(?: [-_.a-zA-Z0-9]+)* instead.

Technically, language IDs can be anything, so it's perhaps better to drop the leading character requirement: [-_.a-zA-Z0-9]+(?: [-_.a-zA-Z0-9]+)*.
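A quick comparison of the three patterns discussed here (a sketch in Python's re, not part of the PR) shows they are close but not identical: the flat rewrite also accepts single-character IDs, and dropping the leading-character requirement additionally accepts IDs that start with a digit.

```python
import re

# Anchored with $ so we compare what each pattern accepts as a full ID.
nested = re.compile(r"^[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+$")              # current
flat = re.compile(r"^[a-zA-Z][-_.a-zA-Z0-9]*(?: [-_.a-zA-Z0-9]+)*$")  # suggested
loose = re.compile(r"^[-_.a-zA-Z0-9]+(?: [-_.a-zA-Z0-9]+)*$")         # no leading-letter rule

# All three accept multi-word IDs:
assert nested.match("ecma script level 4")
assert flat.match("ecma script level 4")
assert loose.match("ecma script level 4")

# The flat rewrite also accepts a single-character ID, which `nested` rejects:
assert flat.match("x") and not nested.match("x")

# Only the loose variant accepts an ID starting with a digit:
assert loose.match("3d") and not flat.match("3d")
```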

Member:

Should we just use language=? All language identifiers will start with that, and false positives seem somewhat rare / not a huge deal here.

Contributor Author:

@InSyncWithFoo Maybe I was being extra cautious about false positives.

Technically, language IDs can be anything,

Good catch! Maybe we can just require a non-space character after the equals sign. This should take care of most false positives anyway, as it's uncommon for Python users to omit spaces around assignment operators.

The only thing that worries me is detecting a commented-out named parameter in a multi-line method invocation or declaration. But maybe I'm just being overzealous.

def check_spelling(
    text,
    # language=DEFAULT_LANGUAGE
): ...
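The worry can be reproduced with a sketch (a hypothetical "non-space after the equals sign" relaxation as floated above, not the pattern this PR merges): such a pattern would indeed allowlist that commented-out parameter.

```python
import re

# Hypothetical relaxation: only require a non-space character after `=`.
loose = re.compile(r"^language=\S+", re.IGNORECASE)

assert loose.match("language=DEFAULT_LANGUAGE")  # commented-out parameter: skipped
assert loose.match("language=xml")               # genuine injection comment: skipped
```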

Contributor:

A quick search shows that matching on language= alone would lead to many false negatives.

Language IDs are also displayed to the user as part of the UI, so I doubt they would contain non-ASCII characters. I would say limiting to [-_.a-zA-Z0-9] and spaces is the most balanced heuristic.
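For illustration (a sketch, not from the review): treating the bare prefix as the whole heuristic cannot distinguish an injection comment from a commented-out keyword argument, which is exactly the false-negative risk.

```python
# Sketch of the rejected heuristic: allowlist any comment starting with
# `language=`. It cannot tell injection comments and commented-out keyword
# arguments apart, so real commented-out code would be skipped.
def bare_prefix_allowlisted(comment: str) -> bool:
    return comment.startswith("language=")

assert bare_prefix_allowlisted("language=xml")   # injection comment: correctly skipped
assert bare_prefix_allowlisted("language='en'")  # commented-out code: wrongly skipped
```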

Contributor @InSyncWithFoo (Nov 4, 2024):

[...] it's uncommon for Python users to omit spaces around assignment operators.

Don't forget keyword arguments, which are recommended to be written with no spaces around the equal sign:

# frobnicate(
#     language='en'  # Could also be `language=EN` with a predefined constant `EN`
# )

I think both this and the false negative you mention are well within the acceptable error margin.
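Both sides of that trade-off can be checked against the merged alternation (transcribed to Python's re as a sketch): quoted keyword-argument values are still flagged, while a bare constant slips through.

```python
import re

# The merged `language=` alternation, transcribed for Python's `re` (sketch).
pattern = re.compile(r"^language=[a-zA-Z](?: ?[-_.a-zA-Z0-9]+)+", re.IGNORECASE)

# Quoted kwarg values don't start with a letter, so they're still flagged:
assert not pattern.match("language='en'")
# A bare constant is indistinguishable from an injection ID and is skipped:
assert pattern.match("language=EN")
```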

Contributor Author:

So @InSyncWithFoo, will you write the PR with the new regex and test cases, or should I?

Contributor:

I'll handle it.

Contributor @InSyncWithFoo (Nov 4, 2024):

Turns out language IDs in comments must be normalized: jsonlines must be used, even though autocompletion popups show json lines. The distinction matters because JSON Lines (one value on each line, comments allowed) and pure JSON (one single top-level value, comments disallowed) are different languages with different IDs.

Notably, this normalization rule applies to some languages but not others (though Ruff doesn't need to care about this).

    )
    .unwrap()
});

@@ -297,6 +297,23 @@ mod tests {
));
}

    #[test]
    fn comment_contains_language_injection() {
        assert!(comment_contains_code("# language=123", &[]));
        assert!(comment_contains_code("# language=\"pt\"", &[]));
        assert!(comment_contains_code("# language='en'", &[]));

        assert!(!comment_contains_code("# language=xml", &[]));
        assert!(!comment_contains_code(
            "# language=HTML prefix=<body> suffix=</body>",
            &[]
        ));
        assert!(!comment_contains_code(
            "# language=ecma script level 4",
            &[]
        ));
    }

    #[test]
    fn comment_contains_todo() {
        let task_tags = TASK_TAGS