proposal: a simplified and generalized invalid-name config #7305

bukzor · 2022-08-13T19:34:08Z

Current problem

The current configuration system for naming style in pylint can be seen as simultaneously overly rigid and overly flexible. There are two distinct ways to configure naming schemes. Primarily there is a named-scheme selection where users choose among lower_case UPPER_CASE PascalCase, etc., and there is a regex-based (-rgx) configuration where the user meant to write a regular expression that matches all (and only) permissible variable names.

As a concrete example:

[BASIC]
const-naming-style=UPPER_CASE
# from https://github.com/openqasm/openqasm/blob/main/.pylintrc
variable-rgx=[a-z_][a-z0-9_]{2,30}$

The named-style is convenient for users, but gives no clarity as to what exactly will match, and admits no ability to make small tweaks to its definition. The regex configuration is its near opposite, being completely unambiguous and easy to modify but painful-to-impossible to use for the average human.

Desired solution

This is my (tentative!) proposal: (keep in mind that all details will/would be modified to reflect maintainers' requirements)

Let's move this configuration out of "basic" to a new, separate "naming" namespace.

[NAMING]
variable = lower_case

[NAMING.lower_case]
allowed-characters = [a-zA-Z0-9_]
allowed-first-character = [a-z]
min-length = 3
max-length  = 0  # no limit

Or, alternatively: (I prefer the above)

[NAMING]
upper-case.allowed-characters = [A-Z0-9_]
upper-case.allowed-first-character = [A-Z]
upper-case.min-length = 0  # no limit
upper-case.max-length  = 30

Each naming style is be defined by just four characteristics: allowed characters, allowed first-letter character, minimum and maximum length. If we can expose those quadruplets to the configuration, then the configuration is easy-to-use (users can largely ignore these fine details), unambiguous, easy to modify, and there's no (obvious) need to expose users to the full footgun of regular expressions.

Note: The "character" configs are required to be a single character-class (i.e. a single-char regex beginning and ending with square brackets). This limitation allows excellent error messaging in the case of a invalid-name error, but also enables programmatic use.

This obviously enables the potential (but not requirement) to allow users to define custom named naming-schemes.

All naming schemes would then be defined as (approximately) f'_{{0,2}}{allowed-first-character}{allowed-characters}{{{min-length},{max-length}}}'

I believe the regular-expression-based configuration could then be phased out entirely, but could of course be retained if wanted as a fifth, optional regex attribute of the naming scheme with default None.

Additional context

This idea was split out of #3704, which considers revamping the invalid-name checker.

I (@bukzor) am volunteering to implement this change, if ratified.

The text was updated successfully, but these errors were encountered:

DanielNoord · 2022-08-14T11:41:45Z

Thanks for the clear proposal @bukzor!

I'm not sure though whether I agree with the changes you propose though. I understand that regular expressions can be hard to grasp for (new) developers, but adding to the complexity by adding 4 new subsettings for each type of "nameable" doesn't seem like it solves the issue of being "overly rigid and overly flexible".
Why would we give our users less possibilities (you can no longer write a regular expression yourself) only to have them avoid writing a regular expression once?

Pierre-Sassoulas · 2022-08-14T13:35:02Z

Thank you for the recap @bukzor.

I think building on what exists right now and fixing the current issues is safer in this case. Redesigning something entirely is a very risky endeavor. We have insight on what is wrong with what we currently have, but not what will be wrong on something we design from scratch. Seeing how much pylint is used there will be something wrong. Every-time we add a new checker we can expect issues to pop up. This is the case for checker affecting a particular statement like consider-using-with that trigger only on context manager... invalid-name is triggering on every name and we're going to miss a dozen edge cases whatever we do (especially with 4 new options added).

I think the major pain point of invalid-name is that handling short name is hard with a regex and the regex also hide the reason why the name was rejected (#2018, the second most upvoted issue in the repo and the message that personally irk me every time I create a new project that prevent me from using pylint without configuration and then prevent me from actually checking for proper PEP8 naming).

So in my opinion what we need is fix #2018 the simplest way we can : For that we need to separate short name and a regex that does not match logically in the code. This is already a lot of work to get it right, as we need to not break any existing disable/enable, refactor a lot of code and functional tests and upgrade the regex for all the name style. I started to work on it for maybe 10 hours before my disk crashed and loosing the code and it was half done at best.

Mostly copy pasting my comment from the previous discussion:

A name that is in the list of accepted name should be ok for both checks.
We need to decide if we raise both invalid-name and name-too-short for name that are violating both. I think we should raise both for consistency but the regex need to be updated as they're not matching valid name that are too short right now.
Also If we raise both we need to decide what happens for name that are short and ambiguous (A is both UPPERCASE, UPPERCASE_WITH_UNDERSCORE and PascalCase, a is both camelCase and snake_case). Right now I think we should allow both. Also ab should be valid snake case but too short, and aB should be invalid and too short, Ab should be valid PascalCase, but not ab.

snake_case name	A	a	Ab	ab	Abcd	abcd
invalid-name	yes	no	yes	no	yes	no
name-too-short	yes	yes	yes	yes	no	no

We need to decide what happens when the user is not using the default regex (maybe they handled name too short in it already?)
There's also name-too-longfor name > 30 characters that was included in the regex and I forgot about it because I never hit it myself

bukzor · 2022-08-14T19:31:52Z

Thanks for consideration. You can close this if you like. I may reopen this discussion when/if other changes have already landed.

bukzor · 2022-08-14T19:48:50Z

@DanielNoord

Perhaps what I failed to explain is that these four "subsettings" of each named style already exist, in the current main branch. It's only that they're not factored as such and not exposed to the configuration system. I intend to (with your approval, as a successor to #3704 ) refactor as such, so the only change proposed here is to expose those hard-coded parameters as default configuration.

regular expressions can be hard to grasp for (new) developers

I believe you underestimate this issue dramatically. I estimate more than 75% of self-described "experienced developers" would fail to write a bug free regex for one of these, given three attempts. While regular expressions are very compact, each token (very nearly each character) doubles the state space, which gets beyond human capability very quickly. Consider the number of responsibilities and edge cases handled by the default name regular expression: ([^\W\dA-Z][^\WA-Z]{2,}|_[^\WA-Z]*|__[^\WA-Z\d_][^\WA-Z]+__)$. I think if you asked ten people to count them you'd get ten answers.

I've probably overstated this point dramatically, but I'm only trying to be clear :)

... adding to the complexity by adding 4 new subsettings for each type of "nameable" ...

I believe the total complexity remains the same. It's only that complexity has been shifted away from the user and into the system (where it ideally belongs). Each of the four new subsettings is entirely simple (has a singular, obvious purpose, as opposed to complex).

... doesn't seem like it solves the issue of being "overly rigid and overly flexible".

By boiling down the naming schemes to their essential, atomic parameters and leaving all other considerations to the framework, the system is quite flexible (all current naming schemes and millions more can be represented) but not overly so: it's no longer possible to accidentally disallow __init__ as a module, and we gain the (optional) ability to verify that the user hasn't allowed numbers as the first character or dashes in the name, etc.

you can no longer write a regular expression yourself

That's only an optional part of this proposal. It's not necessary to remove the -rgx configuration. I only meant to say that I believe it would serve no purpose and so could be removed.

That said, the user-facing gains are relatively minor and the implementation cost is not.

bukzor · 2022-08-14T20:02:21Z

@Pierre-Sassoulas

we're going to miss a dozen edge cases whatever we do (especially with 4 new options added).

I think you've neglected to consider that the four proposed options have just four degrees of freedom, while the current regular expressions have (much more than) thousands of degrees of freedom. This means that in the proposed scheme it's much easier to consider (and test) the various possiblities, because there are so many orders of magnitude fewer.

I think the major pain point of invalid-name is that handling short name is hard with a regex and the regex also hide the reason why the name was rejected (#2018, the second most upvoted issue in the repo and the message that personally irk me every time I create a new project that prevent me from using pylint without configuration and then prevent me from actually checking for proper PEP8 naming).

I didn't add this because I thought it was incidental, and the proposal already got long, but this proposal fixes #2018 very neatly. Because allowed-letters and allowed-first-letter are parameters (e.g. allowed-letters=[a-z0-9_] allowed-first-letter=[a-z] for snake_case) we can provide extremely clear and specific errors now, such as the parameter name 'Baz' does not conform to snake_case style: the first letter 'B' must be in the range '[a-z]' or the constant name 'MyNumber' does not conform to UPPER_CASE style: character 2 'y' is not in range '[A-Z_]'. You could of course get extra fancy with coloring and ascii-art, but I believe this would be enough for you to become un-irked.

we need to separate short name and a regex that does not match logically in the code. This is already a lot of work to get it right

I could work on that before the rest, then, if you like.

DanielNoord · 2022-08-14T20:51:29Z

I think your reference to degrees of freedom is very helpful for the discussion. What I'm not particularly keen on in your current proposal is that we remove the option with the unlimited degrees of freedom. I think that customisability is a crucial part of pylint and should remain. That said, I can see the value in reducing the amounts of freedom for regular users.

Do you think there would be a way to have -rgx and one or two other options achieve the same? Currently we have 12 types of "nameables". With this proposal we would add 12*4=48 new options, which imo is a bit much.
Note that we can't really use configuration namespaces due to the old config types we also need to support so they will need to be --typevar-naming-min-length or something like that.

Pierre-Sassoulas · 2022-08-15T07:30:16Z

I believe the total complexity remains the same. It's only that complexity has been shifted away from the user and into the system (where it ideally belongs).

Yeah, that and the degree of freedom (that we were dumping on the user so it's not "our" problem) convinced me. Also two small regexes still give a as much freedom, unless I'm mistaken. Nothing prevent someone from just dumping their arbitrarily complex regex in the allowed characters option, they'll just have to migrate the first letter handling in the first letter option and the length handling in the length options.

we can provide extremely clear and specific errors now

It would be great for sure.

Currently we have 12 types of "nameables". With this proposal we would add 12*4=48 new options, which imo is a bit much.

As long as there are default values for PEP8 and most users do not use those options, I don't think it's an issue.

Maybe at first we could start by:

Adding the too-[short|long]-name messages and options
Remove the length part of the default regexes
Make those options mutually exclusive with old options that are incompatible
(Warn the user that use quantifier {3-30} in their non default regex on how to migrate?)

Then later (next minor?):

Add the first letter and allowed letters options
Make them mutually exclusive with old options that are incompatible
Make the new options the default
Deprecate the old regex option in favor of the two new simpler regexes
Warn the user that do not use the default value with an explanation message on how to migrate

Then in 3.0:

remove the old regex as a breaking change.

I think we need to maintain the old regex for some time but maintaining 36 regexes forever is not something I look forward to 😄

Pierre-Sassoulas · 2022-08-15T07:33:54Z

I just though that the regex allowed to have different length criteria for test functions (i.e. allow super verbose function name if it starts by test_). It's possible to keep doing that with new options by checking that allowed letter start by est_ but it's hacky.

I'm probably missing other use cases. Maybe an argument to keep the old regex and not deprecate.

Wehzie · 2022-09-07T16:56:00Z

Could pylint adapt naming conventions of mainstream packages and add those as defaults to an exception list whenever these are imported?

For example, consider the purpose of data science and machine learning, as in this issue.

A typical data science project will use some of the following projects,
pandas -> add "df"
maplotlib -> add "ax"
sklearn -> add "X" and "y"
scipy -> add "f", ..., "fn", "t", "x", "y"

In case a user wants scipy defaults, except for example "f", there should be a possibility for that.

bukzor added the Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling label Aug 13, 2022

Pierre-Sassoulas added Enhancement ✨ Improvement to a component Needs specification 🔐 Accepted as a potential improvement, and needs to specify edge cases, message names, etc. and removed Needs triage 📥 Just created, needs acknowledgment, triage, and proper labelling labels Aug 13, 2022

This was referenced Aug 19, 2022

Add invalid-name check for TypeAlias names #7116

Merged

variable-naming-style toml configuration: only a single style is allowed? #7352

Open

This was referenced Nov 21, 2022

Show why snake_case errors is emitted on single word variables (might be too small) #2018

Closed

Update disallowed-name to flag module-level variables #7808

Merged

Bug: Variable name "tx" doesn't conform to snake_case naming style(invalid-name) #2970

Closed

Pierre-Sassoulas mentioned this issue Dec 21, 2022

inlinevar-rgx seems to react to scope #7969

Closed

Pierre-Sassoulas mentioned this issue Jan 29, 2023

Pylint calls an issue with variables with one character (like x or x1) for the camelCase check #8118

Closed

This was referenced Feb 18, 2023

snake_case (invalid-name) for single letter variables in classes #8309

Closed

Prevent emitting invalid-name #8337

Merged

Pierre-Sassoulas mentioned this issue Jul 4, 2023

Stop checking minimum name length by default #8813

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

proposal: a simplified and generalized invalid-name config #7305

proposal: a simplified and generalized invalid-name config #7305

bukzor commented Aug 13, 2022

DanielNoord commented Aug 14, 2022

Pierre-Sassoulas commented Aug 14, 2022 •

edited

Loading

bukzor commented Aug 14, 2022

bukzor commented Aug 14, 2022

bukzor commented Aug 14, 2022

DanielNoord commented Aug 14, 2022

Pierre-Sassoulas commented Aug 15, 2022 •

edited

Loading

Pierre-Sassoulas commented Aug 15, 2022 •

edited

Loading

Wehzie commented Sep 7, 2022

proposal: a simplified and generalized invalid-name config #7305

proposal: a simplified and generalized invalid-name config #7305

Comments

bukzor commented Aug 13, 2022

Current problem

Desired solution

Additional context

DanielNoord commented Aug 14, 2022

Pierre-Sassoulas commented Aug 14, 2022 • edited Loading

bukzor commented Aug 14, 2022

bukzor commented Aug 14, 2022

bukzor commented Aug 14, 2022

DanielNoord commented Aug 14, 2022

Pierre-Sassoulas commented Aug 15, 2022 • edited Loading

Pierre-Sassoulas commented Aug 15, 2022 • edited Loading

Wehzie commented Sep 7, 2022

Pierre-Sassoulas commented Aug 14, 2022 •

edited

Loading

Pierre-Sassoulas commented Aug 15, 2022 •

edited

Loading

Pierre-Sassoulas commented Aug 15, 2022 •

edited

Loading