-
-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add plugin to infer more precise regex match types #7803
base: master
Are you sure you want to change the base?
Conversation
This pull request adds a plugin to make mypy infer more precise types when grabbing regex groups: the plugin will when possible analyze original regex to deduce whether a given group is required or not. ``` from typing_extensions import Final, Literal import re pattern: Final = re.compile("(a)(b)*") match: Final = pattern.match("") if match: reveal_type(match.groups()) # Revealed type is Tuple[str, Optional[str]] reveal_type(match.group(0)) # Revealed type is str reveal_type(match.group(1)) # Revealed type is str reveal_type(match.group(2)) # Revealed type is Optional[str] index: int reveal_type(match.group(index)) # Revealed type is Optional[str] # Error: Regex has 3 total groups, given group number 5 is too big match.group(5) ``` To track this information, I added in an optional 'metadata' dict field to the Instance class, similar to the metadata dict for plugins in TypeInfos. We skip serializing this dict if it does not contain any data. A limitation of this plugin is that both the pattern and the match variables must be declared to be final. Otherwise, we just default to using whatever types are defined in typeshed. This is because we set and erase the metadata field in exactly the same way we set and erase the `last_known_value` field in Instances: both kinds of info are "transient" and are unsafe to keep around if the variable reference is mutable. This limitation *does* end up limiting the usefulness of this plugin to some degree: it won't support common patterns like the below, since variables aren't allowed to be declared final inside loops: ``` for line in file: match = pattern.match(line) if match: ... ``` Possibly we can remove this limitation by making mypy less aggressive about removing this transient info by tracking the "lifetime" of this sort of data in some way? This pull request should mostly address python#7363, though it's unclear if it really fully resolves it: we might want to do something about the limitation described above and re-tune typeshed first. The other mostly unrelated change this PR makes is to refactor some of the helper functions in checker.py into typeops.py so I could use them more cleanly in the plugin.
One other note: it's possible that this plugin is a bit too intrusive to include in mypy itself. While the actual regex-related code is pretty lightweight, it does require us to now have to worry about interfacing with the For example, this PR is currently blocked by python/typeshed#3412. So maybe this is better off being spun out into a separate plugin? Not sure. |
FWIW I think everything related to stdlib (like |
I think that the requirement to use A few other random notes:
|
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Diff from mypy_primer, showing the effect of this PR on open source code: psycopg (https://github.com/psycopg/psycopg)
+ psycopg/psycopg/types/range.py:467: error: Invalid regex: unbalanced parenthesis [misc]
zulip (https://github.com/zulip/zulip)
+ zerver/openapi/markdown_extension.py:176: error: Statement is unreachable [unreachable]
+ zerver/lib/markdown/__init__.py:890: error: Statement is unreachable [unreachable]
+ zerver/lib/compatibility.py:53: error: Statement is unreachable [unreachable]
sympy (https://github.com/sympy/sympy)
+ sympy/parsing/mathematica.py:93: error: Invalid regex: unbalanced parenthesis
+ sympy/parsing/mathematica.py:102: error: Invalid regex: missing ), unterminated subpattern
+ sympy/parsing/mathematica.py:122: error: Invalid regex: unterminated character set
porcupine (https://github.com/Akuli/porcupine)
+ porcupine/plugins/geometry.py:14: error: Statement is unreachable [unreachable]
+ porcupine/plugins/editorconfig.py:150: error: Statement is unreachable [unreachable]
+ porcupine/plugins/editorconfig.py:176: error: Statement is unreachable [unreachable]
dragonchain (https://github.com/dragonchain/dragonchain)
- dragonchain/broadcast_processor/broadcast_functions.py:95:32: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:315:19: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:317:25: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
- dragonchain/lib/authorization.py:340:31: error: Item "None" of "Optional[Match[str]]" has no attribute "group"
aiohttp (https://github.com/aio-libs/aiohttp)
+ aiohttp/multipart.py:812: error: Invalid regex: bad character range \\-. [misc]
streamlit (https://github.com/streamlit/streamlit)
+ lib/streamlit/git_util.py: note: In member "get_repo_info" of class "GitRepo":
+ lib/streamlit/git_util.py:163:13: error: Statement is unreachable [unreachable]
|
|
This pull request adds a plugin to make mypy infer more precise types when grabbing regex groups: the plugin will when possible analyze original regex to deduce whether a given group is required or not.
To track this information, I added in an optional 'metadata' dict field to the Instance class, similar to the metadata dict for plugins in TypeInfos. We skip serializing this dict if it does not contain any data.
A limitation of this plugin is that both the pattern and the match variables must be declared to be final. Otherwise, we just default to using whatever types are defined in typeshed.
This is because we set and erase the metadata field in exactly the same way we set and erase the
last_known_value
field in Instances: both kinds of info are "transient" and are unsafe to keep around if the variable reference is mutable.This limitation does end up limiting the usefulness of this plugin to some degree: it won't support common patterns like the below, since variables aren't allowed to be declared final inside loops:
Possibly we can remove this limitation by making mypy less aggressive about removing this transient info by tracking the "lifetime" of this sort of data in some way?
This pull request should mostly address #7363, though it's unclear if it really fully resolves it: we might want to do something about the limitation described above and re-tune typeshed first.
The other mostly unrelated change this PR makes is to refactor some of the helper functions in checker.py into typeops.py so I could use them more cleanly in the plugin.