-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add some basic subscript type inference #13562
Conversation
} | ||
} | ||
// TODO some slices on strings should resolve to type errors rather than literals | ||
(Type::StringLiteral(_), _) => Type::LiteralString, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the general philosophy around when we should error in a case like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few things to note here.
Issues with LiteralString
According to a strict reading of PEP 675, I don't think this is a legal place where we can infer LiteralString
, since the PEP states:
As per the Rationale, we also infer
LiteralString
in the following cases:[...]
- Literal-preserving methods: In Appendix C, we have provided an exhaustive list of
str
methods that preserve theLiteralString
type.
str.__getitem__
is not listed in Appendix C of the PEP, so I suppose that means that we're not allowed to infer LiteralString
for x
, y
or z
in a snippet like this:
def foo(a: Literal["foo"], b: LiteralString, c: int):
x = a[c]
y = b[42]
z = b[c]
I'm not sure what the status of Appendix C is, though... it doesn't seem to have been copied into the relevant section of the typing spec, so perhaps it isn't canonical anymore? Or maybe that was just an omission when writing the spec? There's also the question of typeshed's stubs: typeshed doesn't include a LiteralString
overload in its stub for str.__getitem__
, although it includes LiteralString
overloads for many of its other str
methods. I'm not sure if there's a reason for that or not -- even if not, it may be desirable to be in sync with typeshed here.
For now I think I'd just delete this branch -- continuing to fallback to Unknown
is ~fine for now.
Emitting diagnostics
In general, if we can't infer that an operation creates a Literal
type (whether that's LiteralString
or StringLiteral
), we should fall back to typeshed's stubs. Here that means falling back to builtins.str.__getitem__
-- you'd do something like value_ty.to_meta_type(db).member(db, "__getitem__").call(&[value_ty, slice_ty])
to figure out what the type of the subscript expression is. Except there are some complications ;) Take a look at Type::iterate
for a more complete example of how we lookup dunder methods elsewhere, try to call them, and emit a diagnostic if the dunder method eithere doesn't exist or isn't callable.
ruff/crates/red_knot_python_semantic/src/types.rs
Lines 629 to 684 in 5118166
/// Given the type of an object that is iterated over in some way, | |
/// return the type of objects that are yielded by that iteration. | |
/// | |
/// E.g., for the following loop, given the type of `x`, infer the type of `y`: | |
/// ```python | |
/// for y in x: | |
/// pass | |
/// ``` | |
fn iterate(self, db: &'db dyn Db) -> IterationOutcome<'db> { | |
if let Type::Tuple(tuple_type) = self { | |
return IterationOutcome::Iterable { | |
element_ty: UnionType::from_elements(db, &**tuple_type.elements(db)), | |
}; | |
} | |
// `self` represents the type of the iterable; | |
// `__iter__` and `__next__` are both looked up on the class of the iterable: | |
let iterable_meta_type = self.to_meta_type(db); | |
let dunder_iter_method = iterable_meta_type.member(db, "__iter__"); | |
if !dunder_iter_method.is_unbound() { | |
let CallOutcome::Callable { | |
return_ty: iterator_ty, | |
} = dunder_iter_method.call(db, &[]) | |
else { | |
return IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}; | |
}; | |
let dunder_next_method = iterator_ty.to_meta_type(db).member(db, "__next__"); | |
return dunder_next_method | |
.call(db, &[]) | |
.return_ty(db) | |
.map(|element_ty| IterationOutcome::Iterable { element_ty }) | |
.unwrap_or(IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}); | |
} | |
// Although it's not considered great practice, | |
// classes that define `__getitem__` are also iterable, | |
// even if they do not define `__iter__`. | |
// | |
// TODO(Alex) this is only valid if the `__getitem__` method is annotated as | |
// accepting `int` or `SupportsIndex` | |
let dunder_get_item_method = iterable_meta_type.member(db, "__getitem__"); | |
dunder_get_item_method | |
.call(db, &[]) | |
.return_ty(db) | |
.map(|element_ty| IterationOutcome::Iterable { element_ty }) | |
.unwrap_or(IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with the pedantic reading of PEP 675 (or typeshed) as a rationale for removing this branch. Many useful things are simply overlooked. If a more precise type inference is clearly correct in terms of the runtime semantics, and there's nothing in the typing spec or conformance suite explicitly prohibiting it, then I think we should feel totally free to go ahead and do it.
It is clearly correct wrt the runtime semantics to infer LiteralString
as the result of indexing into a LiteralString
or StringLiteral
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, given the security rationale for LiteralString
, there may be an argument that untrusted-user-supplied indices could allow turning a "safe" LiteralString
into an "unsafe" one via careful cropping. I have mixed feelings about this, because there's also type-system value to LiteralString
(we know it is really a built-in str
, not a subclass), and in type-system terms, the inference of LiteralString
is correct. And the security issue seems fairly contrived. But that might be why inference of LiteralString
on __getitem__
is not in the PEP?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is clearly correct wrt the runtime semantics to infer
LiteralString
as the result of indexing into aLiteralString
orStringLiteral
.
I agree, but the fact that PEP-675 is so clear that the list of string methods in Appendix C is valid is meant to be "exhaustive" made me worried that there was something I was misunderstanding :-) and until recently, PEP-675 was the spec -- it's not clear to me whether Appendix C was deliberately kept out of the typing spec or not.
This branch was anyway incorrect -- StringLiteral
cannot be indexed into with objects of arbitraty types, which is what this branch allowed before Charlie removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So to clarify... We need to add a branch for __getitem__
(in general -- I plan to do this in a separate PR), and slices into StringLiteral
(with non-int literal indexes) should be handled by that branch? Or do we want to resolve to LiteralString
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After further thought, I don't really buy the possible security argument I mentioned above for not inferring LiteralString
: you can easily construct many other cases where the contents of a LiteralString
, though always originating from a literal string in the source code, could be influenced in some way by untrusted user input (e.g. determining whether some concatenation does or doesn't occur.)
But that said, I don't think this is critical, and it is probably best to just rely on typeshed for this. So we could at some point consider proposing a change to typeshed with a LiteralString
overload for str.__getitem__
, but I don't think it's worth an extra special case in red-knot to diverge from typeshed here.
So yeah, I would say let's just add the branch that uses the return type from __getitem__
, as Alex suggested. (And yeah, fine for that to be a separate follow-up if you prefer.
(I suspect, but don't know, that the reason Appendix C isn't in the type spec is that it was considered to already be reflected in the typeshed change.)
|
} | ||
} | ||
// TODO some slices on strings should resolve to type errors rather than literals | ||
(Type::StringLiteral(_), _) => Type::LiteralString, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a few things to note here.
Issues with LiteralString
According to a strict reading of PEP 675, I don't think this is a legal place where we can infer LiteralString
, since the PEP states:
As per the Rationale, we also infer
LiteralString
in the following cases:[...]
- Literal-preserving methods: In Appendix C, we have provided an exhaustive list of
str
methods that preserve theLiteralString
type.
str.__getitem__
is not listed in Appendix C of the PEP, so I suppose that means that we're not allowed to infer LiteralString
for x
, y
or z
in a snippet like this:
def foo(a: Literal["foo"], b: LiteralString, c: int):
x = a[c]
y = b[42]
z = b[c]
I'm not sure what the status of Appendix C is, though... it doesn't seem to have been copied into the relevant section of the typing spec, so perhaps it isn't canonical anymore? Or maybe that was just an omission when writing the spec? There's also the question of typeshed's stubs: typeshed doesn't include a LiteralString
overload in its stub for str.__getitem__
, although it includes LiteralString
overloads for many of its other str
methods. I'm not sure if there's a reason for that or not -- even if not, it may be desirable to be in sync with typeshed here.
For now I think I'd just delete this branch -- continuing to fallback to Unknown
is ~fine for now.
Emitting diagnostics
In general, if we can't infer that an operation creates a Literal
type (whether that's LiteralString
or StringLiteral
), we should fall back to typeshed's stubs. Here that means falling back to builtins.str.__getitem__
-- you'd do something like value_ty.to_meta_type(db).member(db, "__getitem__").call(&[value_ty, slice_ty])
to figure out what the type of the subscript expression is. Except there are some complications ;) Take a look at Type::iterate
for a more complete example of how we lookup dunder methods elsewhere, try to call them, and emit a diagnostic if the dunder method eithere doesn't exist or isn't callable.
ruff/crates/red_knot_python_semantic/src/types.rs
Lines 629 to 684 in 5118166
/// Given the type of an object that is iterated over in some way, | |
/// return the type of objects that are yielded by that iteration. | |
/// | |
/// E.g., for the following loop, given the type of `x`, infer the type of `y`: | |
/// ```python | |
/// for y in x: | |
/// pass | |
/// ``` | |
fn iterate(self, db: &'db dyn Db) -> IterationOutcome<'db> { | |
if let Type::Tuple(tuple_type) = self { | |
return IterationOutcome::Iterable { | |
element_ty: UnionType::from_elements(db, &**tuple_type.elements(db)), | |
}; | |
} | |
// `self` represents the type of the iterable; | |
// `__iter__` and `__next__` are both looked up on the class of the iterable: | |
let iterable_meta_type = self.to_meta_type(db); | |
let dunder_iter_method = iterable_meta_type.member(db, "__iter__"); | |
if !dunder_iter_method.is_unbound() { | |
let CallOutcome::Callable { | |
return_ty: iterator_ty, | |
} = dunder_iter_method.call(db, &[]) | |
else { | |
return IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}; | |
}; | |
let dunder_next_method = iterator_ty.to_meta_type(db).member(db, "__next__"); | |
return dunder_next_method | |
.call(db, &[]) | |
.return_ty(db) | |
.map(|element_ty| IterationOutcome::Iterable { element_ty }) | |
.unwrap_or(IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}); | |
} | |
// Although it's not considered great practice, | |
// classes that define `__getitem__` are also iterable, | |
// even if they do not define `__iter__`. | |
// | |
// TODO(Alex) this is only valid if the `__getitem__` method is annotated as | |
// accepting `int` or `SupportsIndex` | |
let dunder_get_item_method = iterable_meta_type.member(db, "__getitem__"); | |
dunder_get_item_method | |
.call(db, &[]) | |
.return_ty(db) | |
.map(|element_ty| IterationOutcome::Iterable { element_ty }) | |
.unwrap_or(IterationOutcome::NotIterable { | |
not_iterable_ty: self, | |
}) | |
} |
Okay thanks, will re-request review once I've addressed feedback. |
Actually, it should be fine to merge in its current state. I'll look at doing the dunder fallback and diagnostics in a separate PR to keep this small. |
ba9ee03
to
13561b1
Compare
3e425a7
to
ef5e252
Compare
Added diagnostics, though unsure if there are established patterns around the rule names, granularity, etc. |
It's a bit of a wild west right now, I wouldn't worry about that too much! I'm sure we'll clean them up at some point (and make the diagnostics less stringly typed etc.). Micha's proposals he shared during the on-site go some way to addressing this. |
Sounds good, I figured as much :) |
ef5e252
to
8b9710c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thank you! One minor naming nit, and the issue of negative indexing.
// TODO handle variable-length tuples | ||
(Type::Tuple(tuple_ty), Type::IntLiteral(int)) => { | ||
let elements = tuple_ty.elements(self.db); | ||
usize::try_from(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python allows negative indexing, which indexes from the end of the container:
>>> t = (1, 2, 3)
>>> t[-1]
3
So we should handle that here. I would say we don't have to do it in this PR, and we can just add a TODO, but I don't like that in the meantime this would mean we'd wrongly emit an out-of-bounds error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll handle it here, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! LGTM.
I think there would be a possible implementation where we extract a helper function for the actual indexing-or-None for each type, and then the negative case could just add .len()
to the index and try that function. But there's really not that much duplication in this version, so I'm not sure that would be worth it, happy with this as-is.
Thanks for the thorough reviews, much appreciated! I'll look at the |
## Summary Follow-up to #13562, to add support for "arbitrary" subscript operations.
Summary
Just for tuples and strings -- the easiest cases. I think most of the rest require generic support?