-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Add metadata validation #1562
Conversation
As suggested in #1390 (comment) I've added the information about which field should satisfy which validation to the new I still have to write the test for the new module. This is meant as WIP, so I can have the code peer-reviewed more quickly. (Thank you for your understanding!) |
Looks like I have to make the tests pass for Python 2. Also, the Exception details son't seem to be printed out in Python 2 -- while they do in Python 3, e.g. E setuptools.validation.InvalidMetadataError: [('provides_extras', AssertionError("is of type <class 'set'> but should be of type list(str)",))] versus Python 2: E InvalidMetadataError
setuptools/validation.py:97: InvalidMetadataError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the idea of adding any sort of field validation to the public interface of setuptools
, and I see no reason to wrap the DistributionMetadata
in a validation.Metadata
class.
If we're going to continue on this path, we should put validators in a setuptools._validation
module. I would start with adding a validate_metadata
function that acts on DistributionMetadata
objects, though in the long run we may need to tweak how precisely it works.
Also, all that said, I think @di's suggestion of putting a metadata validator in packaging
instead of in setuptools
is a good idea. This is a function that all build backends will likely need.
@@ -55,17 +56,41 @@ def get_metadata_version(dist_md): | |||
def write_pkg_file(self, file): | |||
"""Write the PKG-INFO format data to a file object. | |||
""" | |||
metadata = Metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you wrapping DistributionMetadata
in a Metadata
object? Even if there were not other problems with this approach, I would think that the validator could work just as well on DistributionMetadata
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. You're right.
On the other hand, consider the validation functionality suggested by @di: it requires that you pass a dictionary to the validation class constructor. How would you solve this with the DistributionMetadata
constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe @di was referring to more generic functionality for a package other than setuptools
, so it depends on the details there. One option is to just generate a non-validated PKG-INFO and pass it to a function that validates PKG-INFO
files.
Another is to have an intermediate format that represents PKG-INFO
as JSON (thus retaining some type information and avoiding the problem where content and format are indistinguishable) which is convertable by packaging
into a PKG-INFO
file. packaging
could then validate the JSON file before a PKG-INFO
file is generated.
If we're already trying to validate a DistributionMetadata
file, then I would just validate the results of the actual getter operations.
setattr(self, key, kwargs[key]) | ||
self.fields += [key] | ||
|
||
def validate(self, throw_exception=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need for throw_exception
, if people want to suppress the exceptions from this function, they can do:
try:
metadata.validate()
invalid = False
except InvalidMetadataError:
invalid = True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that @di suggested to add validation functionality that behaves like validate().errors
, so that it integrates nicely into Warehouse code.
I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note that @di suggested to add validation functionality that behaves like validate().errors, so that it integrates nicely into Warehouse code.
He suggested no such thing. He suggested that something like the warehouse functionality go into a different project, pypa/packaging, not this one. I think you may have misunderstood him.
I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.
This is in general not an example of good design, because there are now two different error handling pathways. Additionally, the Pythonic way to signal an error condition is to raise an exception. It is not clumsy and in fact it leads to a lot of clumsy code in downstream consumers.
The reason is that in most cases, the functions calling your function can do nothing about an error except abort early and notify the user, if you raise an exception in error conditions, anyone who isn't going to try to recover from the error can just use your function as if it always works because the exception you raise will bubble up the stack until someone either catches it or the program is aborted and the user is notified of the error. If you return an error code instead, you are relying on your consumers to know that your function can either return an error code of some sort or not and raise an error or a handler.
"""A string that can contain newline characters""" | ||
if val is None: | ||
return | ||
assert isinstance(val, str), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use assert
for control flow. It is disabled when python is run optimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be similarly elegant? I don't want to do an if foo: raise Bar()
in all that many places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert
is not intended to be an "elegant" shorthand for raising an exception, it's a way of programmatically making assertions about how your code should work. You use it for things like enforcing a function's contract or documenting assumptions made by the programmer (which are then validated at runtime).
In this specific case, you cannot use it because it will not work. Assertions are to be treated as optional statements and if you run Python in optimized mode, they will simply not run at all, for example:
$ python -Oc "assert False"
$ python -c "assert False"
Traceback (most recent call last):
File "<string>", line 1, in <module>
AssertionError
In this particular case, I would either raise ValueError
or convert validators to functions returning a boolean, depending on the details. I'm not sure it's worth working out the details in this case, though, because as mentioned elsewhere this functionality rightly belongs in another library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ssbarnea I think at this point this PR needs a much bigger overhaul, and most of it may need to live in this repo. Probably best to hold off on cleaning up this specific code until the overhaul is done.
Agree with @pganssle here, there should be no actual validation logic added to |
I agree with you. There are some things I still need to clarify for myself, though:
TL;DR
|
@bittner Regarding point 4, there are many points you may want to validate the metdata:
The metadata fields and formats are a standard, and as such it is likely that anyone who uses it may want to be able to know the difference between metadata that is compliant with the standard and not compliant. We could write separate validation logic for each of these (and possibly one or more of them will re-implement validation), but it may be a needless duplication of effort, and you may end up with metadata producers generating something that metadata consumers consider invalid if one is more strict than the other. Regarding your questions:
This depends on the nature of the common validator, but I think so, yes.
I don't know what you mean by "validation business logic", but I'm guessing you mean the If it were me designing it, I would probably design an interface that looks like this: def validate_metadata(metadata: Dict[str, str]):
"""Function to validate PKG-INFO metadata
Raises `InvalidMetadataError` for invalid metadata.
"""
metadata_version = metadata.get("Metadata-Version", None)
if metadata_version not in VALID_VERSIONS:
raise InvalidMetadataError("No valid Metadata-Version found!")
for key, value in metadata.items():
validate = _get_key_validator(key, metadata_version)
validate(value)
If such an interface were added to a third-party package, Note that in this design, you can easily validate a JSON file containing the metadata by reading it into a dictionary with |
Any chance to fix this? |
I don't think it will be from this PR. I would have loved to get this fixed. It doesn't depend on me now, I'm afraid. 😞 @ssbarnea Go ahead in pypa/packaging#147 asking for progress, please! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you could update this it would be great as we really need to add metadata validation.
"""A string that can contain newline characters""" | ||
if val is None: | ||
return | ||
assert isinstance(val, str), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.
This contribution seems to have stalled, at an impasse due to lack of clarity on where the validation should be implemented or whose implementation should be used. I'm uninterested in merging functionality here that's going to be implemented redundantly in another package. As a result, I'm going to defer this merge indefinitely, though I'm happy to revive the PR or review a new one at such a point that a (shared) implementation is accepted. Thanks @bittner for your contribution, which will remain here even if not merged. |
Well, at least I tried. TBH, I'm a bit disappointed about how this PR went. I would have continued elsewhere if directed properly. I'm keen on finding the "right place" and implement an elegant, sustainable and beautiful solution (honoring PEP 20 and Clean Code). Clearly, you all know the code base better but you could have leveraged my motivation. Instead, I found myself blocked with some other's intention of "I'll do it myself" (which doesn't seem to have resulted in a working solution until today, half a year after). This is not meant as a personal rant. I guess, most of us do that kind of programming in their spare time, sacrificing their families. Honor this! And, please, open yourselves to contributions from people that are not yet part of "your club". (At least that was my, clearly personal, impression. Sorry about that.) |
With all due respect but I seen passing the dead-cat (metadata-validation) between several projects and this is really sad, for multiple reasons: discourage new contributors and fails to implement a feature that is highly desired and needed. I can understand that the maintainers are worried about added complexity and they prefer to focus on other things but once they realise that metadata needs to be validated by another project, they should be the first to setup this placeholder project and link to it. The other approach which is build it and we will use it if we like is not really constructive, as is used in many cases as an excuse to do nothing (not saying is the case here). |
@bittner Really sorry that you feel like your time/motivation was wasted here, I definitely feel like that's my fault. Unfortunately I haven't had much time (until now) to work on this myself, or help someone else work on it. If you're interested, there will surely still be work to be done here. I posted an update on pypa/packaging#147 outlining what's been holding this up and the current status. @ssbarnea I'm not sure what you mean. At least on this issue, I think @pganssle and I have agreed that metadata validation should exist in |
@di Super! I am happy to hear we reached an agreement regarding where this feature will land. |
@ssbarnea To be clear, there was never any disagreement here. The submitter merely misunderstood what was being asked of them. |
Yeah, sure. |
Summary of changes
Adds metadata validation that aborts packaging with invalid package information.
Closes #1390
Pull Request Checklist