Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add metadata validation #1562

Closed
wants to merge 1 commit into from

Conversation

bittner
Copy link
Contributor

@bittner bittner commented Oct 29, 2018

Summary of changes

Adds metadata validation that aborts packaging with invalid package information.

Closes #1390

Pull Request Checklist

  • Changes have tests
  • News fragment added in changelog.d. See documentation for details

@bittner
Copy link
Contributor Author

bittner commented Oct 29, 2018

As suggested in #1390 (comment) I've added the information about which field should satisfy which validation to the new validation module directly.

I still have to write the test for the new module. This is meant as WIP, so I can have the code peer-reviewed more quickly. (Thank you for your understanding!)

@bittner
Copy link
Contributor Author

bittner commented Oct 29, 2018

Looks like I have to make the tests pass for Python 2.

Also, the Exception details son't seem to be printed out in Python 2 -- while they do in Python 3, e.g.

E   setuptools.validation.InvalidMetadataError: [('provides_extras', AssertionError("is of type <class 'set'> but should be of type list(str)",))]

versus Python 2:

E   InvalidMetadataError

    setuptools/validation.py:97: InvalidMetadataError

Copy link
Member

@pganssle pganssle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of adding any sort of field validation to the public interface of setuptools, and I see no reason to wrap the DistributionMetadata in a validation.Metadata class.

If we're going to continue on this path, we should put validators in a setuptools._validation module. I would start with adding a validate_metadata function that acts on DistributionMetadata objects, though in the long run we may need to tweak how precisely it works.

Also, all that said, I think @di's suggestion of putting a metadata validator in packaging instead of in setuptools is a good idea. This is a function that all build backends will likely need.

@@ -55,17 +56,41 @@ def get_metadata_version(dist_md):
def write_pkg_file(self, file):
"""Write the PKG-INFO format data to a file object.
"""
metadata = Metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you wrapping DistributionMetadata in a Metadata object? Even if there were not other problems with this approach, I would think that the validator could work just as well on DistributionMetadata.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. You're right.

On the other hand, consider the validation functionality suggested by @di: it requires that you pass a dictionary to the validation class constructor. How would you solve this with the DistributionMetadata constructor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe @di was referring to more generic functionality for a package other than setuptools, so it depends on the details there. One option is to just generate a non-validated PKG-INFO and pass it to a function that validates PKG-INFO files.

Another is to have an intermediate format that represents PKG-INFO as JSON (thus retaining some type information and avoiding the problem where content and format are indistinguishable) which is convertable by packaging into a PKG-INFO file. packaging could then validate the JSON file before a PKG-INFOfile is generated.

If we're already trying to validate a DistributionMetadata file, then I would just validate the results of the actual getter operations.

setattr(self, key, kwargs[key])
self.fields += [key]

def validate(self, throw_exception=False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need for throw_exception, if people want to suppress the exceptions from this function, they can do:

try:
    metadata.validate()
    invalid = False
except InvalidMetadataError:
    invalid = True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that @di suggested to add validation functionality that behaves like validate().errors, so that it integrates nicely into Warehouse code.

I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.

Copy link
Member

@pganssle pganssle Nov 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that @di suggested to add validation functionality that behaves like validate().errors, so that it integrates nicely into Warehouse code.

He suggested no such thing. He suggested that something like the warehouse functionality go into a different project, pypa/packaging, not this one. I think you may have misunderstood him.

I felt that throwing an exception by default would make code more clumsy. It's just not beautiful. -- I wouldn't want to miss that possibility in general, though.

This is in general not an example of good design, because there are now two different error handling pathways. Additionally, the Pythonic way to signal an error condition is to raise an exception. It is not clumsy and in fact it leads to a lot of clumsy code in downstream consumers.

The reason is that in most cases, the functions calling your function can do nothing about an error except abort early and notify the user, if you raise an exception in error conditions, anyone who isn't going to try to recover from the error can just use your function as if it always works because the exception you raise will bubble up the stack until someone either catches it or the program is aborted and the user is notified of the error. If you return an error code instead, you are relying on your consumers to know that your function can either return an error code of some sort or not and raise an error or a handler.

"""A string that can contain newline characters"""
if val is None:
return
assert isinstance(val, str), \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use assert for control flow. It is disabled when python is run optimized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would be similarly elegant? I don't want to do an if foo: raise Bar() in all that many places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert is not intended to be an "elegant" shorthand for raising an exception, it's a way of programmatically making assertions about how your code should work. You use it for things like enforcing a function's contract or documenting assumptions made by the programmer (which are then validated at runtime).

In this specific case, you cannot use it because it will not work. Assertions are to be treated as optional statements and if you run Python in optimized mode, they will simply not run at all, for example:

$ python -Oc "assert False"
$ python -c "assert False"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError

In this particular case, I would either raise ValueError or convert validators to functions returning a boolean, depending on the details. I'm not sure it's worth working out the details in this case, though, because as mentioned elsewhere this functionality rightly belongs in another library.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ssbarnea I think at this point this PR needs a much bigger overhaul, and most of it may need to live in this repo. Probably best to hold off on cleaning up this specific code until the overhaul is done.

@di
Copy link
Member

di commented Oct 29, 2018

Agree with @pganssle here, there should be no actual validation logic added to setuptools unless it is somehow setuptools-specific.

@bittner
Copy link
Contributor Author

bittner commented Nov 1, 2018

I agree with you. There are some things I still need to clarify for myself, though:

  1. setuptools is just a "tool" to generate metadata to publish a package (among other things)
  2. distutils is just another tool (like setuptools, see above) -- let's ignore the unlucky monkey-patch inheritance nature of setuptools for now
  3. distutils is mentioned as a deprecation candidate in other discussions
  4. other places, such as Warehouse, may also require validating metadata (i.e. the validation code)
  5. if cleanly implemented, there should probably be a separate, independent module, specifically dedicated to just the metadata used in Python or for Python modules, and their validation.

TL;DR

@pganssle
Copy link
Member

pganssle commented Nov 1, 2018

@bittner
Regarding points 1, 2 and 3, setuptools is an extension of distutils. Both are tools for building python packages - i.e. converting the source repository into something distributable/installable. distutils is basically deprecated in the sense that it is supplanted by setuptools and likely distutils will move into setuptools. PEP 517 adds a mechanism for freeing the tight coupling between installer tools like pip and distutils and setuptools. This will allow for the user of other build backends and in general is a sounder design because it's based on well-defined interfaces.

Regarding point 4, there are many points you may want to validate the metdata:

  1. Immediately before or after generation (setuptools, flit, other build backends)
  2. Before upload (twine)
  3. Before accepting it to a package index (warehouse, bandersnatch)
  4. Before installation (pip)

The metadata fields and formats are a standard, and as such it is likely that anyone who uses it may want to be able to know the difference between metadata that is compliant with the standard and not compliant. We could write separate validation logic for each of these (and possibly one or more of them will re-implement validation), but it may be a needless duplication of effort, and you may end up with metadata producers generating something that metadata consumers consider invalid if one is more strict than the other.

Regarding your questions:

Is the write_pkg_file function the right place to add the use of metadata validation, like implemented in this PR?

This depends on the nature of the common validator, but I think so, yes.

Where should the validation business logic be added -- if not in distutils.dist.DistributionMetadata (as of item 3 above)? Should I move the validation module I introduced with this PR to the packaging package?

I don't know what you mean by "validation business logic", but I'm guessing you mean the validate method? If so, I would say it belongs in another repo. Possibly pypa/packaging, but I would open an issue there to figure out if that's something they want to include or if we should possibly make it a completely separate package, and if so what the details would be.

If it were me designing it, I would probably design an interface that looks like this:

def validate_metadata(metadata: Dict[str, str]):
    """Function to validate PKG-INFO metadata

    Raises `InvalidMetadataError` for invalid metadata.
    """
    metadata_version = metadata.get("Metadata-Version", None)
    if metadata_version not in VALID_VERSIONS:
        raise InvalidMetadataError("No valid Metadata-Version found!")

    for key, value in metadata.items():
        validate = _get_key_validator(key, metadata_version)
        validate(value)

metadata would be a dictionary mapping the keys from the appropriate version of the Core Metadata Specification to a function that checks whether value is a valid value for that key in that version of the metadata and throws InvalidMetadataError otherwise.

If such an interface were added to a third-party package, write_pkg_info would need to be modified such that it generates a dictionary (possibly ordered) of key-value pairs, then writes the key-value pairs to file in the PKG-INFO format only at the end of the function. This dictionary could then be passed to the validate_metadata function.

Note that in this design, you can easily validate a JSON file containing the metadata by reading it into a dictionary with json.load.

@ssbarnea
Copy link

Any chance to fix this?

@bittner
Copy link
Contributor Author

bittner commented Dec 31, 2018

Any chance to fix this?

I don't think it will be from this PR. I would have loved to get this fixed. It doesn't depend on me now, I'm afraid. 😞

@ssbarnea Go ahead in pypa/packaging#147 asking for progress, please!

Copy link

@ssbarnea ssbarnea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could update this it would be great as we really need to add metadata validation.

"""A string that can contain newline characters"""
if val is None:
return
assert isinstance(val, str), \

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bittner Please update the code and use raise exception. Assert should be used only inside tests and I am sure that some linters are even able to identify accidental use outside tests.

@jaraco
Copy link
Member

jaraco commented Mar 21, 2020

This contribution seems to have stalled, at an impasse due to lack of clarity on where the validation should be implemented or whose implementation should be used. I'm uninterested in merging functionality here that's going to be implemented redundantly in another package. As a result, I'm going to defer this merge indefinitely, though I'm happy to revive the PR or review a new one at such a point that a (shared) implementation is accepted.

Thanks @bittner for your contribution, which will remain here even if not merged.

@jaraco jaraco closed this Mar 21, 2020
@bittner
Copy link
Contributor Author

bittner commented Mar 23, 2020

Well, at least I tried. TBH, I'm a bit disappointed about how this PR went. I would have continued elsewhere if directed properly.

I'm keen on finding the "right place" and implement an elegant, sustainable and beautiful solution (honoring PEP 20 and Clean Code). Clearly, you all know the code base better but you could have leveraged my motivation. Instead, I found myself blocked with some other's intention of "I'll do it myself" (which doesn't seem to have resulted in a working solution until today, half a year after).

This is not meant as a personal rant. I guess, most of us do that kind of programming in their spare time, sacrificing their families. Honor this! And, please, open yourselves to contributions from people that are not yet part of "your club". (At least that was my, clearly personal, impression. Sorry about that.)

@ssbarnea
Copy link

With all due respect but I seen passing the dead-cat (metadata-validation) between several projects and this is really sad, for multiple reasons: discourage new contributors and fails to implement a feature that is highly desired and needed.

I can understand that the maintainers are worried about added complexity and they prefer to focus on other things but once they realise that metadata needs to be validated by another project, they should be the first to setup this placeholder project and link to it.

The other approach which is build it and we will use it if we like is not really constructive, as is used in many cases as an excuse to do nothing (not saying is the case here).

@di
Copy link
Member

di commented Mar 24, 2020

@bittner Really sorry that you feel like your time/motivation was wasted here, I definitely feel like that's my fault. Unfortunately I haven't had much time (until now) to work on this myself, or help someone else work on it. If you're interested, there will surely still be work to be done here. I posted an update on pypa/packaging#147 outlining what's been holding this up and the current status.

@ssbarnea I'm not sure what you mean. At least on this issue, I think @pganssle and I have agreed that metadata validation should exist in packaging, and this has been the plan from PyPI's perspective as well for quite a while.. This isn't a case of "setuptools maintainers" saying "packaging maintainers" should do it instead... we're all working on the same thing here.

@ssbarnea
Copy link

@di Super! I am happy to hear we reached an agreement regarding where this feature will land.

@pganssle
Copy link
Member

@ssbarnea To be clear, there was never any disagreement here. The submitter merely misunderstood what was being asked of them.

@bittner
Copy link
Contributor Author

bittner commented Mar 25, 2020

The submitter merely misunderstood what was being asked of them.

Yeah, sure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Newlines in the description field produce a malformed PKG-INFO
5 participants