Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix requires_python metadata + add repair metadata command #779

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gerrod3
Copy link
Contributor

@gerrod3 gerrod3 commented Dec 12, 2024

fixes: #773

This also centralizes the parsing of metadata for PythonPackageContent into two functions:

  • parse_metadata: when the input data is from the PyPI json api
  • artifact_to_python_content_data: when the input data is directly from the artifact itself

I'll probably rename parse_metadata in a future PR since its real purpose is to prepare the python content data for saving (really only used in the sync pipeline right now). All the other ways python content is made, pulp upload, pypi (twine) upload, pull-through caching, now use artifact_to_python_content_data which should make it easier to add/remove metadata from the model.

@gerrod3 gerrod3 force-pushed the python-metadata-fix-script branch 4 times, most recently from e320362 to 08eeaba Compare December 12, 2024 22:41
@gerrod3 gerrod3 marked this pull request as ready for review December 12, 2024 23:19
"""Common list parsing for a string of hrefs/prns."""
h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)"
p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)"
r = rf"{h}|{p}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think "r" is technically needed here. But hey...

Have you heard about verbose regex though?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I haven't, is there a better way to rewrite this? I'm so bad at regex...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)"
p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)"
r = rf"{h}|{p}"
return re.findall(r, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will just ignore all the parts of the string that don't match either right?
I think we should whitespace split and match individually to be able to identify bad input.

Specifically since gibberish is currently handled as if not specified at all and that would be surprising to users.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it currently ignores all bad input. I'll change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also think about re.compile when calling more often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed it. Let me know if I need to change it some more

assert content.version == "0.1"
assert content.packagetype == "sdist"
assert content.requires_python == "" # technically null
assert content.author == "Austin Macdonald"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @asmacdo

:param content: The PythonPackageContent queryset.
Return: number of content units that were repaired
"""
# TODO: Add on_demand content repair?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This data repair is not meant to unblock a migration / update, right?
In that case I'd say a best effort approach is just fine. We never really own the on-demand content anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it's not meant to unblock, but I have a feeling that a lot of packages in users' repositories contain bad metadata. Like when we do a sync (which are default on-demand) currently all the non-latest versions of a package potentially get incorrect metadata because the majority of the available metadata (the info section) is from the latest package. There is a way to get the correct metadata for each version, but it adds an API call for each version, so I've never implemented it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How bad is it? Is it documented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not too hard, it's this api: https://docs.pypi.org/api/json/. You've seen it around, I've been using it in some of my scripts (oci-images check updates script for example)

@gerrod3 gerrod3 force-pushed the python-metadata-fix-script branch from 08eeaba to 72f3cd3 Compare December 16, 2024 18:40
Comment on lines +61 to +63
for v in value.split(" ,"):
if v:
if match := r.match(v):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's parse even more user friendly:

Suggested change
for v in value.split(" ,"):
if v:
if match := r.match(v):
for v in value.split(","):
if v:
if match := r.match(v.strip()):

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

no requires_python delivered
2 participants