Fix requires_python metadata + add repair metadata command #779

gerrod3 · 2024-12-12T08:16:05Z

fixes: #773

This also centralizes the parsing of metadata for PythonPackageContent into two functions:

parse_metadata: when the input data is from the PyPI json api
artifact_to_python_content_data: when the input data is directly from the artifact itself

I'll probably rename parse_metadata in a future PR, since its real purpose is to prepare the Python content data for saving (it is really only used in the sync pipeline right now). All the other ways Python content is created (Pulp upload, PyPI (twine) upload, and pull-through caching) now use artifact_to_python_content_data, which should make it easier to add or remove metadata fields on the model.

mdellweg · 2024-12-13T10:50:35Z

pulp_python/app/management/commands/repair-python-metadata.py

+    """Common list parsing for a string of hrefs/prns."""
+    h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)"
+    p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)"
+    r = rf"{h}|{p}"


I don't think "r" is technically needed here. But hey...

Have you heard about verbose regex though?

No I haven't, is there a better way to rewrite this? I'm so bad at regex...

https://docs.python.org/3/library/re.html#re.VERBOSE
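For illustration, here is how the pattern from the hunk above might look with re.VERBOSE. This is only a sketch: API_ROOT stands in for settings.API_ROOT and its value "/pulp/" is an assumption, as is the name REPO_RE. In verbose mode, whitespace in the pattern is ignored and "#" starts a comment, so each part of the expression can be documented inline.

```python
import re

# Assumption: stand-in for Django's settings.API_ROOT; the real value may differ.
API_ROOT = "/pulp/"

# Hypothetical name; the hunk builds the same alternation from h, p and r.
REPO_RE = re.compile(
    rf"""
    (?:                           # alternative 1: a repository href
        {re.escape(API_ROOT)}
        (?:[-_a-zA-Z0-9]+/)?      # optional extra path segment before api/
        api/v3/repositories/python/python/
        [-a-f0-9]+/               # the repository id
    )
    |
    (?:                           # alternative 2: a repository PRN
        prn:python\.pythonrepository:
        [-a-f0-9]+
    )
    """,
    re.VERBOSE,
)
```

Compiling once at module level also keeps the pattern reusable across calls.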

mdellweg · 2024-12-13T10:53:55Z

pulp_python/app/management/commands/repair-python-metadata.py

+    h = rf"(?:{settings.API_ROOT}(?:[-_a-zA-Z0-9]+/)?api/v3/repositories/python/python/[-a-f0-9]+/)"
+    p = r"(?:prn:python\.pythonrepository:[-a-f0-9]+)"
+    r = rf"{h}|{p}"
+    return re.findall(r, value)


This will just ignore all the parts of the string that don't match either pattern, right?
I think we should split on whitespace and match each piece individually, so that we can identify bad input.

In particular, gibberish is currently handled as if nothing had been specified at all, which would be surprising to users.

Yeah it currently ignores all bad input. I'll change it.

Also consider re.compile if the pattern is going to be used more than once.

Ok, I changed it. Let me know if I need to change it some more
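A minimal sketch of the approach discussed above: split the input, match each piece against a precompiled pattern, and reject anything that does not match instead of silently dropping it. The names API_ROOT, REPO_RE and parse_repositories are hypothetical, the value "/pulp/" is an assumption, and raising ValueError is just one possible way to report bad input; the actual command may do this differently.

```python
import re

# Assumption: stand-in for Django's settings.API_ROOT.
API_ROOT = "/pulp/"

# Compiled once so repeated calls do not rebuild the pattern.
REPO_RE = re.compile(
    rf"(?:{re.escape(API_ROOT)}(?:[-_a-zA-Z0-9]+/)?"
    r"api/v3/repositories/python/python/[-a-f0-9]+/)"
    r"|(?:prn:python\.pythonrepository:[-a-f0-9]+)"
)


def parse_repositories(value: str) -> list[str]:
    """Return every href/PRN in value, raising on anything unrecognized."""
    repositories = []
    for token in value.split():  # split on any run of whitespace
        # fullmatch rejects tokens with trailing junk, unlike findall or match.
        if REPO_RE.fullmatch(token) is None:
            raise ValueError(f"Not a Python repository href or PRN: {token!r}")
        repositories.append(token)
    return repositories
```

With re.findall, an input such as "gibberish" yields an empty list, which is indistinguishable from passing nothing; the per-token check turns that case into an explicit error.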

mdellweg · 2024-12-13T11:02:06Z

pulp_python/tests/functional/api/test_repair.py

+    assert content.version == "0.1"
+    assert content.packagetype == "sdist"
+    assert content.requires_python == ""  # technically null
+    assert content.author == "Austin Macdonald"


👋 @asmacdo

mdellweg · 2024-12-13T11:03:37Z

pulp_python/app/management/commands/repair-python-metadata.py

+    :param content: The PythonPackageContent queryset.
+    Return: number of content units that were repaired
+    """
+    # TODO: Add on_demand content repair?


This data repair is not meant to unblock a migration / update, right?
In that case I'd say a best effort approach is just fine. We never really own the on-demand content anyway.

No, it's not meant to unblock anything, but I have a feeling that a lot of packages in users' repositories contain bad metadata. For example, when we do a sync (which is on-demand by default), all the non-latest versions of a package can currently end up with incorrect metadata, because most of the available metadata (the info section) comes from the latest release. There is a way to get the correct metadata for each version, but it costs an extra API call per version, so I've never implemented it.

How bad is it? Is it documented?

It's not too hard; it's this API: https://docs.pypi.org/api/json/. You've seen it around; I've been using it in some of my scripts (the oci-images check-updates script, for example).
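To make the trade-off concrete, here is a rough sketch of the two PyPI JSON API endpoints involved. The project-level URL returns an info section describing the latest release, while the release-level URL returns the info section for one specific version, which is why correct per-version metadata costs one request per version. The helper names are hypothetical and this is not how pulp_python's sync pipeline is written.

```python
import json
from typing import Optional
from urllib.request import urlopen

PYPI_JSON_ROOT = "https://pypi.org/pypi"


def metadata_url(project: str, version: Optional[str] = None) -> str:
    """Build the project-level URL, or the release-level one if a version is given."""
    if version is None:
        return f"{PYPI_JSON_ROOT}/{project}/json"
    return f"{PYPI_JSON_ROOT}/{project}/{version}/json"


def fetch_requires_python(project: str, version: str) -> Optional[str]:
    """Fetch requires_python for one release (one HTTP request per version)."""
    with urlopen(metadata_url(project, version)) as response:
        info = json.load(response)["info"]
    return info.get("requires_python")
```

Calling fetch_requires_python once for every version of every synced package is the extra cost mentioned above.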


mdellweg · 2024-12-17T13:34:41Z

pulp_python/app/management/commands/repair-python-metadata.py

+    for v in value.split(" ,"):
+        if v:
+            if match := r.match(v):


Let's make the parsing even more user-friendly:

Suggested change

-    for v in value.split(" ,"):
-        if v:
-            if match := r.match(v):
+    for v in value.split(","):
+        if v:
+            if match := r.match(v.strip()):
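The difference between the two versions comes down to how str.split treats its argument: the separator is a literal string, not a set of characters, so split(" ,") only splits where a space is immediately followed by a comma. A small sketch (the function names and sample tokens are made up for illustration, and the second function filters on the stripped token, a slight variation on the suggestion):

```python
def split_old(value: str) -> list[str]:
    """Original behaviour: split on the literal two-character string ' ,'."""
    return [v for v in value.split(" ,") if v]


def split_new(value: str) -> list[str]:
    """Suggested behaviour: split on commas, then strip surrounding whitespace."""
    return [v.strip() for v in value.split(",") if v.strip()]


value = "prn:a, prn:b ,prn:c"
old_tokens = split_old(value)  # only the " ," between b and c is a separator
new_tokens = split_new(value)  # every comma separates, padding is removed
```

Here old_tokens still contains the combined piece "prn:a, prn:b", which would then fail to match as a single repository, while new_tokens holds three clean entries.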

gerrod3 force-pushed the python-metadata-fix-script branch 4 times, most recently from e320362 to 08eeaba Compare December 12, 2024 22:41

gerrod3 marked this pull request as ready for review December 12, 2024 23:19

mdellweg reviewed Dec 13, 2024


Fix requires_python metadata + add repair metadata command

72f3cd3

fixes: pulp#773

gerrod3 force-pushed the python-metadata-fix-script branch from 08eeaba to 72f3cd3 Compare December 16, 2024 18:40

mdellweg reviewed Dec 17, 2024
