feature: cache downloaded wheel information #2276

mayeut · 2022-01-01T19:59:27Z

Add downloaded wheel information in the relevant JSON embed file to prevent additional downloads of the same wheel.

This requires to bump the JSON embed file version once again as I was lacking some tests cases in #2273.

I will add some inline comments for some choices that I made.

Fixes #2268

Thanks for contributing, make sure you address all the checklists (for details on how see

development documentation)!

ran the linter to address style issues (tox -e fix_lint)
wrote descriptive pull request text
ensured there are test(s) validating the fix
added news fragment in docs/changelog folder
updated/extended the documentation

mayeut · 2022-01-01T20:02:03Z

src/virtualenv/seed/wheels/periodic_update.py

+def add_wheel_to_update_log(wheel, for_py_version, app_data):
+    embed_update_log = app_data.embed_update_log(wheel.distribution, for_py_version)
+    logging.debug("adding %s information to %s", wheel.name, embed_update_log.file)
+    u_log = UpdateLog.from_dict(embed_update_log.read())
+    if any(version.filename == wheel.name for version in u_log.versions):
+        logging.warning("%s already present in %s", wheel.name, embed_update_log.file)
+        return
+    version = NewVersion(wheel.name, datetime.now(), release_date_for_wheel_path(wheel.path), "download")
+    u_log.versions.append(version)  # always write at the end for proper updates
+    embed_update_log.write(u_log.to_dict())
+
+


This has been added to periodic_update.py as it is the only place dealing with the embed_update_log (which is not used only for updates anymore).

mayeut · 2022-01-01T20:03:45Z

src/virtualenv/seed/wheels/periodic_update.py

@@ -250,16 +265,20 @@ def _run_do_update(app_data, distribution, embed_filename, for_py_version, perio
    now = datetime.now()
    if periodic:
        source = "periodic"
-        # mark everything not updated manually as source "periodic"


All the modifications in _run_do_update are what lead me to bump the JSON embed format version.

mayeut · 2022-01-01T20:07:29Z

src/virtualenv/seed/wheels/periodic_update.py

+    # we need to sort versions for the next update to work properly
+    u_log.versions.sort(reverse=True)


The log needs to be sorted. It was sorted by design before. That's not the case anymore.

Can you explain why is no longer the case?

This would remove the need to sort the list.

I don't mind sorting the log; why would you consider the other solution less risky?

Can you explain why is no longer the case?

because other sources logs are always appended rather than inserted.

I don't mind sorting the log; why would you consider the other solution less risky?

I'm feeling a bit hesitant to rely on the tuple ordering given:

def as_version_tuple(version): result = [] for part in version.split(".")[0:3]: try: result.append(int(part)) except ValueError: break if not result: raise ValueError(version) return tuple(result)

I feel that the inner try/catch is hiding something I did not think about.

I did not think of pre-release versions... In this specific sample, it does not matter much but it's not sorted correctly.

{ "completed": "2022-01-01T21:48:50.557509Z", "periodic": false, "started": "2022-01-01T21:48:40.230835Z", "versions": [ { "filename": "pip-21.3.1-py3-none-any.whl", "found_date": "2022-01-01T14:22:37.717326Z", "release_date": "2021-10-22T15:57:09.000000Z", "source": "manual" }, { "filename": "pip-20.3-py2.py3-none-any.whl", "found_date": "2022-01-01T21:46:25.286240Z", "release_date": "2020-11-30T12:47:49.000000Z", "source": "manual" }, { "filename": "pip-20.0-py2.py3-none-any.whl", "found_date": "2022-01-01T21:48:23.634459Z", "release_date": "2020-01-21T11:42:56.000000Z", "source": "manual" }, { "filename": "pip-20.3b1-py2.py3-none-any.whl", "found_date": "2022-01-01T21:44:30.798536Z", "release_date": "2020-10-31T19:25:12.000000Z", "source": "manual" } ] }

Further-more, a manual/periodic update should never lead to defaulting to pre-release version. This promise is broken (at least for manual updates).

Given an added test with a pre-release version downloaded as "pinned" version indeed broke manual updates, I went with the alternate implementation of _run_do_update which first splits the UpdateLog in 2 depending if it's coming from a previous update or another source.

mayeut · 2022-01-01T20:15:21Z

Edit: this is used by the latest commit

There's another way to do the update which might be a less bit risky for periodic updates:

Step A: filter the versions from the UpdateLog to only have manual/periodic sources on one hand (back to the previous problem) and other sources on the other hand.
Step B: run the update as was done before
Step C: filter-out newly found versions from the "other sources" list
Step D: concatenate the two versions list (the first half being from version update & the 2nd half from other sources)

This would remove the need to sort the list.

mayeut · 2022-01-01T20:24:58Z

Caveat:
There is a time penalty for first downloads compared to the previous version in order to fill the release_date.
Now that I'm writing this, I now realize it might not be necessary to fill this one which is only ever used by NewVersion.use when the source is periodic.

Edit: this is no longer the case with the latest push.

mayeut · 2022-01-01T20:36:45Z

@gaborbernat,

Writing things down makes me realize that there are still some things left to do so converting to draft.
Do you have any remarks to share about the comments I left ?

PS: while thinking about the time penalty I mentioned, I thought about caching PyPI requests on disk as well and use the etag header to know if actually pulling the data (setuptools JSON is about 1MB) was necessary. I deemed this was too heavy to implement for my simple use-case and not worth spending more time on this. I thought of another use-case which might benefit from this: users configuring --download as the default setting (checking if there was an update on PyPI should be much faster than starting a pip download. I guess asking PyPI this way is the first thing done by pip)

Add downloaded wheel information in the relevant JSON embed file to prevent additional downloads of the same wheel. This requires to bump the JSON embed file version.

mayeut · 2022-01-02T00:46:55Z

I rewrote the _run_do_update function as mentioned in #2276 (comment) and removed the time penalty I mentioned in #2276 (comment).

gaborbernat · 2022-01-02T14:42:38Z

tests/unit/seed/wheels/test_periodic_update.py

    assert read_dict.call_count == 1
    assert write.call_count == 1


-def test_download_manual_stop_at_first_usable(tmp_path, mocker, freezer):
+def test_download_periodic_stop_at_first_usable2(tmp_path, mocker, freezer):


Why the 2 suffix?

there's already a test_download_periodic_stop_at_first_usable test and I didn't see how to do this as a parametrize test easily so adding a suffix 2 on the second one.

Instead of magic numbers, I generally add to the test name what differentiates them 🤔 for future reference.

Signed-off-by: Bernát Gábor <[email protected]>

mayeut commented Jan 1, 2022

View reviewed changes

gaborbernat approved these changes Jan 1, 2022

View reviewed changes

mayeut commented Jan 1, 2022

View reviewed changes

mayeut marked this pull request as draft January 1, 2022 20:29

feature: cache downloaded wheel information

3937710

Add downloaded wheel information in the relevant JSON embed file to prevent additional downloads of the same wheel. This requires to bump the JSON embed file version.

mayeut force-pushed the download-cache branch from 3775c8d to 3937710 Compare January 2, 2022 00:39

mayeut marked this pull request as ready for review January 2, 2022 00:45

mayeut requested a review from gaborbernat January 2, 2022 00:45

gaborbernat reviewed Jan 2, 2022

View reviewed changes

PR feedback

f0e48bb

Signed-off-by: Bernát Gábor <[email protected]>

gaborbernat approved these changes Jan 2, 2022

View reviewed changes

gaborbernat merged commit 319a540 into pypa:main Jan 2, 2022

mayeut deleted the download-cache branch January 2, 2022 16:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature: cache downloaded wheel information #2276

feature: cache downloaded wheel information #2276

mayeut commented Jan 1, 2022 •

edited

Loading

mayeut Jan 1, 2022 •

edited

Loading

mayeut Jan 1, 2022

mayeut Jan 1, 2022

gaborbernat Jan 1, 2022

gaborbernat Jan 1, 2022

mayeut Jan 1, 2022

mayeut Jan 1, 2022

mayeut Jan 2, 2022

mayeut commented Jan 1, 2022 •

edited

Loading

mayeut commented Jan 1, 2022 •

edited

Loading

mayeut commented Jan 1, 2022

mayeut commented Jan 2, 2022

gaborbernat Jan 2, 2022

mayeut Jan 2, 2022

gaborbernat Jan 2, 2022

		# we need to sort versions for the next update to work properly
		u_log.versions.sort(reverse=True)

feature: cache downloaded wheel information #2276

feature: cache downloaded wheel information #2276

Conversation

mayeut commented Jan 1, 2022 • edited Loading

Thanks for contributing, make sure you address all the checklists (for details on how see

mayeut Jan 1, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayeut commented Jan 1, 2022 • edited Loading

mayeut commented Jan 1, 2022 • edited Loading

mayeut commented Jan 1, 2022

mayeut commented Jan 2, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mayeut commented Jan 1, 2022 •

edited

Loading

mayeut Jan 1, 2022 •

edited

Loading

mayeut commented Jan 1, 2022 •

edited

Loading

mayeut commented Jan 1, 2022 •

edited

Loading