[fast-deps] Add a hook to download all "skipped" wheels #8638

McSinyx · 2020-07-27T14:42:21Z

This creates a hook to explicitly download distributions of requirements whose metadata are obtain from their wheels lazily (GH-8588) so that

It is possible to supply the required distributions to a worker pool for parallel download
Something like pip install --dry-run can be cleanly implemented by interject an interruption right before the download

Although a decent amount of refactoring occurred, this should only affects the behavior of the fast-deps feature.

src/pip/_internal/resolution/resolvelib/resolver.py

pradyunsg · 2020-07-29T02:11:58Z

@McSinyx and I had a chat about this -- the goal here is to implement "do the downloads" as a separate loop within the resolver for now. We've decided to factor out the RequirementSet creation logic into a dedicated function for now, and to have a separate stage prior to creating the RequirementSet where we iterate through and perform the full downloads.

The main design issue we're facing is figuring out how to implement that separate stage without changing behaviors for the rest of the resolver. I have a feeling we'd need to make a lot of changes here. :)

chrahunt · 2020-08-02T01:17:47Z

src/pip/_internal/network/download.py

+                content_file.write(chunk)
+        return file_path, response.headers.get('content-type', '')
+
+    def __call__(self, link, location=None, hashes=None):


Previously this class and function call did essentially 1 thing: download a file with progress. That was a good thing because it could be understood as a single small chunk, passed around, and used without much effort. When we add more responsibilities it gets harder to understand and maintain.

Instead of pulling functionality from RequirementPreparer into this, and then invoking this from the new resolver, what if we:

move the functionality from RequirementPreparer somewhere else and invoke it from RequirementPreparer and the new resolver, or

(preferred) move downloading out of RequirementPreparer all together, and do it either:

internally in AbstractDistribution/Candidate when required to satisfy some part of processing (transparently to the caller), or

in a common place after resolving has happened, if still needed

(preferred) move downloading out of RequirementPreparer all together, and do it either:

I haven't looked at this PR yet, but IIUC, this was the conclusion from what @McSinyx and I discussed. I do think I agree that Downloader might not be the best place to put this though. :)

On current master, network.download.Downloader only initiates the download and passes the lazy response to the preparation code which then performs the actual download (to a file). What 744896f aims to do is making sure that Downloader.__call__ will perform the download, with addition to hash checking to make the interface consistent when caching is involved. In the way I see it, Downloader should be able to act analogous to wget, which is given an URL/Link, optional output target (-O). I assume more responsibilities refers to the hash checking.

The reason I placed the call to Downloader directly in resolve is that it is easier to ensure that such call is thread-safe than a higher-level wrapper like a method in RequirementPreparer. It would also be more trivial to handle the progress bar for a batch of downloads directly within the Downloader than extracting it from the preparer.

If I understand correctly, then RequirementPreparer seems to be the preparer for candidates of the new resolver, and downloading something happens to be part of the work for some candidates, i.e. RequirementPreparer sometimes need to download to be able to construct an (Abstract)Distribution to give to the candidate, so I'm not sure if I understand the model (2.i) is referring to.

Regarding (2.ii), I completely agree that it would be really nice if we can have this call outside of the resolver code. However, in a discuss with @pradyunsg, we came to a conclusion that InstallRequirement is currently to tangled with the preparation code (or that it logically requires the underlying file to exist), so that we will move the call out once we reworked RequirementSet. @chrahunt, I'm not sure if I've correctly understood and addressed your points since my head is buzzing as I try to visualize this but I hope I am able to deliver some reasonings. Do they change your mind about 744896f or do you have any further suggestions?

@pradyunsg, we came to a conclusion that InstallRequirement is currently to tangled with the preparation code (or that it logically requires the underlying file to exist), so that we will move the call out once we reworked RequirementSet.

The issue is that the code that gets run after Resolver.resolve returns, is closely tied to the assumption that the underlying file exists. We don't need for reworking of RequirementSet, but rather, we need to cleanly handle the "I need to be fully downloaded" state without needing to touch the old resolver code.

@McSinyx I think some of my concern was an extension of my original comments in #8532, which I did not follow up on in #8588 (sorry). I played with it a bit and came up with #8685, which I think we can use as a pretty clean base for your download work. Please take a look!

Regarding the downloader, I think we can probably achieve the desired effect by creating a new function or class that uses it as-is instead of putting more logic into it. Since our fast-deps feature isn't enabled when doing hash checking, the wrapper won't need any of that stuff.

We don't need for reworking of RequirementSet, but rather, we need to cleanly handle the "I need to be fully downloaded" state without needing to touch the old resolver code.

Thank you for the correction.

I played with it a bit and came up with 8685, which I think we can use as a pretty clean base for your download work. Please take a look!

Thanks, I haven't read it carefully but the resulting hook seems really nice!

Regarding the downloader, I think we can probably achieve the desired effect by creating a new function or class that uses it as-is instead of putting more logic into it. Since our fast-deps feature isn't enabled when doing hash checking, the wrapper won't need any of that stuff.

This sounds good to me too. @pradyunsg said that you two had a discussion on this and I assume this is the consensus that came out of it.

I was thinking about the hashing part, and IMHO apparently we should check them, just to make sure that a wheel is not corrupted during the download process.

I was thinking about the hashing part, and IMHO apparently we should check them, just to make sure that a wheel is not corrupted during the download process.

If we're in "require-hashes" mode, doing partial downloads does not make sense, since we're only permitted to trust artifacts/files with that specific hash. And the only way to compute that hash is to download the entire file. Since we're going to download the entire file (and we know that if you have hashes, you also have performed dependency resolution already; so we're definitely using that downloaded file), we should not use partial downloads/fast-deps when hash-checking mode is enabled.

I think I didn't communicate this right 😛 Let me provide more context (or at least the context I'm assuming):

We want a routine taking multiple URLs and download them in parallel, with UI taken into account

Before that's done, let's have a map/for that call something download of each of them, and that something should be thread-safe

It'll be difficult to keep all preparations thread-safe, even it is at the moment (I think)

pip/src/pip/_internal/operations/prepare.py

Lines 493 to 498 in 8b838eb

def prepare_linked_requirement_more(self, req, parallel_builds=False):

# type: (InstallRequirement, bool) -> None

"""Prepare a linked requirement more, if needed."""

if not req.needs_more_preparation:

return

self._prepare_linked_requirement(req, parallel_builds)

Rather, let what to be parallelized to be solely downloading the file to (usually) a temp dir and let the preparation code use it as cache latter. Both (after download and when verifying the cache) needs hash checking because the file might corrupt for some reason.

I'm not sure if it's the effect of --no-cache-dir, but some wheels just don't get cached within a session. This doesn't seem to sufficient:

pip/src/pip/_internal/network/download.py

Lines 63 to 66 in 370322e

if is_from_cache(resp):

logger.info("Using cached %s", logged_url)

else:

logger.info("Downloading %s", logged_url)

but this is:

pip/src/pip/_internal/operations/prepare.py

Lines 290 to 300 in 370322e

def _check_download_dir(link, download_dir, hashes):

# type: (Link, str, Optional[Hashes]) -> Optional[str]

""" Check download_dir for previously downloaded file with correct hash

If a correct file is found return its path else None

"""

download_path = os.path.join(download_dir, link.filename)

if not os.path.exists(download_path):

return None

# If already downloaded, does its hash match?

If we use a common temp dir for all downloads, this can be down automatically and this common state really well suit Downloader. Furthermore the similarity between Downloader.download_one (or something to be renamed from current .__call__) and Downloader.download_batch would be really nice.

I am truly sorry for not clarifying this earlier in a discussion point and ended up causing a decent amount of confusion 😅

BrownTruck · 2020-08-04T16:45:07Z

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

McSinyx · 2020-08-05T07:55:10Z

I'm closing this in favor of GH-8685 and GH-8697.

McSinyx force-pushed the late-download branch 6 times, most recently from a34332a to 191f962 Compare July 28, 2020 15:49

pradyunsg reviewed Jul 29, 2020

View reviewed changes

src/pip/_internal/resolution/resolvelib/resolver.py Outdated Show resolved Hide resolved

McSinyx force-pushed the late-download branch 3 times, most recently from 0af3a4c to 5726655 Compare August 1, 2020 10:23

McSinyx added 6 commits August 1, 2020 20:41

Separate req set creation into a new method

03b460b

Make the call to actual download a bit less nested

744896f

Clean up unpack_url, especially its docstring and comments

823648b

Type hint derivations of Candidates.get_install_requirement explicitly

e6cb24c

Expose InstallRequirementBackedCandidate and its .link

d8484e0

Explicitly download wheels of link candidates

158931d

McSinyx force-pushed the late-download branch from 5726655 to 158931d Compare August 1, 2020 15:47

McSinyx marked this pull request as ready for review August 1, 2020 15:49

McSinyx changed the title ~~Allow some wheels to be downloaded after resolution~~ [fast-deps] Add a hook to download all "skipped" wheels Aug 1, 2020

McSinyx requested a review from pradyunsg August 1, 2020 15:53

chrahunt reviewed Aug 2, 2020

View reviewed changes

This was referenced Aug 2, 2020

Abstract away AbstractDistribution in higher-level resolver code #8629

Merged

Move fast-dep handling to RequirementPreparer #8685

Merged

McSinyx mentioned this pull request Aug 4, 2020

[fast-deps] How should we isolate the networking code for late download in parallel? #8697

Open

BrownTruck added the needs rebase or merge PR has conflicts with current master label Aug 4, 2020

McSinyx closed this Aug 5, 2020

github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fast-deps] Add a hook to download all "skipped" wheels #8638

[fast-deps] Add a hook to download all "skipped" wheels #8638

McSinyx commented Jul 27, 2020 •

edited

Loading

pradyunsg commented Jul 29, 2020

chrahunt Aug 2, 2020

pradyunsg Aug 2, 2020

McSinyx Aug 2, 2020 •

edited

Loading

pradyunsg Aug 2, 2020

chrahunt Aug 2, 2020

McSinyx Aug 3, 2020

McSinyx Aug 3, 2020

pradyunsg Aug 4, 2020 •

edited

Loading

McSinyx Aug 4, 2020

BrownTruck commented Aug 4, 2020

McSinyx commented Aug 5, 2020

	def prepare_linked_requirement_more(self, req, parallel_builds=False):
	# type: (InstallRequirement, bool) -> None
	"""Prepare a linked requirement more, if needed."""
	if not req.needs_more_preparation:
	return
	self._prepare_linked_requirement(req, parallel_builds)

	if is_from_cache(resp):
	logger.info("Using cached %s", logged_url)
	else:
	logger.info("Downloading %s", logged_url)

	def _check_download_dir(link, download_dir, hashes):
	# type: (Link, str, Optional[Hashes]) -> Optional[str]
	""" Check download_dir for previously downloaded file with correct hash
	If a correct file is found return its path else None
	"""
	download_path = os.path.join(download_dir, link.filename)

	if not os.path.exists(download_path):
	return None

	# If already downloaded, does its hash match?

[fast-deps] Add a hook to download all "skipped" wheels #8638

[fast-deps] Add a hook to download all "skipped" wheels #8638

Conversation

McSinyx commented Jul 27, 2020 • edited Loading

pradyunsg commented Jul 29, 2020

chrahunt Aug 2, 2020

Choose a reason for hiding this comment

pradyunsg Aug 2, 2020

Choose a reason for hiding this comment

McSinyx Aug 2, 2020 • edited Loading

Choose a reason for hiding this comment

pradyunsg Aug 2, 2020

Choose a reason for hiding this comment

chrahunt Aug 2, 2020

Choose a reason for hiding this comment

McSinyx Aug 3, 2020

Choose a reason for hiding this comment

McSinyx Aug 3, 2020

Choose a reason for hiding this comment

pradyunsg Aug 4, 2020 • edited Loading

Choose a reason for hiding this comment

McSinyx Aug 4, 2020

Choose a reason for hiding this comment

BrownTruck commented Aug 4, 2020

McSinyx commented Aug 5, 2020

McSinyx commented Jul 27, 2020 •

edited

Loading

McSinyx Aug 2, 2020 •

edited

Loading

pradyunsg Aug 4, 2020 •

edited

Loading