Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fast-deps] Add a hook to download all "skipped" wheels #8638

Closed
wants to merge 6 commits into from

Conversation

McSinyx
Copy link
Contributor

@McSinyx McSinyx commented Jul 27, 2020

This creates a hook to explicitly download distributions of requirements whose metadata are obtain from their wheels lazily (GH-8588) so that

  1. It is possible to supply the required distributions to a worker pool for parallel download
  2. Something like pip install --dry-run can be cleanly implemented by interject an interruption right before the download

Although a decent amount of refactoring occurred, this should only affects the behavior of the fast-deps feature.

@McSinyx McSinyx force-pushed the late-download branch 6 times, most recently from a34332a to 191f962 Compare July 28, 2020 15:49
@pradyunsg
Copy link
Member

@McSinyx and I had a chat about this -- the goal here is to implement "do the downloads" as a separate loop within the resolver for now. We've decided to factor out the RequirementSet creation logic into a dedicated function for now, and to have a separate stage prior to creating the RequirementSet where we iterate through and perform the full downloads.

The main design issue we're facing is figuring out how to implement that separate stage without changing behaviors for the rest of the resolver. I have a feeling we'd need to make a lot of changes here. :)

@McSinyx McSinyx force-pushed the late-download branch 3 times, most recently from 0af3a4c to 5726655 Compare August 1, 2020 10:23
@McSinyx McSinyx marked this pull request as ready for review August 1, 2020 15:49
@McSinyx McSinyx changed the title Allow some wheels to be downloaded after resolution [fast-deps] Add a hook to download all "skipped" wheels Aug 1, 2020
@McSinyx McSinyx requested a review from pradyunsg August 1, 2020 15:53
content_file.write(chunk)
return file_path, response.headers.get('content-type', '')

def __call__(self, link, location=None, hashes=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously this class and function call did essentially 1 thing: download a file with progress. That was a good thing because it could be understood as a single small chunk, passed around, and used without much effort. When we add more responsibilities it gets harder to understand and maintain.

Instead of pulling functionality from RequirementPreparer into this, and then invoking this from the new resolver, what if we:

  1. move the functionality from RequirementPreparer somewhere else and invoke it from RequirementPreparer and the new resolver, or
  2. (preferred) move downloading out of RequirementPreparer all together, and do it either:
    1. internally in AbstractDistribution/Candidate when required to satisfy some part of processing (transparently to the caller), or
    2. in a common place after resolving has happened, if still needed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(preferred) move downloading out of RequirementPreparer all together, and do it either:

I haven't looked at this PR yet, but IIUC, this was the conclusion from what @McSinyx and I discussed. I do think I agree that Downloader might not be the best place to put this though. :)

Copy link
Contributor Author

@McSinyx McSinyx Aug 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On current master, network.download.Downloader only initiates the download and passes the lazy response to the preparation code which then performs the actual download (to a file). What 744896f aims to do is making sure that Downloader.__call__ will perform the download, with addition to hash checking to make the interface consistent when caching is involved. In the way I see it, Downloader should be able to act analogous to wget, which is given an URL/Link, optional output target (-O). I assume more responsibilities refers to the hash checking.

The reason I placed the call to Downloader directly in resolve is that it is easier to ensure that such call is thread-safe than a higher-level wrapper like a method in RequirementPreparer. It would also be more trivial to handle the progress bar for a batch of downloads directly within the Downloader than extracting it from the preparer.

If I understand correctly, then RequirementPreparer seems to be the preparer for candidates of the new resolver, and downloading something happens to be part of the work for some candidates, i.e. RequirementPreparer sometimes need to download to be able to construct an (Abstract)Distribution to give to the candidate, so I'm not sure if I understand the model (2.i) is referring to.

Regarding (2.ii), I completely agree that it would be really nice if we can have this call outside of the resolver code. However, in a discuss with @pradyunsg, we came to a conclusion that InstallRequirement is currently to tangled with the preparation code (or that it logically requires the underlying file to exist), so that we will move the call out once we reworked RequirementSet. @chrahunt, I'm not sure if I've correctly understood and addressed your points since my head is buzzing as I try to visualize this but I hope I am able to deliver some reasonings. Do they change your mind about 744896f or do you have any further suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pradyunsg, we came to a conclusion that InstallRequirement is currently to tangled with the preparation code (or that it logically requires the underlying file to exist), so that we will move the call out once we reworked RequirementSet.

The issue is that the code that gets run after Resolver.resolve returns, is closely tied to the assumption that the underlying file exists. We don't need for reworking of RequirementSet, but rather, we need to cleanly handle the "I need to be fully downloaded" state without needing to touch the old resolver code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@McSinyx I think some of my concern was an extension of my original comments in #8532, which I did not follow up on in #8588 (sorry). I played with it a bit and came up with #8685, which I think we can use as a pretty clean base for your download work. Please take a look!

Regarding the downloader, I think we can probably achieve the desired effect by creating a new function or class that uses it as-is instead of putting more logic into it. Since our fast-deps feature isn't enabled when doing hash checking, the wrapper won't need any of that stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need for reworking of RequirementSet, but rather, we need to cleanly handle the "I need to be fully downloaded" state without needing to touch the old resolver code.

Thank you for the correction.

I played with it a bit and came up with 8685, which I think we can use as a pretty clean base for your download work. Please take a look!

Thanks, I haven't read it carefully but the resulting hook seems really nice!

Regarding the downloader, I think we can probably achieve the desired effect by creating a new function or class that uses it as-is instead of putting more logic into it. Since our fast-deps feature isn't enabled when doing hash checking, the wrapper won't need any of that stuff.

This sounds good to me too. @pradyunsg said that you two had a discussion on this and I assume this is the consensus that came out of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about the hashing part, and IMHO apparently we should check them, just to make sure that a wheel is not corrupted during the download process.

Copy link
Member

@pradyunsg pradyunsg Aug 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about the hashing part, and IMHO apparently we should check them, just to make sure that a wheel is not corrupted during the download process.

If we're in "require-hashes" mode, doing partial downloads does not make sense, since we're only permitted to trust artifacts/files with that specific hash. And the only way to compute that hash is to download the entire file. Since we're going to download the entire file (and we know that if you have hashes, you also have performed dependency resolution already; so we're definitely using that downloaded file), we should not use partial downloads/fast-deps when hash-checking mode is enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I didn't communicate this right 😛 Let me provide more context (or at least the context I'm assuming):

  • We want a routine taking multiple URLs and download them in parallel, with UI taken into account
  • Before that's done, let's have a map/for that call something download of each of them, and that something should be thread-safe
  • It'll be difficult to keep all preparations thread-safe, even it is at the moment (I think)
    def prepare_linked_requirement_more(self, req, parallel_builds=False):
    # type: (InstallRequirement, bool) -> None
    """Prepare a linked requirement more, if needed."""
    if not req.needs_more_preparation:
    return
    self._prepare_linked_requirement(req, parallel_builds)
  • Rather, let what to be parallelized to be solely downloading the file to (usually) a temp dir and let the preparation code use it as cache latter. Both (after download and when verifying the cache) needs hash checking because the file might corrupt for some reason.
  • I'm not sure if it's the effect of --no-cache-dir, but some wheels just don't get cached within a session. This doesn't seem to sufficient:
    if is_from_cache(resp):
    logger.info("Using cached %s", logged_url)
    else:
    logger.info("Downloading %s", logged_url)
    but this is:
    def _check_download_dir(link, download_dir, hashes):
    # type: (Link, str, Optional[Hashes]) -> Optional[str]
    """ Check download_dir for previously downloaded file with correct hash
    If a correct file is found return its path else None
    """
    download_path = os.path.join(download_dir, link.filename)
    if not os.path.exists(download_path):
    return None
    # If already downloaded, does its hash match?
  • If we use a common temp dir for all downloads, this can be down automatically and this common state really well suit Downloader. Furthermore the similarity between Downloader.download_one (or something to be renamed from current .__call__) and Downloader.download_batch would be really nice.

I am truly sorry for not clarifying this earlier in a discussion point and ended up causing a decent amount of confusion 😅

@BrownTruck
Copy link
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck BrownTruck added the needs rebase or merge PR has conflicts with current master label Aug 4, 2020
@McSinyx
Copy link
Contributor Author

McSinyx commented Aug 5, 2020

I'm closing this in favor of GH-8685 and GH-8697.

@McSinyx McSinyx closed this Aug 5, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
needs rebase or merge PR has conflicts with current master
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants