Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] caching VCS requirements that are considered immutables #6640

Closed
sbidoul opened this issue Jun 23, 2019 · 12 comments · Fixed by #6851
Closed

[RFC] caching VCS requirements that are considered immutables #6640

sbidoul opened this issue Jun 23, 2019 · 12 comments · Fixed by #6851
Labels
auto-locked Outdated issues that have been locked by automation C: cache Dealing with cache and files in it C: network connectivity type: feature request Request for a new feature

Comments

@sbidoul
Copy link
Member

sbidoul commented Jun 23, 2019

What's the problem this feature will solve?

Pip allows installing from VCS URL references such as git+https://github.com/org/repo@{ref}#egg=name.

When {ref} points to an immutable revision (ie a commit hash / sha in case of git),
it would help performance to cache the wheel that was built from that reference,
and subsequently reuse it.

Describe the solution you'd like

I'm willing to implement the PR myself.

My plan is to add a is_immutable(cls, revision/link) to the VersionControl class.
In the case of git it would check revision looks like a sha ('^[a-fA-F0-9]{40}$').
Then exploit it in caching mechanisms.

Potential issues:

  • something that looks like a sha might also be a branch name which would therefore not be immutable: I believe this is not an issue in practice because nobody would name a branch
    with exactly 40 characters in a way that git could confuse with a a sha
  • in case of binary wheels (ie containing C code), the VCS URL may not contain enough information to ensure a reproducible build, as some build parameters may be environmental; in such cases the cache may trigger returning a wheel that otherwise would have to be rebuilt (eg for a different architecture): I'm not seeing any practical issue here either; if there would be a risk we can consider controlling the feature with a flag (such as --(no)-vcs-cache) [update] pip's wheel cache takes care of this as wheel names contain platform specific tags

Alternative Solutions

I currently have a pip wrapper that does something close, but I thought it would be useful to others, so I propose to implement it in pip itself.

Before attempting an implementation, I wanted to check if such a feature is of interest to pip maintainers, or if any of the potential issues would be considered blocking.

Additional context

N/A

@triage-new-issues triage-new-issues bot added the S: needs triage Issues/PRs that need to be triaged label Jun 23, 2019
@pradyunsg pradyunsg added C: cache Dealing with cache and files in it C: network connectivity type: feature request Request for a new feature and removed S: needs triage Issues/PRs that need to be triaged labels Jun 23, 2019
@pradyunsg
Copy link
Member

This sounds like a good idea to me. /cc @cjerdonek since he's most familiar with the VCS code.

@cjerdonek
Copy link
Member

something that looks like a sha might also be a branch name which would therefore not be immutable: I believe this is not an issue in practice because nobody would name a branch
with exactly 40 characters in a way that git could confuse with a a sha

This is actually the thing I'm concerned about with this (and the reason I hadn't responded to this issue yet). I feel uncomfortable about the idea of changing pip's behavior based on what the revision "looks like." This kind of approach seems brittle to me, and yes, would break edge cases where someone has or might want to name a branch that happens to have the form of a sha. I can see that capability being useful as a work-around or in testing situations, or maybe for reasons we don't know. My instinct is that pip should really have a more structured way to specify "this revision is a SHA" as opposed to relying on whether the string "looks like" something. But I haven't really given thought to what that should be since it could likely involve a new format / notation of URL, etc.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 10, 2019

Would you then be open to having an option that enables the feature? Or an option to disable it if we think it could still be on by default (in which case --no-cache-dir would maybe do the job).

@cjerdonek
Copy link
Member

I'm certainly open to having a way to specify "this revision is a SHA," but I'm not sure yet what that should be. Do you have any ideas? Having a global option that means "interpret revisions that look like SHA's as commit hashes" seems less attractive to me because it would be an option that is basically a work-around for not being able to specify revisions in a structured way (and pip already has a lot of specialized options). I do wonder if this (caching) could be leveraged with the PEP that we're working on -- because at that point, we would have the information we need in a structured way (in the JSON file). So maybe working on this after the PEP would make more sense? What do you think?

@sbidoul
Copy link
Member Author

sbidoul commented Aug 10, 2019

I've no immediate idea on how to change the pip requirements format to express that a rev is a commit id. Especially to do it in a way that would not break tools out there that interpret it.

Another way to address this concern would be to decide to cache or not after building the wheel (here?), when we have a local checkout that we can inspect to know if the rev is a commit or a branch.

The only remaining edge case would then be when a branch has the same name as an existing commit id. That, if possible at all, is exotic enough to be ignored.

This approach indeed relates closely to the PEP for #609, since to implement it we need to extract the commit id to store it in the direct url origin metadata as resolved_commit_id.

So yes, we can keep this in mind while implementing #609 and come back to it just after.

The ball is with you to get the PEP rolling :)

@cjerdonek
Copy link
Member

Also, I think there's also the possibility of doing something where the revision is cached, but only after pip inspects it to see if the revision corresponds to a branch, tag, or SHA, etc. So the difference in perspective (e.g. from your PR I looked at) would be: when a revision is encountered, we don't ask "does this revision correspond to a commit hash because it looks like a SHA," but rather "is this revision cached" and use its presence in the cache to know if it corresponded to a commit hash. And seeing if it looks like a hash would basically be a check / optimization to know whether it has the possibility of being in the cache.

@cjerdonek
Copy link
Member

Another way to address this concern would be to decide to cache or not after building the wheel (here?), when we have a local checkout that we can inspect to know if the rev is a commit or a branch.

Yes, what I just typed up is I think the same as what you're suggesting, though the way I would phrase it is "decide to cache after cloning the URL and inspecting the refs and SHA, etc. to see if the revision was a commit hash." pip already has this logic (except possibly for the last step of checking whether the revision matches the SHA). I do think we should keep pip's behavior of preferring the branch name over the SHA if there is a collision (since that is what Git does). So it would only be cached if the revision is a SHA that doesn't match a branch or tag name, etc.

@sbidoul
Copy link
Member Author

sbidoul commented Aug 10, 2019

By the way, when this (vcs caching) and #609 (vcs freeze) are implemented, it will make --editable a much less important feature for many.

@pradyunsg
Copy link
Member

#609 is probably more inline with something that we should look into adding into Pipfile, if it isn't already included.

@pfmoore
Copy link
Member

pfmoore commented Aug 10, 2019

The ball is with you to get the PEP rolling :)

From what I recall of the discussion, @cjerdonek had offered to sponsor this PEP (I'm not 100% sure, because the sponsor details haven't been added to the PEP yet) so he'll be better able to help you navigate the process, but I think that, as PEP author, the ball is really with you :-)

From a quick skim to remind myself of where the discussion is at, I think that the debate on Discourse has died down. I'm not aware of any actions people are waiting on, so I think what you should probably do now is summarise the position, and list out what you think the next steps are, and how you think they should be progressed (remembering that this is all being done on volunteer time, so it's important to allow for people simply not having the free time to work on this).

In my experience, PEPs do need a lot of work to "manage" the process, as otherwise they tend to die through lack of interest, rather than any inherent flaw in the proposal.

@cjerdonek
Copy link
Member

@pfmoore He was actually giving me a friendly nudge because he's waiting on an email from me. :) I'm now co-authoring the PEP with him, and he was awaiting a reply from me confirming it's ready for submission. The list should see an update within the next couple days or so once I get to replying to his email..

@pfmoore
Copy link
Member

pfmoore commented Aug 10, 2019

Oh excellent - sorry for butting in in that case :-) The problems of things happening in a mixture of public and private forums.

Sorry @sbidoul for the unwarranted assumption on my part.

@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Dec 4, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Dec 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
auto-locked Outdated issues that have been locked by automation C: cache Dealing with cache and files in it C: network connectivity type: feature request Request for a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants