Differentiating organic vs automated installations #5499
Comments
Note that because of pypi/linehaul#30, the numbers on Google BigQuery may already be meaningless for answering some questions. |
I agree. It would definitely be useful to have separation of automated vs direct usage. Maybe pypa/packaging-problems would be a good place for it? |
@mhsmith like Nathaniel pointed out, the lossage should be fairly uniform, so I think the numbers would still be somewhat representative, if we were collecting them on top of the leaky linehaul, that is :) @pradyunsg Glad you agree! Given that I suspect (and suggested) a straightforward |
I think the fundamental problem here is that you can't actually detect this reasonably. For instance, if someone manually runs a bash script (or even a tox command), we'd probably want that not to be flagged as automated, but by default those things will not have a tty. On the flip side, you have things like Travis CI, which I believe mimics a tty, so Travis CI will look like a manual install instead of an automated one. On a theoretical level, I don't have any problem with the idea; I just have never been able to think of a good way of actually differentiating the types of uses automatically. |
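For reference, the tty check being discussed could look roughly like the sketch below. The helper name `looks_interactive` is hypothetical and not part of pip; it only reports whether a stream is attached to a terminal, which, as noted above, is not the same thing as "a person typed this command".

```python
import sys


def looks_interactive(stream=None):
    """Return True if the given stream (default: sys.stdin) is a terminal.

    Hypothetical helper for illustration only. A False result covers
    scripts, pipes, and tox runs; a True result can still come from a CI
    system that allocates a pseudo-terminal.
    """
    stream = sys.stdin if stream is None else stream
    try:
        return bool(stream.isatty())
    except (AttributeError, ValueError):
        # Missing stdin (None), objects without isatty(), or closed streams.
        return False
```

Piping input (for example `echo | python script.py`) makes this return False even when a person started the command by hand, which is exactly the ambiguity described in the comment above.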
If we want to detect running under CI, I think that's actually fairly easy, because CI systems tend to advertise that fact in the environment. Just checking for Or if you want to get fancier, it looks like the ci-info package (2.5 million weekly downloads) has a fairly comprehensive list of envvars to check for: https://github.com/watson/ci-info/blob/master/index.js |
See https://github.com/The-Compiler/pytest-vw for a Python project that can detect CI. |
Yeah, it isn't difficult to detect whether you're running in a CI on most CI services, or for that matter even which one you're running on. We likely still won't know what percentage of the non-CI runs are not automated, but having a separation between CI and non-CI is a good start. I don't know if we'd want to have any distinction between various CI services (logging NULL if we don't have the information, otherwise a string like "travis" representing the service). |
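A minimal sketch of the "NULL or a service string" idea, assuming a small hand-picked table of environment variables. The table and the function name `detect_ci_service` are illustrative assumptions, not something pip ships; the variables listed are ones the respective services are known to set, and a real list would be longer (see the ci-info link above).

```python
import os

# Illustrative, deliberately incomplete mapping: environment variable -> label.
# The labels are made up for this sketch.
_CI_SERVICE_VARS = (
    ("TRAVIS", "travis"),
    ("CIRCLECI", "circleci"),
    ("APPVEYOR", "appveyor"),
    ("GITLAB_CI", "gitlab"),
)


def detect_ci_service(environ=None):
    """Return a short service label, or None when no known service is found."""
    environ = os.environ if environ is None else environ
    for variable, label in _CI_SERVICE_VARS:
        if environ.get(variable):
            return label
    return None
```

Passing the environment in as a mapping keeps the function easy to test; called with no argument it reads `os.environ`.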
I posted #6273 to start addressing this. |
…-agent Fix #5499: Include in pip's User-Agent whether it looks like pip is in CI
I'm going to leave this open for now as opposed to auto-closing for the purposes of discussing whether an additional key-value should be added to store the value of |
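To make the User-Agent idea concrete, here is a rough sketch of how a CI flag could be carried in a JSON payload appended to the agent string. This is not pip's actual code: the function name `build_user_agent`, the exact payload shape, and the choice to check only the `CI` variable are assumptions made for illustration.

```python
import json
import os


def build_user_agent(version, environ=None):
    """Build a 'pip/<version> <json>' style string with a "ci" key.

    Sketch only: "ci" is True when the CI variable is present and
    None (JSON null) otherwise, so "not detected" is not reported as False.
    """
    environ = os.environ if environ is None else environ
    data = {
        "installer": {"name": "pip", "version": version},
        "ci": True if "CI" in environ else None,
    }
    payload = json.dumps(data, separators=(",", ":"), sort_keys=True)
    return "pip/{} {}".format(version, payload)
```

Using null rather than false reflects the point made earlier in the thread: the absence of a CI marker does not show that the install was manual.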
FWIW, I pinged on #zuul on Freenode, to see if anyone there has inputs on how to detect running within Zuul. That said, better detection of that is not a blocker in any form. |
I'd love to see a way for us to tell pip we're running in a CI. For context, Google has several custom CIs that wouldn't be detected by this code, so a flag or env var that's something like "PIP_IS_CI" would be really cool. |
@theacodes if one can set an environment variable, wouldn't setting CI=true achieve just that? Or will that have an impact on other parts of the CI? |
Yeah, it might have unintended consequences.
|
I think it'd be fine (and low maintenance) to support this. The implementation would just be a matter of adding pip/src/pip/_internal/download.py Line 80 in 5a00ac4
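A sketch of what supporting an opt-in variable could amount to, assuming the check is a simple membership test over a tuple of variable names. The tuple contents and the name `looks_like_ci` are assumptions for illustration; `PIP_IS_CI` is the name proposed earlier in this thread, and the other entries are generic variables that common CI systems set.

```python
import os

# Assumed list for this sketch; "PIP_IS_CI" is the opt-in name proposed above.
CI_ENVIRONMENT_VARIABLES = ("BUILD_BUILDID", "BUILD_ID", "CI", "PIP_IS_CI")


def looks_like_ci(environ=None):
    """Return True if any of the known CI variables is present at all."""
    environ = os.environ if environ is None else environ
    # Presence check only: the variable's value is not inspected.
    return any(name in environ for name in CI_ENVIRONMENT_VARIABLES)
```

Because this only tests for presence, `PIP_IS_CI=0` would still count as CI under this sketch; checking the value instead would be a separate design decision.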
|
@theacodes If you could file a PR for this, that'd be great! |
Is |
It seems like it should be for any automated runs, but I'm not the one using this data. Is it worth making the environment variable name more descriptive (e.g. |
Reflecting a bit more on this, to @methane's implicit point, if we're going to expose an environment variable I'm thinking it would be better to call it something like PIP_IS_AUTOMATED. That would document the intent more clearly. |
I think there are several different things we might be trying to track here.
Test vs non-test: installs for testing are "subsidiary" to "real" installs: they don't directly solve someone's problem; their purpose is just to make sure things are working for later when someone tries to use the code for its primary purpose. If you want to count how many installs are intended to use the code for its primary purpose, then you want to eliminate test installs. But if someone installs on a big fleet of production boxes, that's real usage.
Automated versus interactive: if you want to count how many people actually typed "pip install mypackage", then that's a different question, and automated installs *shouldn't* count.
In principle, maybe we should track both of these separately. More data allows you to do more :-). In practice, I don't think we have any technical mechanism to track automated vs interactive installs. Even if everyone on this thread goes off and manually updates their deployment system to set some magic envvar, I'm guessing the vast majority of automated installs *won't* set that envvar, and that will make the data really hard to interpret.
A field for "is this running in CI?" is also hard to interpret or connect to what we really want to know, like how many users our project has. But it's at least technically feasible, and it's easy to communicate what it does and doesn't mean to people trying to interpret the data.
So I'm inclined to say, let's just keep it as a CI flag for now. And we can always revisit once we see the data :-)
|
Okay, that's fine with me. And that would mean then that the answer to @methane's original question ("Is |
What's the problem this feature will solve?
Currently, pip installation statistics are aggregated to the gCloud and made available on libraries.io and pepy.tech. A lot of effort has gone into these numbers, but thanks to automation, they mean less now than they did a few years ago.
CI and other automation, combined with maybe a bit too much reliance on PyPI's central infrastructure, have inflated the download numbers and diluted the signal with noise.
Describe the solution you'd like
We could detect when pip is being used interactively (by checking if stdin is a tty or some other mechanism), and include that in the pip install request headers, to be included in the statistics generated by the server.
This would provide us with much cleaner data for highlighting actual community activity, instead of drowning in automation trends, overly favoring professionalized sectors of Python. Specifically, a library being manually installed 100 times may well indicate something much more interesting than a CI (or, unfortunately, a production) fleet installing a package 10,000 times.
Additional context
Thanks for your attention and keep up the good work!