Track (anonymized) counts of repos using Rover #313

Closed
ndintenfass opened this issue Feb 25, 2021 · 6 comments · Fixed by #461
Labels
feature 🎉 new commands, flags, functionality, and improved error messages needs decision 🤝
Milestone

Comments

@ndintenfass
Contributor

ndintenfass commented Feb 25, 2021

Our telemetry today sends anonymized usage info, so we can track which commands are used, with an opaque ID for each install of Rover. However, because Rover often runs in CI environments, the number of install IDs isn't an accurate representation of how many projects are using Rover. A per-project count is useful for Apollo because it lets us track adoption rates and see how often Rover is used.

The proposed addition is to create an anonymized hash of the URL for the repo as part of the telemetry payload, storing this data in our data warehouse associated with each invocation.

A working assumption is that we could use a hash of the origin URL of the git repo local to the invocation of rover. We'll likely want to add some kind of extra characters to it to avoid being able to use brute force approaches to identify the specific repo. One concern raised is that some CI systems generate special origin URLs that may contain credentials, though if we're creating an opaque hash we should be safe to use even such URLs.
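A minimal sketch of the hashing idea above, assuming SHA-256 and a fixed salt prepended to the URL. The issue only says "extra characters" would be added, so the `SALT` value and the `hash_origin_url` name are hypothetical, not what Rover ships:

```python
import hashlib

# Hypothetical salt: the issue only mentions adding "extra characters".
SALT = b"example-telemetry-salt"


def hash_origin_url(origin_url: str, salt: bytes = SALT) -> str:
    """Return an opaque hex digest for a git origin URL.

    The raw URL (including any credentials a CI system embedded in it)
    never leaves this function; only the digest would be sent.
    """
    digest = hashlib.sha256()
    digest.update(salt)
    digest.update(origin_url.strip().encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    print(hash_origin_url("https://github.com/apollographql/rover.git"))
```

Because the digest is deterministic, every invocation from the same origin URL produces the same value, which is what makes counting distinct repos possible.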

We presume this won't be an exact science: the same git repo will sometimes end up with differently shaped origin URLs. If we can think of a better way to recognize that a given repo is the same repo across many CI runs and many developers' local environments, that would be fine too. Part of the design needed here is to vet that our approach will be a reasonably good approximation, not a flawless enumeration.
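One possible way to reduce the "differently shaped origin URLs" problem is to normalize the URL before hashing it. The sketch below is an assumption, not something proposed in the thread: `normalize_origin` is a hypothetical helper that drops the scheme, any user or credential part, and a trailing `.git`, so the SSH and HTTPS forms of the same repo compare equal.

```python
import re
from urllib.parse import urlsplit


def normalize_origin(url: str) -> str:
    """Reduce a git remote URL to 'host/path' (hypothetical helper)."""
    url = url.strip()
    # scp-like syntax has no scheme: git@github.com:owner/repo.git
    scp = None
    if "://" not in url:
        scp = re.match(r"^(?:[^@/]+@)?([^:/]+):(.+)$", url)
    if scp:
        host, path = scp.group(1), scp.group(2)
    else:
        parts = urlsplit(url)
        host = parts.hostname or ""  # hostname excludes user:password@
        path = parts.path
    path = path.strip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"


if __name__ == "__main__":
    for candidate in (
        "git@github.com:apollographql/rover.git",
        "https://github.com/apollographql/rover",
        "https://x-access-token:secret@github.com/apollographql/rover.git",
    ):
        print(normalize_origin(candidate))
```

This remains an approximation: mirrors, forks, and renamed repos would still hash differently, which matches the "not a flawless enumeration" goal.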

As with our existing telemetry, users would be able to opt out by not sending any telemetry at all.

The result of this work should be that we can reason about how many distinct projects are using Rover, even if the invocations for a project take place both locally on many developers' machines and in various automated pipelines. As a side effect, we should also be able to make reasonable estimates of how many devs are using Rover locally per project, because we'll have the anonymized ID, an indication of whether the invocation is in CI, and the anonymized representation of the repo.
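To illustrate how those three fields could combine, here is a rough sketch of the aggregation. The `Invocation` record and its field names are hypothetical stand-ins for whatever the telemetry payload and warehouse schema actually use:

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class Invocation(NamedTuple):
    """Hypothetical shape of one telemetry record."""
    machine_id: str           # opaque per-install ID
    repo_hash: Optional[str]  # hashed origin URL, None outside a git repo
    is_ci: bool               # whether the invocation ran in CI


def local_devs_per_project(invocations: Iterable[Invocation]) -> Dict[str, int]:
    """Map each repo hash to the number of distinct non-CI install IDs."""
    devs = defaultdict(set)
    for inv in invocations:
        if inv.repo_hash is None:
            continue
        ids = devs[inv.repo_hash]  # registers CI-only projects with 0 devs
        if not inv.is_ci:
            ids.add(inv.machine_id)
    return {repo: len(ids) for repo, ids in devs.items()}


if __name__ == "__main__":
    events = [
        Invocation("m1", "repoA", False),
        Invocation("m2", "repoA", False),
        Invocation("ci-1", "repoA", True),
        Invocation("ci-2", "repoB", True),
    ]
    summary = local_devs_per_project(events)
    print(len(summary), "distinct projects:", summary)
```

The number of keys estimates distinct projects, and each value estimates local developers for that project. CI invocations count toward the project but not toward its developer total.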

@ndintenfass ndintenfass added feature 🎉 new commands, flags, functionality, and improved error messages needs decision 🤝 labels Feb 25, 2021
@ndintenfass ndintenfass added this to the March 30 - GA milestone Feb 25, 2021
@EverlastingBugstopper
Contributor

EverlastingBugstopper commented Feb 25, 2021

Do you think this should replace the current working directory hash that we currently send, or be an additional piece of data?

@ndintenfass
Contributor Author

I think an additional one would be good. (lazy question) Do we document the telemetry we send, so people can feel confident we're not trying to be sneaky? Might be worth rationalizing the purpose of working directory as part of this, as it's not as useful but may end up being useful to de-dupe telemetry, for instance.

@EverlastingBugstopper
Contributor

> I think an additional one would be good.

Sounds good to me! Shouldn't be a heavy lift.

> Do we document the telemetry we send, so people can feel confident we're not trying to be sneaky?

yes!

@lrlna lrlna modified the milestones: March 30 - GA, May 4 Mar 12, 2021
@lrlna lrlna modified the milestones: May 11, April 27 Apr 8, 2021
@EverlastingBugstopper
Contributor

hey @ndintenfass - should we just special case origin as the remote? or do we want to like, concatenate all of the remotes into one and then hash that? the lib i'm using requires that i specify a name for the remote. fine to just say origin and ignore the rest but want to make sure that's an ok assumption

@ndintenfass
Contributor Author

@EverlastingBugstopper yes, I think we can special-case origin for the sake of this. That should be close enough to what we want to know.
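For illustration, special-casing `origin` could look like the sketch below. It shells out to the `git` CLI rather than using the library mentioned in the thread, and the `origin_url` helper name is hypothetical. It returns `None` when there is no such remote, the directory is not a repo, or `git` is not installed, so telemetry could simply omit the field in those cases.

```python
import subprocess
from typing import Optional


def origin_url(repo_dir: str = ".", remote: str = "origin") -> Optional[str]:
    """Return the URL of one named remote, ignoring all others."""
    try:
        result = subprocess.run(
            ["git", "-C", repo_dir, "config", "--get", f"remote.{remote}.url"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed or could not be executed
        return None
    url = result.stdout.strip()
    if result.returncode != 0 or not url:
        return None
    return url


if __name__ == "__main__":
    print(origin_url())
```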

3 participants