Expand Self-Hosted Telemetry With Customer ID #10183

Closed
lucasvaltl opened this issue May 23, 2022 · 13 comments · Fixed by #10629
Labels: needs visual design · team: delivery (Issue belongs to the self-hosted team)

Comments

@lucasvaltl
Contributor

lucasvaltl commented May 23, 2022

Is your feature request related to a problem? Please describe

Currently, all of the data we receive from each Self-Hosted Gitpod instance (assuming the user did not opt out) is anonymised. For context, the only data we currently receive is aggregated: Gitpod version, number of users, number of instances, and number of workspaces. This means that although we can extract general insights from this data, we cannot help individual customers specifically, because we have no way to identify them.

Describe the behaviour you'd like

  • Add the capability to send the license ID or some other form of unique identifier of the owner of a given self-hosted instance with each data package
  • Allow users to specifically opt out of sending this identifiable info (@gtsiolis I think it would be great to get your design take on such a two-layer opt-out)
  • Make sure we communicate honestly to users that we are doing this and why we are doing it. Explain to them how they can opt out.
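
A minimal Go sketch of the two-layer opt-out described in the list above, to make the intended behaviour concrete. Every name here (`TelemetryConfig`, `Payload`, `BuildPayload`, the JSON keys, and the sample values) is hypothetical and not taken from the Gitpod codebase; it only assumes that one switch disables telemetry entirely and a second switch controls whether the identifier is attached.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TelemetryConfig models the two opt-out layers (hypothetical names).
type TelemetryConfig struct {
	Enabled        bool // layer 1: send any telemetry at all
	SendCustomerID bool // layer 2: attach the identifying customer ID
}

// Payload holds the aggregate fields listed in the issue plus the
// optional identifier. The JSON keys are assumptions for illustration.
type Payload struct {
	Version    string `json:"version"`
	Users      int    `json:"totalUsers"`
	Instances  int    `json:"totalInstances"`
	Workspaces int    `json:"totalWorkspaces"`
	CustomerID string `json:"customerID,omitempty"`
}

// BuildPayload returns nil when telemetry is disabled. Otherwise it
// returns a copy of stats that carries the customer ID only if the
// second layer is enabled as well.
func BuildPayload(cfg TelemetryConfig, stats Payload, customerID string) *Payload {
	if !cfg.Enabled {
		return nil
	}
	p := stats
	p.CustomerID = ""
	if cfg.SendCustomerID {
		p.CustomerID = customerID
	}
	return &p
}

func main() {
	// Sample values are made up for the example.
	stats := Payload{Version: "example-version", Users: 12, Instances: 1, Workspaces: 40}
	cfg := TelemetryConfig{Enabled: true, SendCustomerID: false}

	out, err := json.Marshal(BuildPayload(cfg, stats, "customer-123"))
	if err != nil {
		panic(err)
	}
	// With SendCustomerID false, the customerID key is omitted entirely.
	fmt.Println(string(out))
}
```

Because the identifier field uses `omitempty`, an instance that opts out of the second layer sends the same anonymous payload shape as today.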

Additional context

Internal slack threads (1,2)

@lucasvaltl
Contributor Author

Potential follow up - write a docs page about this.

@corneliusludmann
Contributor

Regarding sending the license ID vs. adding a custom customer field to the license:

  • Using the license ID has the advantage that we don't need to add a custom customer field value to each license. We can just use the existing license ID and product-wise we are there.
  • On the analytics side, license ID makes it a little bit harder because we need to integrate an additional data source (Replicated vendor API) and have to deal with changing license IDs for the same customer (where a custom customer ID would remain the same forever).
  • With a custom customer field we would have more control over it (not sure if we need it, though). E.g. we could obtain consent to track with a customer ID outside of the product (e.g. by signing the papers) and add the customer ID only for those customers that are fine with that. All other customers will end up with a blank field.
  • A custom customer field would work the same for Replicated and the legacy Gitpod license (used by the installer). With the license ID we are more dependent on Replicated. (IMO not a big issue, though.)

In my opinion, both a custom field and the license ID would work. The license ID is already there; with a custom field we have more flexibility. Technically, I would lean toward a custom field, but at the end of the day the customer success team has to manage the custom IDs if we decide on this, and I don't know how practical that is for them.

(copied from internal discussion)

@corneliusludmann corneliusludmann added the team: delivery Issue belongs to the self-hosted team label May 23, 2022
@lucasvaltl
Contributor Author

@jakobhero what would be your preferred implementation here? Might be easiest for us to jump on a call to quickly discuss the options here :)

@lucasvaltl
Contributor Author

Quick FYI that @MircoatGitpod gave his 👍 from a GDPR and privacy policy perspective.

@jakobhero
Contributor

> @jakobhero what would be your preferred implementation here? Might be easiest for us to jump on a call to quickly discuss the options here :)

hi @lucasvaltl, thanks for bringing this to my attention! using the license ID seems like a reasonable path here that i would assume would be perceived as the least intrusive from a customer's perspective. here are some thoughts for the implementation:

  • user attributes are typically tracked as traits with segment via the identify method, see here
  • segment follows an identity resolution process that accepts two forms of ID: a user_id of "known" users and an anonymous_id of users that have yet to be identified. here, it would make sense to use the ID that we have used so far as anonymous_id and then resolve the anonymous identity whenever we know the license ID of a user by calling identify(anonymousId: <old_id>, userId: <license_id>), but i'm happy to discuss whether this is the best approach
  • i don't know how it works with replicated, the effort required here would be dictated by whether replicated has webhooks that can be used to send the license information: if yes, i am happy to write a custom source through segment's function feature which would take me 1-2 hours. if not, we would have to run a job that requests all license data periodically. my preferred method here would be to use GCP's cloud composer, a managed apache airflow service, which is the most prevalent platform used for data engineering workflows outside of the customer data platform (i.e. segment). we are currently not using it yet so the setup and onboarding would take a little bit of time, though it's a technology that is only going to be more relevant for us in the coming months for enrichment of enterprise sales data
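
The identity-resolution idea in the list above can be sketched as follows. This is only an illustration in Go that builds the JSON body of a Segment identify call without sending it; the `anonymousId`, `userId`, and `traits` keys are Segment's common identify fields, while `IdentifyCall`, `NewIdentify`, and the `licenseId` trait name are hypothetical and not taken from any existing tracking code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// IdentifyCall mirrors the body fields of a Segment identify call
// that matter for linking an anonymous ID to a known user ID.
type IdentifyCall struct {
	AnonymousID string                 `json:"anonymousId,omitempty"`
	UserID      string                 `json:"userId,omitempty"`
	Traits      map[string]interface{} `json:"traits,omitempty"`
}

// NewIdentify links the ID used so far (sent as anonymousId) to the
// license ID (sent as userId). At least one of the two must be set.
func NewIdentify(oldID, licenseID string) (IdentifyCall, error) {
	if oldID == "" && licenseID == "" {
		return IdentifyCall{}, errors.New("identify needs an anonymousId or a userId")
	}
	call := IdentifyCall{AnonymousID: oldID, UserID: licenseID}
	if licenseID != "" {
		// Hypothetical trait name, used here only for illustration.
		call.Traits = map[string]interface{}{"licenseId": licenseID}
	}
	return call, nil
}

func main() {
	call, err := NewIdentify("instance-abc", "license-123")
	if err != nil {
		panic(err)
	}
	body, err := json.Marshal(call)
	if err != nil {
		panic(err)
	}
	// This body would be sent to the tracking endpoint by the caller.
	fmt.Println(string(body))
}
```

Instances that have not been identified yet would keep sending only the anonymous ID, so existing data stays attributable once the two IDs are linked.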

i would also be happy with the custom customer field approach: tracking-wise, it would not pose any issues, but i cannot comment on how much effort it would require to make the corresponding changes in the product so that the customer data is available when the tracking call is made, or on whether it would make sense to have it available at that time. happy to jump on a quick call to find the optimal solution anytime!

@lucasvaltl
Contributor Author

Thanks for the very detailed analysis @jakobhero! I've checked with our customer experience team, and the added effort here seems negligible. @corneliusludmann what do you think is the implementation cost/complexity of the custom field approach?

@corneliusludmann
Contributor

Technically, there is no difference in the implementation whether we send the license ID or a custom field via telemetry.

@corneliusludmann
Contributor

> the effort required here would be dictated by whether replicated has webhooks

There is no webhook support.

@lucasvaltl
Contributor Author

Understood! Thanks for the estimate @corneliusludmann. I also spoke with Julia and she is ok with the added effort of filling in the custom field when creating a license. Given the need to do extra work on the data side (scrape replicated's API) & just the added effort of having to use a secondary table to find customer names whenever this data is used, I think it is net beneficial to go with the custom field :) Please lmk if you disagree!

@adrienthebo
Contributor

adrienthebo commented Jun 14, 2022

Copying this conversation from a meeting with @mrsimonemms;

Creating and using a custom Customer ID field is a bit more involved than originally anticipated due to limitations in the Gitpod legacy licenses. The legacy license payload is a struct that doesn't support arbitrary key/value pairs, and we've built the Replicated license on top of the Gitpod license struct.

Adding either a CustomerId field or arbitrary key/value section to the license field is possible, but we'll need to wire these fields through the license infrastructure in a backwards compatible fashion. This will likely mean touching many of the locations where we interact with license fields.

@mrsimonemms did this adequately cover our conversation?

@corneliusludmann
Contributor

> Adding either a CustomerId field or arbitrary key/value section to the license field is possible, but we'll need to wire these fields through the license infrastructure in a backwards compatible fashion. This will likely mean touching many of the locations where we interact with license fields.

That sounds so serious. What is the issue with adding a new field

CustomerID string `json:"string"`

to the struct that is empty by default (and will be set for Replicated licenses only)? What kind of backwards compatibility issue are you worried about?
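
To illustrate why a single optional field should be backwards compatible, here is a small Go sketch. `LicensePayload` is a simplified, hypothetical stand-in for the real license struct, and the `customerID` JSON key is an assumption made for this example; the point is only that `encoding/json` leaves a missing field at its zero value, so legacy payloads still parse.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LicensePayload is a simplified stand-in for the license struct.
// Field names and JSON keys are assumptions for illustration.
type LicensePayload struct {
	ID     string `json:"id"`
	Domain string `json:"domain"`
	Seats  int    `json:"seats"`
	// CustomerID is empty by default and omitted when marshalled empty,
	// so payloads written before the field existed keep their shape.
	CustomerID string `json:"customerID,omitempty"`
}

// ParseLicense decodes a license payload; a missing customerID key
// simply leaves CustomerID as the empty string.
func ParseLicense(raw []byte) (LicensePayload, error) {
	var lp LicensePayload
	err := json.Unmarshal(raw, &lp)
	return lp, err
}

func main() {
	legacy := []byte(`{"id":"abc","domain":"gitpod.example.com","seats":10}`)
	lp, err := ParseLicense(legacy)
	if err != nil {
		panic(err)
	}
	fmt.Printf("legacy license parsed, CustomerID=%q\n", lp.CustomerID)
}
```

Code that reads the field only needs to treat the empty string as "no customer ID", which matches the suggestion that it is set for Replicated licenses only.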

@adrienthebo
Contributor

Adding that single field will definitely work and is likely the fastest solution. My question is how flexible we need to make this: does it make sense to support arbitrary key/value pairs to reflect the flexibility of Replicated licensing?

Though, thinking on the shipping skateboard principle, adding a CustomerID field will be the fastest route to value. I can proceed with that approach!

@corneliusludmann
Contributor

Ah, now I see your point. I think we don't need that flexibility, and adding the single field is fine. 👍

@adrienthebo adrienthebo changed the title from "Expand Self-Hosted Telemetry With License ID" to "Expand Self-Hosted Telemetry With Customer ID" Jun 17, 2022
Repository owner moved this from ⚒In Progress to ✨Done in 🚚 Security, Infrastructure, and Delivery Team (SID) Jun 24, 2022