Add CE to Eng handover docs #1521

Merged
merged 5 commits into from
Sep 30, 2020

Member

@dadlerj dadlerj commented Sep 4, 2020

Given the recent discussion, I wanted to get this written down. This is a WIP document, and will be living for some time, especially as more clarity comes to the roles and responsibilities of the CE team. However, I thought it would be beneficial to document the "current state", even if it's not long-lived.

I only added one "new" thing to the process here: a priority tagging system that CE can use to communicate to Eng. Curious to hear your thoughts.

CC @nicksnyder @christinelovett @tistru for visibility


**Engineering should only feel a responsibility to get involved if tagged in by CE.**

**However, once someone is "assigned" to a ticket (whether formally, or they informally take over the conversation), it is up to them to either (1) see the issue through to resolution or (2) assign a new owner.**
Contributor

Can we change the wording so it is clear that if someone "informally" takes over the conversation, then they should be formally assigned the ticket? I want to avoid grey areas on who is responsible.

1. If the issue is clearly a bug or a feature request (rather than a question that can be clarified or answered on the spot), [the CE will file or add on to a GitHub issue](customer_issues.md).
1. The CE will add a prioritization label to the issue, from `user/p0` to `user/p4`, based on a combination of (1) the severity of the issue, and (2) the prioritization of the reporting company. These labels mean the following:
1. `user/p0`: The issue results in the company's Sourcegraph instance being unusable and the company is a [Tier 1 prospect or customer](../sales/index.md#segmentation).
1. `user/p1`: The issue results in partial loss of functionality or serious disruption and the company is a [Tier 1 or Tier 2 prospect or customer](../sales/index.md#segmentation).
Contributor

@nicksnyder nicksnyder Sep 4, 2020

So a tier 2 customer (and below) can never cause a p0?

Member Author

Correct
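
As an illustration only, the two label definitions quoted above, together with the confirmation that Tier 2 and below never produce a p0, could be sketched in Python as follows. The `Severity` enum, the `priority_label` function, and the integer `tier` argument are hypothetical names. Mapping an unusable instance at a Tier 2 company to `user/p1` is an assumption, and `user/p2` through `user/p4` are left out because their definitions do not appear in this thread.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    """Hypothetical severity scale; a higher value means more severe."""
    OTHER = 0         # anything below the two levels the definitions name
    PARTIAL_LOSS = 1  # partial loss of functionality or serious disruption
    UNUSABLE = 2      # the company's Sourcegraph instance is unusable


def priority_label(severity: Severity, tier: int) -> Optional[str]:
    """Return the user/pN label for the two rules quoted in this thread.

    Returns None when neither rule applies, because the definitions of
    user/p2 through user/p4 are not shown here.
    """
    # p0: unusable instance AND a Tier 1 prospect or customer.
    if severity == Severity.UNUSABLE and tier == 1:
        return "user/p0"
    # Assumption: an unusable instance counts as at least a serious
    # disruption, so a Tier 2 company in that state falls through to p1.
    if severity >= Severity.PARTIAL_LOSS and tier in (1, 2):
        return "user/p1"
    return None


print(priority_label(Severity.UNUSABLE, 1))
print(priority_label(Severity.UNUSABLE, 2))
print(priority_label(Severity.PARTIAL_LOSS, 3))
```

Keeping severity and tier as separate inputs mirrors the wording of the definitions, which combine the severity of the issue with the prioritization of the reporting company.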

Member Author

This is just a quick and dirty set of definitions though, and Julia needs to weigh in. The definition of what the eng team is supposed to do when a p0 lands is also not clear here, so Gonza should weigh in on that side! This is just a first attempt at providing guidance to new CE team members based on how I internally think about communicating prioritization currently.

Contributor

I believe this list is an extension or update to the one described here https://github.com/sourcegraph/about/blob/main/handbook/ce/support.md#slas.

Here is a draft idea on how we could handle this, I would PR this to incident document:

  • p0: All hands on deck, notify on #dev-ops. The incident is the highest priority for all engineering teams, who should drop any other tasks to work on the incident as required by the incident owner (not all at once).
  • p1: An engineer should prioritize this over all other tasks and transfer to other teams as required.
  • p2: We will resolve this on a best-effort basis while we continue to work on our planned release.
  • p3: We should consider prioritizing this over other planned work, or schedule it for our next iteration.
  • p4: We will evaluate this along with our other features to be planned in a future release.

I would also suggest extending incident response and including some ideas from Google and PagerDuty.

Particularly the different roles (PagerDuty, Google) for p0 incidents and the idea I mentioned about having an /on-call teamX, which I found perfectly described here.

The incident response page by PagerDuty is a great resource for general ideas and guidelines.

Contributor

This is a great and necessary convo, and the goal of this initial PR is to document how we do this today. So for now I support @dadlerj's documented process as a start, since this is how he's been operating. A clear next step is to iterate on this based on @pecigonzalo's suggestions.

Member Author

Thanks Julia! Agreed with that approach.

To be transparent, I'm not directly using this methodology yet, but when I started to document my process, this is roughly how I think about my own method for communicating prioritization of issues. This just adds an official tag onto it (versus my handling in a one-off way). I'll merge this now and get feedback from Tion and Christine once it's live.

@pecigonzalo thanks for the feedback. A few notes:

> I believe this list is an extension or update to the one described here https://github.com/sourcegraph/about/blob/main/handbook/ce/support.md#slas.

It's actually quite different... That page reflects the promises we make to customers—i.e., the minimum service level required per our contract—while this list reflects our internal prioritization. There clearly is an intimate connection between the two, but they're not quite the same in terms of what they specify and how we want to describe them (and never quite will be).

> Here is a draft idea on how we could handle this, I would PR this to incident document:

This is a great conversation—"what do we do about an issue that was described as a given level of seriousness by CE"—but it is slightly different from what this doc is adding. I'd like to start with our definition of how serious an issue is, and you (and Julia and others of course) can nail down the translation of this into the "so what".

Contributor

It's actually quite different... That page reflects the promises we make to customers—i.e., the minimum service level required per our contract—while this list reflects our internal prioritization. There clearly is an intimate connection between the two, but they're not quite the same in terms of what they specify and how we want to describe them (and never quite will be).

I disagree: they are not that different, given that the agreed response and resolution times are directly related to the prioritization communicated as ce/pX and to how we respond to that priority internally. The intended audience is different, but even the description wording on both is quite similar, only more detailed in this document because we have more levels.
We could provide this by having a single matrix, a linked relationship between the two, or just the same name across both pages, with each page holding different information to match its intended audience.

@dadlerj dadlerj mentioned this pull request Sep 4, 2020
@sqs sqs changed the base branch from master to main September 5, 2020 04:37
Contributor

@pecigonzalo pecigonzalo left a comment

Thanks for taking this initial draft Dan. I will start working on the incident response updates and backlink here from the PR.

Exceptions to the principles above:

- This does not apply to our public GitHub issue tracker; instead, it only applies to [official support channels](support.md). Issues filed in GitHub are the responsibility of Engineering and/or Product.
- Certain customers pay for dedicated support from a member of the Engineering team. Responding to issues filed by these customers is a shared responsibility for the assigned Engineer and CE (whoever sees it first).
Contributor

We should add a link to the document that outlines these customers.


- This does not apply to our public GitHub issue tracker; instead, it only applies to [official support channels](support.md). Issues filed in GitHub are the responsibility of Engineering and/or Product.
- Certain customers pay for dedicated support from a member of the Engineering team. Responding to issues filed by these customers is a shared responsibility for the assigned Engineer and CE (whoever sees it first).
- If an engineer sees a new question or issue come in from a company that they've already been introduced to, or if the question is in their direct area of expertise, they are encouraged to jump in directly.
Contributor

An engineer should always check in with a CE before replying if it's not a conversation they're already having (but a new thread or ticket). If it's an existing thread the engineer should be the one to triage and either continue responding or let the CE owner know that despite it being the same ticket or thread, it's a new request, and they'd like to hand it back over to the CE for ownership.


## Engineering responsibilities

1. If an Engineer agrees to take on an issue or a ticket, they must be willing to follow through on the problem until it is addressed. If they are not willing or able to do so, they must notify the CE as soon as possible so someone else can be assigned.
Contributor

Or alternatively they can find someone else to tap in, and let the CE know.

@dadlerj dadlerj merged commit 8605d07 into main Sep 30, 2020
@dadlerj dadlerj deleted the ce-eng branch September 30, 2020 05:36
Contributor

tistru commented Sep 30, 2020

Hi @dadlerj. Great work putting this together, and I believe it will be really helpful. I didn't have any feedback on the current version of the document. I understand we will iterate on it going forward, but I am perfectly fine to operate under these guidelines going forward :)
