This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Distribution 3.20 Tracking issue #12836

Closed
38 of 49 tasks
pecigonzalo opened this issue Aug 7, 2020 · 21 comments

@pecigonzalo
Contributor

pecigonzalo commented Aug 7, 2020

Plan

Support new and existing deployments

This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.

Support teams' migration to per-team alerts

We enabled per-team alerts and on-call rotations in 3.19. As teams onboard to this new workflow, we will need to provide support and guide them through the transition.

Reduce upgrade overhead

We decided to move forward with the Dhall implementation and will work on defining a roadmap for it (spike). Additionally, we will work on a "yaml-to-dhall" tool and define a Dhall architecture that supports customizations (spike).

Enable failed e2e test notifications and blocking the pipeline

Although we run the e2e tests daily, their failures are not visible to everyone, which means that at the end of the iteration we must ask teams to pitch in on short notice.
We will inline our e2e tests and notify engineers when a merge breaks them, to ensure our main branch is always in a working state.

Dogfood Kubernetes deployments

Deployments to dogfood-k8s are automated from our latest images and reflect our customer’s workflow. deploy-sourcegraph is kept up to date with our latest images.

Availability

Period is from August 20th to September 19th (22 working days). Please write the days you won't be working and the number of working days for the period.

  • Gonza: 22d

Tracked issues

@bobheadxi

  • RFC-189: follow up with distribution, cloud, code-intel, search to set up opsgenie rotations #12899
  • monitoring: remove custom alertmanager from cloud #12160
  • monitoring: align ObservableOwners with current teams #13075
  • k8s.sgdev.org: automate pull requests from deploy-sourcegraph #13121
  • bundled alertmanager does not start up correctly in some environments with clustering enabled #13079 🐛
  • cadvisor: investigate collecting IO metrics #12163
  • deploy-sourcegraph: have Renovate apply image updates as soon as they are built #13122
  • on-call: document actions to follow up on critical alerts #1468
  • dogfood-k8s: finalize migration over to new cluster #13792
  • k8s.sgdev.org: reset deploy-sourcegraph-dogfood-k8s using deploy-sourcegraph #13120

@davejrt

  • baremetal buildkite agent networking / instability issues #12996
  • Bare-metal Buildkite agents capable of running Docker and VMs #12101
  • dhall: create "base" version of records and explicitly import into templating logic #13336
  • dhall: customization architecture discussion #13335

@daxmc99

  • sourcegraph/security-issues #95

@ggilmore

  • ci: build and pin tool apks in CI for release #13297 🧶
  • dhall: create "base" version of records and explicitly import into templating logic #13336
  • dhall: customization architecture discussion #13335

@keegancsmith

  • sourcegraph/customer #96 👩

@pecigonzalo

  • Document and pin Git version requirement #13168 🎩
  • Reduce the impact of unplanned work #11904
  • k8s.sgdev.org: reset deploy-sourcegraph-dogfood-k8s using deploy-sourcegraph #13120
  • Codecov coverage checks taking a long time #13695
  • dogfood-k8s: finalize migration over to new cluster #13792

@slimsag

  • sourcegraph/customer #71 👩
  • Update all images to alpine:3.12 #13035 🧶
  • sourcegraph/customer #94 👩
  • Prevent usages of alpine with a linter #13247 🎩
  • sourcegraph/customer #90 🐛👩
  • sourcegraph/customer #74 👩
  • Run e2e tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12339
  • Run e2e "regression" tests on bare-metal Buildkite agents on every commit to master (non-blocking) #12340
  • Put plan forward: Push site admins to use Docker Compose or Kubernetes for production deployments #11828
  • License report for syntect_server & its dependencies; remove syntaxes with questionable licenses #11269 1d 👩
  • sourcegraph/customer #97 👩
  • Stephen's Dhall PoC #13798
  • sourcegraph/customer #101 👩
  • sourcegraph/customer #96 👩
  • sourcegraph/customer #95 👩
  • managed instances: overhaul / clarify / update customer-facing docs #13705 :shipit:
  • distribution: answer "Why is there not a stable or latest Docker image tag?" #1435 :shipit:

@uwedeportivo: 2.00d

  • sourcegraph/customer #95 👩
  • Repo-updater component always outputs debug logs #13191 1d 👩🎩
  • create deploy-sourcegraph to dhall record migration tool #13306 2d
  • docker compose deployment ssh configuration docs #13465
  • dhall: create "base" version of records and explicitly import into templating logic #13336
  • dhall: customization architecture discussion #13335
  • sourcegraph/customer #82 👩
  • sourcegraph/customer #96 👩

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🎩 Quality of life
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • :shipit: Pull Request
@pecigonzalo pecigonzalo added this to the 3.20 milestone Aug 7, 2020
@pecigonzalo pecigonzalo changed the title WIP: Distribution 3.20 Tracking issue Distribution 3.20 Tracking issue Aug 20, 2020
@slimsag
Member

slimsag commented Aug 29, 2020

This week:

Overall not as productive a week as I would've liked, but I'm content. Addressed debt (Alpine 3.12 update in all images, unifying base docker images, adding linters to ensure it is always used), investigated baremetal buildkite agents with Dave and settled on a final solution for running e2e and Docker tests, discussed next steps on Firecracker code intel, reviewed some RFCs, learned some Dhall to keep up to speed with it, and helped customers via https://github.com/sourcegraph/customer/issues/96 https://github.com/sourcegraph/customer/issues/99 https://github.com/sourcegraph/customer/issues/90 and had syncs/calls with https://sourcegraph.slack.com/archives/CMB6K7SMN/p1598639367073200 https://sourcegraph.slack.com/archives/CMB6K7SMN/p1598644460077400 as well as an interview and others.

Next week:

More code reviews and RFCs to review, then hopefully I'll start on my 3.20 planned work.

@uwedeportivo
Contributor

update:

worked on dhall deploy-sourcegraph. have to admit, it's a lot of fun. geoffrey and i landed the migration tool. playing with the idea of ingesting a k8s dhall schema in golang and transforming it so that lists become records with keys extracted from some specific fields in the list elements. we need this to get to these specific elements. you can access list elements by index but finding out which element you are currently looking at is tricky because dhall intentionally limits you so you are forced to compose stronger schemas (lists are weak). so far i think this is the only substantial hurdle. we also want to decide if we want to be as detailed as the current k8s schemas or more simplified. both approaches have advantages and disadvantages.
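To illustrate the list-to-record idea described above, here is a minimal sketch. It is not the actual tool: the comment describes doing this in Go over a k8s Dhall schema, whereas this uses Python over plain dictionaries to show only the shape of the transformation. The function name `list_to_record`, the `name` key field, and the container data are all hypothetical.

```python
from typing import Any, Dict, List


def list_to_record(
    items: List[Dict[str, Any]], key_field: str = "name"
) -> Dict[str, Dict[str, Any]]:
    """Turn a list of elements into a mapping keyed by one field of each element.

    Elements can then be addressed by key instead of by position in the list.
    """
    record: Dict[str, Dict[str, Any]] = {}
    for item in items:
        if key_field not in item:
            raise KeyError(f"element is missing key field {key_field!r}")
        key = item[key_field]
        if key in record:
            # Two elements with the same key cannot both become record fields.
            raise ValueError(f"duplicate key {key!r}")
        record[key] = item
    return record


# Hypothetical container list, shaped like the `containers` field of a pod spec.
containers = [
    {"name": "frontend", "image": "example/frontend:1.0"},
    {"name": "jaeger-agent", "image": "example/jaeger-agent:1.0"},
]

by_name = list_to_record(containers)

# Look up a specific element by key rather than by index.
print(by_name["jaeger-agent"]["image"])
```

With the keyed form, a customization can target `by_name["jaeger-agent"]` directly, without needing to know where that element sits in the list.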

@ggilmore
Contributor

Update:

@bobheadxi
Member

last week

Worked on a GitHub Actions-based workflow for driving updates from deploy-sourcegraph to deploy-sourcegraph-k8s-dogfood-2 (including feedback on failure), which is pretty much complete. Worked with Gonza on identifying relevant changes for the original dogfooding environment. Attempted to update k8s.sgdev.org, but was unable to get the deploy to work; will probably abandon the effort in favour of deploying the new dogfood config. Set up PRs for the remaining issues in the dogfooding project (https://github.com/sourcegraph/sourcegraph/pull/13449 , https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3304 ) but this requires a bit more work / validation. Spent a lot of time going over Renovate docs and trying to figure out how best to keep deploy-sourcegraph up to date. Looked into various alerts that have been coming through #opsgenie and tried to follow up on some of them. Looked at Dhall stuff too, as Geoffrey noted.

this week

I've been waiting for an update to come in to deploy-sourcegraph to try out the test PR to mark it as done, but I'll figure out a good way to just go ahead and verify that this week and mark it as complete. Will also be trying to wrap up work on the new dogfood environment (including deploying it). I would like to write some docs for sourcegraph/about#1468 as well, since following up on the new alerts has been a bit of a topic recently, and possibly treating this as my "debt" ticket.

I am also transitioning to a new part-time schedule

@davejrt
Contributor

davejrt commented Aug 31, 2020

Last week

Worked on closing out #12101 and resolving the subsequent issues with virtualbox running on GCP. The decision to use the vagrant-google plugin has been made, as it's a lot more reliable, requires fewer resources than local testing and is a lot quicker.

Met with Geoffrey and began proper onboarding to work on Dhall. Some additional work to upgrade the bigdata cluster and assist $CUSTOMER with an upgrade.

This week

Working on any last-minute feedback for sourcegraph/deploy-sourcegraph-docker#141 and beginning work on Dhall with the help of Uwe and Geoffrey. This will initially be a smaller project, then on to more assigned issues as delegated by Geoffrey.

@pecigonzalo
Contributor Author

Last week

Last week was mostly a technical week; I worked through bugs, issues and alerts to try to reduce the number of events we receive daily. We identified a problem with some searches used in saved searches, which caused them to scan all repositories.
I have also worked with @bobheadxi on the new dogfood instances and will continue that work this week.

This week

I'll continue to work on how to improve our API for support requests as well as how we prioritize them to ensure we are working effectively. On a related task, I would like to define how we prioritize the backlog, which relates to this as it affects how we revisit tasks/issues that were not set as high priority.
I have also started to work on a draft of our long-term team objectives and how we scale the team as we grow.

Team update

We have closed the per-team alerts project 🎉!

@slimsag
Member

slimsag commented Sep 4, 2020

This week

Learned some more Dhall and discussed architecture plan with everyone, put forward RFCs to deprecate single container deployments, sync'd with Bunny on proposal to stop versioning our docs by branches, and had other regular meetings. Around Wed I had some personal / cat issues and had to take off Thur and Fri.

Next week

I am hoping to make forward progress on my assigned issues with the aim of completing everything assigned to me. I think it is still a reasonable workload and am optimistic things in my personal life will calm down soon, but will have to play it by ear a bit.

@davejrt
Contributor

davejrt commented Sep 5, 2020

This week

Closed out the first phase of e2e testing for deploy-sourcegraph-docker with the help of Gonza and Stephen. The rest of my time has been spent learning Dhall and syncing with Geoffrey and Uwe, who have been really helpful in helping me level up.

Next Week

There are still some outstanding tasks related to e2e testing, which I will clarify with Stephen. I'm hoping to attack some of the tasks in the Dhall POC. In addition to this I want to work with Gonza and Robert on our next steps related to monitoring (site24x7 and blackbox exporter) and how we get the best out of both.

@bobheadxi
Member

this week

Attempting my new part-time schedule. Verified dogfood PR automation is working (https://github.com/sourcegraph/deploy-sourcegraph-dogfood-k8s-2/pull/20) and landed some work on improving Sourcegraph -> deploy-sourcegraph image updates, but this doesn't seem to be working as expected currently. Made various changes to docs (updated release process, fixing links, reorganizing to add space for more deployment details). Did some Dhall learning (task, reading up about ideas for architecture)

next week

Wrap up work on dogfooding environment - finalize the image update changes, and have scheduled time with @pecigonzalo to run through deployment of the new environment. Did not get around to sourcegraph/about#1468 this week, and still have my eyes on that, as well as exploring Dhall PoC tasks.

@ggilmore
Contributor

ggilmore commented Sep 5, 2020

Last week:

next week:

@uwedeportivo
Contributor

last week:

next week:

@pecigonzalo
Contributor Author

Last week

I started to work on our long-term objectives and scaling the team, but did not make the progress I wanted to. I'll continue to work on it this week. Dan opened an initial draft of our escalation process and I intend to draft a PR to update our incident response and support "on-call/hero" rotation as discussed during our sync.

This week

As I was unable to make sufficient progress last week, this week's focus remains the same for the most part.

Team update

We will additionally kick off planning for 3.21.

@davejrt
Contributor

davejrt commented Sep 14, 2020

Last week

More or less fully on e2e testing. Setting up the e2e tests running on a vagrant box instead of the unreliable docker in docker solution. A little bit of dhall but to be honest, I feel as though I've dropped the ball on that and given my own desire in getting e2e running, I have been more of a passenger with Dhall.

This week

Close out e2e testing and catch any low-hanging fruit with Dhall. Also spoke with Gonza about our next steps with site24x7 vs blackbox exporter, which may be pulled in for 3.20 but will more than likely be pushed out to 3.21 as tech debt.

@davejrt
Contributor

davejrt commented Sep 14, 2020

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.20 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@pecigonzalo
Contributor Author

Last week
I have created a draft update of our goals and started a document to update our incident management process as a follow-up to Dan's sourcegraph/about#1521.
We also deployed a new dogfood cluster with @bobheadxi which will replace the old Pulumi based deployment.

This week
I'll continue to work on the goals, adding some time estimates and other goals in our backlog, as well as a draft update to our incident management documents. I would also like to finish the new dogfood deployment by the end of the week.

Team update
We have bumped some items (#1221 and #5487) to 3.21 as we shift focus to some customer issues and reviewing the Dhall implementation architecture.

@slimsag
Member

slimsag commented Sep 14, 2020

Last week

Interviewed Cloud and CE candidates, overhauled customer-facing managed instance docs, brainstormed LSIF postgres move with Eric. Chatted with https://app.hubspot.com/contacts/2762526/company/861679490/ about multi-region deployments and more. Figured out next steps of Dhall with Uwe and Geoffrey and swapped some of my planned work to help out further there, spending about 6h total on my Dhall PoC.

This week

Close out my planned work for this iteration, 3.21 planning.

@bobheadxi
Member

bobheadxi commented Sep 15, 2020

Last week

Got really sucked into dogfooding, with various issues around GitHub Actions payloads and formatting, Renovate confusion and a bug, and actually deploying the whole thing with @pecigonzalo . The first two ended up being very time consuming due to the tedious nature of testing this kind of stuff (act not being a perfect simulation of Actions, Renovate needing to be set up in the target repository before things can be tested, and the bug being rather hard to trace down).

This week

Finalize dogfood deployment (https://github.com/sourcegraph/sourcegraph/issues/13792) and land the Renovate bug fix (renovatebot/renovate#7274) to close out the remaining issues for this iteration, and start looking at 3.21 tasks. Also merge the new release steps (sourcegraph/about#1517).

@ggilmore
Contributor

Last week:

  • Decided with @uwedeportivo to onboard @slimsag to the dhall work
  • Did some 3.21 planning
  • Helped investigate bugs with ds-to-dhall and various tools
  • Started work on my dhall proof of concept

This week:

  • More 3.21 planning
  • Heads down on my dhall proof of concept. I'm worried that I'm kind of going in circles with this one, but I think I'll have something to show even if it's rough.

@bobheadxi
Member

last week

I spent most of the week finalizing the work on the new k8s.sgdev.org deployment with @pecigonzalo 's help, and have wrapped up most of that work and made the DNS switch to have the k8s.sgdev.org domain point to the new deployment (announcement). I wrote up documentation updating our information about our existing deployments as well as adding details about the new dogfood cluster.

this week

I took a look at some of the 3.21 tasks for single-day releases, and will start tackling some of them this week. I'll also figure out how to fix one last outstanding issue with the k8s.sgdev.org deployment (thread) and given no complaints, spin down the old cluster to close out https://github.com/sourcegraph/sourcegraph/issues/13792 .

@ggilmore
Contributor

last week:

  • 3.21 Planning
  • prepared Dhall proof of concept for architecture meeting

next week:

  • pick services for sourcegraph.com dhall migration work
  • identify substeps for each service (choose customizations, documentation, etc.)

@pecigonzalo
Contributor Author

pecigonzalo commented Sep 21, 2020

Last week
We closed the plan for 3.21 and merged [updated team goals](https://github.com/sourcegraph/about/pull/1553). I have also started a draft for a development tools team (final name TBD), its vision and an initial job description.

Next week
Start working on the GCP split project by moving the e2e CI resources to a new project. I would also like to finish the draft for the development tools team.

Team updates
We started the next phase of Reduce upgrade overhead and will start implementing services in 3.21. We have also folded the pending tasks of Enable failed e2e test notifications and blocking the pipeline into https://github.com/orgs/sourcegraph/projects/90, as they are a prerequisite for that project; the main goal, having a stable platform/workflow to run e2e jobs, has been achieved.
The Dogfood Kubernetes deployments project is 99% done; the only task left is to tear down the old cluster.


6 participants