This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Distribution 3.21 Tracking issue #13675

Closed
33 of 55 tasks
pecigonzalo opened this issue Sep 7, 2020 · 24 comments

@pecigonzalo
Contributor

pecigonzalo commented Sep 7, 2020

Plan

Support new and existing deployments

This is an ongoing expense; we anticipate it taking no more than 10d of work spread across the entire team.

Support Security in deploying a log analysis tool

Security is planning to deploy a centralized logging and analysis system and will require our assistance to set up and review this new infrastructure.

Implement 2+ sourcegraph.com services using dhall

sourcegraph.com sees the highest number of Kubernetes changes out of all of our deployments + deploy-sourcegraph. Scoping to a single component limits the customizations that we need to implement and makes it easier to onboard other engineers.

Releases are created in a single day

We have a goal of reducing the time it takes to create releases. The current several-day system has encouraged us to view releases as “baked” rather than as “snapshots of the main branch”, leading to situations where main is broken and we have to retrospectively fix it or add last-minute features.

Split infrastructure into separate GCP projects

GCP uses project-wide roles and permissions. To ensure resources are isolated from each other and to reduce the blast radius of changes, we should split resources into separate projects. Additionally, this will give us more insight into our infrastructure costs, which will become more important as we grow and expand it.

Availability

Period is from September 20th to October 19th (21 working days). Please write the days you won't be working and the number of working days for the period.

  • Gonza: 19d (23rd Sept and TBD)
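The 21-working-day figure can be sanity-checked with a short script. This is a minimal sketch, assuming the period falls in 2020 (the year this issue was opened), that both end dates are inclusive, and that only weekends and explicitly listed days off are excluded (no public holidays). The helper name `working_days` is hypothetical.

```python
from datetime import date, timedelta


def working_days(start, end, days_off=frozenset()):
    """Count Monday-Friday dates in [start, end], skipping any listed days off."""
    count = 0
    day = start
    while day <= end:
        # weekday() returns 0-4 for Monday-Friday, 5-6 for the weekend.
        if day.weekday() < 5 and day not in days_off:
            count += 1
        day += timedelta(days=1)
    return count


# Assumption: the period is September 20th to October 19th, 2020, inclusive.
START = date(2020, 9, 20)
END = date(2020, 10, 19)

period_days = working_days(START, END)
# Only the known day off (23rd Sept) is modelled; the "TBD" day is not.
gonza_days = working_days(START, END, frozenset({date(2020, 9, 23)}))

print(period_days, gonza_days)
```

With only the 23rd excluded this gives 20 days for Gonza; the stated 19d is consistent once the still-undecided second day off is subtracted.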

Tracked issues

@unassigned: 5.00d

Completed: 5.00d

  • (🏁 14 days ago) run "e2e regression tests" in CI once/day, even if they fail all the time (#13876) 5.00d

@bobheadxi: 8.50d

  • on-call: document actions to follow up on critical alerts (#1468)

Completed: 8.50d

  • (🏁 36 days ago) renovate-downstream: refine action trigger (#13842)
  • (🏁 23 days ago) release steps: stop posting milestone triage messages (#13871) 2.00d
  • (🏁 22 days ago) dogfood-k8s: finalize migration over to new cluster (#13792) 1.00d
  • (🏁 17 days ago) release steps: automate CHANGELOG version header creation (#13873) 2.00d
  • (🏁 17 days ago) release steps: do not verify CHANGELOG entries (#13872) 0.50d
  • (🏁 16 days ago) release steps: roll deploy-sourcegraph PR creation into yarn run release release:publish (#14242) 1.00d
  • (🏁 16 days ago) managed-instances: deploy a demo instance (#13604) 1.00d
  • (🏁 15 days ago) release steps: stop announcing release candidates (#13875) 0.50d
  • (🏁 10 days ago) release steps: stop posting messages about branch cut in Slack (#13869) 0.50d
  • (🏁 7 days ago) release: command naming and behaviour is inconsistent (#14623)
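The per-person totals appear to be the sum of the trailing `N.NNd` estimates on each issue line, with unestimated issues counting as zero. The sketch below illustrates that arithmetic for the completed list above; the regular expression and the function name `total_estimate` are assumptions for illustration, not the tooling that actually generated this tracking issue.

```python
import re

# Assumption: an estimate is a decimal number followed by "d" that appears
# right after the closing parenthesis of the issue reference, e.g. "(#13871) 2.00d".
ESTIMATE = re.compile(r"\)\s+(\d+(?:\.\d+)?)d(?:\s|$)")


def total_estimate(lines):
    """Sum the day estimates found on issue lines; lines without one add 0."""
    total = 0.0
    for line in lines:
        match = ESTIMATE.search(line)
        if match:
            total += float(match.group(1))
    return total


completed = [
    "(🏁 36 days ago) renovate-downstream: refine action trigger (#13842)",
    "(🏁 23 days ago) release steps: stop posting milestone triage messages (#13871) 2.00d",
    "(🏁 22 days ago) dogfood-k8s: finalize migration over to new cluster (#13792) 1.00d",
    "(🏁 17 days ago) release steps: automate CHANGELOG version header creation (#13873) 2.00d",
    "(🏁 17 days ago) release steps: do not verify CHANGELOG entries (#13872) 0.50d",
    "(🏁 16 days ago) release steps: roll deploy-sourcegraph PR creation into yarn run release release:publish (#14242) 1.00d",
    "(🏁 16 days ago) managed-instances: deploy a demo instance (#13604) 1.00d",
    "(🏁 15 days ago) release steps: stop announcing release candidates (#13875) 0.50d",
    "(🏁 10 days ago) release steps: stop posting messages about branch cut in Slack (#13869) 0.50d",
    "(🏁 7 days ago) release: command naming and behaviour is inconsistent (#14623)",
]

completed_total = total_estimate(completed)
print(f"Completed: {completed_total:.2f}d")
```

The result matches the "Completed: 8.50d" header for this list.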

@davejrt

  • Run QA tests on bare-metal Buildkite agents on every commit to master (non-blocking) (#12340)
  • blackbox exporter & site 24/7 next steps (#13627) 🧶
  • sourcegraph/customer (#111) 👩

Completed

  • (🏁 119 days ago) Bigdata customer Tracking issue (#11717)
  • (🏁 24 days ago) Run e2e tests on bare-metal Buildkite agents on every commit to master (non-blocking) (#12339)

@daxmc99: 4.00d

  • explore making it easier to run Kubernetes cluster QA tests (or relax to just smoke tests) (#13878) 4.00d

@efritz

  • docs: Update pure-docker upgrading docs (#14671) :shipit:

@ggilmore

  • ci: build and pin tool apks in CI for release (#13297) 🧶
  • write instructions for how to modify sourcegraph.com's dhall generation pipeline (#14136)
  • write developer friendly documentation for deploy-sourcegraph-dhall architecture (#14135)
  • add "symbols" to service deploy-sourcegraph-dhall, with support for sourcegraph.com customizations (#14130)
  • dhall: use dhall on sourcegraph.com (#13340)

Completed

  • (🏁 13 days ago) sourcegraph/customer (#110) 👩

@pecigonzalo: 23.00d

  • blackbox exporter & site 24/7 next steps (#13627) 🧶
  • sourcegraph/customer (#108) 👩

Completed: 23.00d

  • (🏁 28 days ago) Move the CI e2e runner to the CI project (#13919) 1.00d
  • (🏁 24 days ago) Move the CI cluster to the CI project (#13920) 3.00d
  • (🏁 23 days ago) Move the single container dogfood deployment to the dogfood k8s cluster (#13916) 2.00d
  • (🏁 23 days ago) Delete the big data clusters (#13918) 5.00d
  • (🏁 22 days ago) dogfood-k8s: finalize migration over to new cluster (#13792) 1.00d
  • (🏁 15 days ago) Remove the -tooling cluster from the production project (#13917; PRs: #1719) 3.00d
  • (🏁 10 days ago) sourcegraph/customer (#105) 8.00d 👩

@slimsag: 15.00d

  • sourcegraph/customer (#71) 👩
  • sourcegraph/customer (#49) 0.50d 👩
  • sourcegraph/customer (#97) 👩

Completed: 14.50d

  • (🏁 31 days ago) sourcegraph/customer (#104) 👩
  • (🏁 29 days ago) Create a dev/testing managed instance (#14075)
  • (🏁 24 days ago) Run e2e tests on bare-metal Buildkite agents on every commit to master (non-blocking) (#12339)
  • (🏁 16 days ago) Remove syntax highlighting for GraphQL, INI file, TOML, and Perforce (#13933)
  • (🏁 16 days ago) release steps: make Product team self-sufficient (#13868) 0.50d
  • (🏁 8 days ago) License report for syntect_server & its dependencies (#11269) 1.00d 👩
  • (🏁 7 days ago) Document when to introduce new services or not (#5487) :shipit:
  • (🏁 3 days ago) engineering: document when to (or not to) introduce a new service (#1769) :shipit:
  • (🏁 3 days ago) distribution: add monitoring architecture page (#1221) :shipit:
  • (🏁 3 days ago) Improve reliability of QA tests (#13880) 12.00d
  • (🏁 3 days ago) Document QA test commands (#14632) 1.00d

@uwedeportivo: 9.50d

  • sourcegraph.com: write bot to incorporate image tag updates in dhall pipeline (#14133) 1.50d
  • add deploy-sourcegraph-dhall pipeline to deploy-sourcegraph-dot-com (#14132) 1.00d
  • add gitserver to deploy-sourcegraph-dhall, with support for sourcegraph.com customizations (#14131) 4.00d
  • dhall: generate separate yaml files for each "component" instead of one large one (#13338) 2.00d
  • deploy-sourcegraph: restricted integration test fails with Kubernetes 1.16+ (#14728)
  • dhall: use dhall on sourcegraph.com (#13340)

Completed: 1.00d

  • (🏁 22 days ago) Repo-updater component always outputs debug logs (#13191) 1.00d 👩🎩

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🎩 Quality of life
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • :shipit: Pull Request
pecigonzalo added this to the 3.21 milestone Sep 7, 2020
pecigonzalo changed the title from "WIP: Distribution 3.21 Tracking issue" to "Distribution 3.21 Tracking issue" Sep 18, 2020
@davejrt
Contributor

davejrt commented Sep 25, 2020

Last week
Finalizing the work on e2e tests in vagrant. I had this about 99% working with what seemed like a minor issue; now I've run into something else I need to track down, where I've introduced a new bug. A classic case of "it was working on my machine". Also lots of cleanup in the wake of secrets being exposed.

Next week
Finalizing the e2e tests. I discussed with Gonza some thoughts I had about our release process, and I plan to document those based on my experience in 3.20. I'd also like to clarify a few things about how we prevent the mistake I made from happening again, which I plan to document and circulate with the appropriate people.

@slimsag
Member

slimsag commented Sep 25, 2020

This week

This week I spent most of my time providing support internally and to customers, much more than usual. I did not make much headway against my planned work, but did add issues extensively for everything that came up to this milestone.

For customers, I provided extensive resource allocation advice to two major customers, and followed up extensively on ~7-8 more medium-sized customer issues before ultimately passing them off to other individuals or teams in order to reduce the number assigned to me.

Internally, I created a dev/testing managed instance and shared knowledge of them with the rest of the team in the form of updated docs, a recorded screencast, and improved tooling. I investigated ops issues with sourcegraph.com and multiple dev deployments with the team.

My 1:1s ran much longer than usual, leading to longer-form ongoing conversations. I also wrote a high-level progress summary on the Dhall work.

Next week:

I am hoping to be more heads-down and make substantial headway against my planned work, but acknowledge I have many more extensive conversations ahead of me which will be time consuming. Focus is key.

@bobheadxi
Member

Last week

This was an extra short week for me because I took one of the mental health day things. I got k8s.sgdev.org running smoothly, and helped a bit with migrating campaigns over from the old deployment. During this I found that the deploy-sourcegraph overlay for namespaces wasn't set up for cAdvisor, so I made a PR to add one and try and improve the docs around that a bit. Also found and fixed a bug in prom-wrapper that was causing custom alerts usernames to not be set correctly.

This week

I'm a little behind on getting started with 3.21 stuff so I'll be spending extra time this week to make up for that. I'll also ping #dev-chat to ask for objections about spinning down the old k8s.sgdev.org and go ahead and do that.

@pecigonzalo
Contributor Author

Last week
I managed to move the CI and dogfood-server clusters to separate GCP projects. The dogfood-server cluster will reuse the dogfood-k8s GKE cluster, as it's a single container.

This week
I'll work on cleaning up and deleting leftover resources from the migration, and start the work to remove the bigdata cluster.

@bobheadxi
Member

This week

Deployed demo.sourcegraph.com - last step to this is awaiting #ce followup, and made some docs updates for managed instances while at it. Opened up a couple of PRs related to 1-day releases and reducing the steps required there. Discussed the future of Cloud deployment in this thread and RFC 239.

Next week

Find out who to ping for review for release-tool PRs (would still like one for https://github.com/sourcegraph/sourcegraph/pull/14240) and use that to start working out the rest of the tasks I've picked up for the 1-day releases project. Given the frequency of requests for clarification regarding Cloud deployments, would also like to help @daxmc99 if possible with polishing up RFC 239.

@davejrt
Contributor

davejrt commented Oct 2, 2020

Last week

e2e is now running in a non-blocking capacity on main, which I hope is now just a case of ironing out the last few bugs with some help from web (I am confident in the infra and base image setup now). Helping out with a security scare, and the rest of my time was spent helping out on a big customer issue. Also a quick quality-of-life PR to manage AWS service accounts with Terraform. A bit of other troubleshooting here and there.

Next week

Finish e2e with the help of the web team. I am also going to sync with uwe around regression testing to see how different they are and what effort is required to get that into a pipeline as well. I predict some significant time spent helping on customer issues too.

@slimsag
Member

slimsag commented Oct 3, 2020

This week

Was sick from Sat <-> Thu. On Friday I spent 90% of my time catching up on things, and did other minor work like adjusting 1password permissions for managed instances, helping to debug one customer issue, and investigating critical alerts at https://app.hubspot.com/contacts/2762526/company/407948923/

Next week

Hoping to get to what I did not this week, i.e. heads-down on my planned work with >=50% of my time.

@uwedeportivo
Contributor

uwedeportivo commented Oct 3, 2020

this week

one quality of life issue (https://github.com/sourcegraph/sourcegraph/issues/13191) done, one dhall issue almost done (https://github.com/sourcegraph/sourcegraph/issues/14133), pitched in on token rotation and had debug sessions for customer issue

next week

all the stars will align and i will work on dhall code

@ggilmore
Contributor

ggilmore commented Oct 5, 2020

this week:

  • got spun up on working on the symbols service for deploy-sourcegraph-dhall (Tuesday, Wednesday)
  • spent a lot of time on security (Thursday)
  • spent a lot of time figuring out an important customer issue (Thursday, Friday)

next week:

  • Hopefully there won't be as much task switching so that I can get back to working on dhall

@pecigonzalo
Contributor Author

pecigonzalo commented Oct 5, 2020

Last week

I mostly worked on the GCP Split project: deleted BigCluster and cleaned up disks, deleted Megakube, and moved Tooling resources to the Dogfood cluster as they are used there (Phabricator, GHE, Bitbucket, Gitolite). This included porting a bunch of infrastructure to Terraform.
I have also been supporting and debugging https://github.com/sourcegraph/customer/issues/105 with @unknwon but we are currently waiting on the customer.

This week

Finish the Tooling cluster/resources move cleanup and update any relevant documentation. I need to switch back to updating our long-term goals, integrating the roadmap provided by Stephen into our goals and finishing the Distribution growth PR.

@slimsag
Member

slimsag commented Oct 10, 2020

This week

I played catch-up on PRs, reviews, etc. after being out sick last week. I followed-up on minor tasks, like setting up demo.sourcegraph.com with Robert and restructuring our 1password vaults. I had lots of 1:1 / career growth discussions, etc. I then began to hammer out my actual planned work, removing non-OSS syntax highlighting languages and creating a super extensive/tedious license report on syntect_server and dealing with some update pains/segfaults there. To finish off my week, I took a deep dive into the QA (formerly "e2e regression") test suite and pulled in others to help address 3 release blockers I identified in the process.

Next week

We are seeing lots of QA test suite failures, some of which look like real release-blocking regressions. I will be isolating those, filing issues, and pulling in more people to fix them. At the same time, I will be focused on 3.22 planning and working with Dave and Uwe to improve QA test suite reliability.

@bobheadxi
Member

bobheadxi commented Oct 12, 2020

Last week

Some small contributions to the CNCF repopage project: blackbox, CSS change to the logo. Landed improvements to changelog automation, deploy-sourcegraph release automation, and general release steps reductions and dry-runs for the release tool. Added support for regex silencing in observability.silenceAlerts. Investigated some k8s.sgdev.org prometheus failures and made a handbook update. Switched the default for NaN values in alerts to alleviate false alerts that have been firing on low-traffic instances like k8s.sgdev.org (and some customer test instances)

This week

The main thing I have in mind this week is to keep an eye on the release process and see if any of the changes need clarification/improvement.

@uwedeportivo
Contributor

uwedeportivo commented Oct 12, 2020

last week

this week

  • work on dhall components
  • work on regression e2e testing for 3.21 release (release week)

@pecigonzalo
Contributor Author

@uwedeportivo could you add which components?

@davejrt
Contributor

davejrt commented Oct 12, 2020

last week

this week

Hopefully customer issues will settle down and we can focus on internal issues. Last week uwe ran us through e2e/regression testing and I gained a lot of insight into which failures are infra related vs. caused by the tests themselves. This week the plan is to get as much running as we can, then identify which issues are for other teams to fix.

@pecigonzalo
Contributor Author

Last Week
Finalized the GCP Split project: all resources are now in their appropriate projects, and we closed https://github.com/sourcegraph/customer/issues/105.

This Week
Focus on planning 3.22 and working with CE regarding incident management. I'll also be taking a day off, which I was going to take on the 12th but did not manage to.
I'll continue with small resource cleanups as I find them in GCP and AWS.

@ggilmore
Contributor

Last week:

Next week:

  • Focus on preparing a demo for dhall to record for customers

@slimsag
Member

slimsag commented Oct 17, 2020

This week

A lot of conversations: changing my direction/focus, interviewing candidates, syncing with https://app.hubspot.com/contacts/2762526/company/407948923/ (alerts, upgrades, etc.) and https://app.hubspot.com/contacts/2762526/company/557692805/ (search, stability), syncing with Christina about state of product & opportunities.

A fair amount of time spent heads-down trying to debug/improve QA tests, but with few results. It's been hard for me to make progress here with lots of interruptions throughout my day and the test suite itself being so dang confusing (but also quite extensive). I caught up with Uwe and did some pairing up on it with him.

Wanting to feel as though I made some progress other than just conversations, I switched away from QA tests mid-Thur and put my thoughts/questions around Cloud on paper, documented when to introduce new services, and merged some updates from Rob and Rijnard to improve syntax highlighting colors + add back GraphQL support.

Next week

Focus, get more heads-down time on QA tests and push the release through ASAP with Uwe and Dave.

@bobheadxi
Member

This week

Some last-minute tweaks and adjustments to the release process for 3.21 (both on the release tool and the checklist), debugged the deploy-sourcegraph CI pipeline, and reviewed some monitoring PRs after noticing some flaky critical alerts on k8s.sgdev.org.

Next week

Keep tabs on the release process, start exploring other parts of the release pipeline (e2e, etc.) and the possibilities there. Will also be exploring our options with the upcoming deployment UX project meeting. Am also adjusting my work schedule a bit, but no major changes to my meeting availability for the most part.

@uwedeportivo
Contributor

this week

Chased down a couple of issues with a big customer (https://github.com/sourcegraph/customer/issues/111, disk space distribution of index space, https://github.com/sourcegraph/customer/issues/116). Pitched in on the release process by running regression tests and fixing them up. My Dhall language proposal hit a roadblock (dhall-lang/dhall-lang#1081) :-). Still working on Dhall components; progress has not been as fast as I would like.

next week

Getting 3.21 out the door is priority for the beginning of the week. Afterwards I will probably go on vacation.

@ggilmore
Contributor

Last week:

  • Heads down on preparing / updating deploy-sourcegraph-dhall-archived

This week:

  • Record demo video for internal feedback

@pecigonzalo
Contributor Author

Last week
Most of last week has been reviewing RFCs (RFC-239: QA Environments, RFC-245: Centralized logging, RFC-249: Secret Management), PRs and other Slack conversations. The rest of it was planning the next sprint and how to track it with our current workflow.

This week
I did not finish my review comments for RFC-239 so I would like to finish those and the plan for the next sprint. I will also sync about the delivery pipeline UX and create a goal for it.

@davejrt
Contributor

davejrt commented Oct 19, 2020

Last week

Fighting fires with uwe on a large customer (sourcegraph/customer#111) and really battling with regression tests. Uwe and Stephen have been a big help in digging through some of this with me. I have the infrastructure in a good working state, with automation now to set up the Sourcegraph instance prior to running the tests. I am still confused as to why things don't work consistently between environments, and why some tests need to be run twice in order to work.

Next week

Top priority will be to get 3.21 released, however the regression tests are run (locally or in CI). After that, a write-up that really identifies where the gaps are, what is broken, and what can be automated.

@daxmc99
Contributor

daxmc99 commented Oct 19, 2020

Last Week

Vacation 🌴 🚵‍♂️

This Week

Finish up remaining Cloud SQL work https://github.com/sourcegraph/sourcegraph/issues/11496,
investigate deployment pipeline UX and report back to Cloud team with our decisions.


7 participants