roachprod: support multiple local clusters #71945
Labels
A-multitenancy
Related to multi-tenancy
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
Comments
RaduBerinde added the C-bug and A-multitenancy labels on Oct 25, 2021
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Oct 26, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. Fixes cockroachdb#71945. Release note: None
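The naming rule in this commit message can be illustrated with a minimal Go sketch. The helper names `localClusterSuffix` and `nodeDir`, and the `base` parameter standing in for the home directory, are hypothetical and are not taken from roachprod's code; only the `local` / `local-foo` rule and the two example paths come from the message above.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// localClusterSuffix returns the part of a local cluster name after "local-",
// or "" for the plain "local" cluster. ok is false when the name is neither
// "local" nor of the form "local-foo". (Hypothetical helper.)
func localClusterSuffix(name string) (suffix string, ok bool) {
	if name == "local" {
		return "", true
	}
	suffix = strings.TrimPrefix(name, "local-")
	if suffix == name || suffix == "" {
		// No "local-" prefix, or nothing after it.
		return "", false
	}
	return suffix, true
}

// nodeDir maps a cluster name and node number to a directory under base,
// following the two examples in the commit message:
// "local" -> base/local/1 and "local-foo" -> base/local/foo-1.
func nodeDir(base, cluster string, node int) (string, error) {
	suffix, ok := localClusterSuffix(cluster)
	if !ok {
		return "", fmt.Errorf("invalid local cluster name %q", cluster)
	}
	if suffix == "" {
		return filepath.Join(base, "local", fmt.Sprintf("%d", node)), nil
	}
	return filepath.Join(base, "local", fmt.Sprintf("%s-%d", suffix, node)), nil
}

func main() {
	for _, name := range []string{"local", "local-foo", "remote"} {
		dir, err := nodeDir("~", name, 1)
		fmt.Println(name, dir, err)
	}
}
```

Keeping the plain `local` cluster at its old path means an existing single local cluster does not have to move, while named clusters get a distinct prefix per node directory.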
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Oct 26, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. Fixes cockroachdb#71945. Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Oct 27, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. It was necessary to change the ROACHPROD environment variable convention to include the cluster name (otherwise we can't distinguish between processes on different clusters). This can cause problems if `roachprod` is updated and then used to stop processes that were started with an older version. Fixes cockroachdb#71945. Release note: None
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Nov 2, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable. Fixes cockroachdb#71945. Release note: None meh
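Why the cluster name has to be part of the ROACHPROD variable can be shown with a small Go sketch. The value format used here (`<cluster>/<node>` for local clusters, the bare node number otherwise) is an assumption made for illustration; the commit message does not state the exact format. The function names are likewise hypothetical.

```go
package main

import "fmt"

// roachprodEnvValue builds the value of the ROACHPROD variable for a node.
// Assumption: "<cluster>/<node>" for local clusters and "<node>" otherwise.
// Keeping this in one function mirrors the "centralize the logic" goal.
func roachprodEnvValue(cluster string, isLocal bool, node int) string {
	if isLocal {
		return fmt.Sprintf("%s/%d", cluster, node)
	}
	return fmt.Sprintf("%d", node)
}

// ownsProcess reports whether a process environment (KEY=VALUE entries)
// carries exactly the ROACHPROD value for the given cluster and node.
func ownsProcess(env []string, cluster string, isLocal bool, node int) bool {
	want := "ROACHPROD=" + roachprodEnvValue(cluster, isLocal, node)
	for _, kv := range env {
		if kv == want {
			return true
		}
	}
	return false
}

func main() {
	// Environment of a process started for node 1 of local cluster "local-foo".
	env := []string{"PATH=/usr/bin", "ROACHPROD=local-foo/1"}
	fmt.Println(ownsProcess(env, "local-foo", true, 1)) // same cluster and node
	fmt.Println(ownsProcess(env, "local", true, 1))     // same node, other cluster
}
```

With only a node number in the variable, node 1 of `local` and node 1 of `local-foo` would look identical on the same machine, so stopping one cluster could hit the other's processes.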
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Nov 3, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable. Fixes cockroachdb#71945. Release note: None meh
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Nov 3, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable. Fixes cockroachdb#71945. Release note: None meh
craig bot pushed a commit that referenced this issue on Nov 4, 2021
57208: sql: more allocation reductions r=yuzefovich a=jordanlewis

See individual commits for details.

```
name                                          old time/op    new time/op    delta
FlowSetup/vectorize=true/distribute=true-24      283µs ± 9%     279µs ± 6%    ~     (p=0.316 n=29+30)
FlowSetup/vectorize=true/distribute=false-24     275µs ± 4%     273µs ± 6%    ~     (p=0.107 n=29+30)

name                                          old alloc/op   new alloc/op   delta
FlowSetup/vectorize=true/distribute=true-24     37.9kB ± 1%    37.9kB ± 1%    ~     (p=0.503 n=24+24)
FlowSetup/vectorize=true/distribute=false-24    36.1kB ± 1%    36.0kB ± 0%  -0.22%  (p=0.009 n=25+25)

name                                          old allocs/op  new allocs/op  delta
FlowSetup/vectorize=true/distribute=true-24        361 ± 0%       358 ± 0%  -0.81%  (p=0.000 n=25+26)
FlowSetup/vectorize=true/distribute=false-24       348 ± 1%       346 ± 1%  -0.84%  (p=0.000 n=28+28)
```

70334: sql: allow atomic name swaps r=postamar a=postamar

This PR deprecates and disables usage of the draining names mechanism on catalog descriptors. When renaming or dropping a table/database/schema/type, the namespace entry is now modified in-transaction. This makes it possible to swap names in-transaction. On the other hand, names may be inconsistently resolved to either the old or the new target in specific circumstances.

Fixes #54562.

71970: roachprod: support multiple local clusters r=RaduBerinde a=RaduBerinde

Many of these commits are cleanup and reorganization. The functional changes are:
- we no longer use ~/.roachprod/hosts; instead we use json files in ~/.roachprod/clusters
- we can now have multiple local clusters, with names like local-foo
- roachprod invocations using different GCE projects no longer step on each-other's toes

When this PR merges, roachprod will lose track of any existing local cluster; it will need to be cleaned up and recreated.

#### roachprod: minor cleanup for cloud.Cloud

This change fills in some missing comments from `cloud.Cloud` and improves the interface a bit. Some of the related roachprod code is cleaned up as well.

Release note: None

#### roachprod: clean up local cluster metadata

The logic around how the local cluster metadata is loaded and saved is very convoluted. The local provider is using `install.Clusters` and is writing directly to the `.hosts/local` file.

This commit disentangles this logic: it is now up to the main program to call `local.AddCluster()` to inject local cluster information. The main program also provides the implementation for a new `local.VMStorage` interface, allowing the code for saving the hosts file to live where it belongs.

Release note: None

#### roachprod: clean up local cluster deletion

This change moves the code to destroy the local cluster to the local provider. The hosts file is deleted through LocalVMStorage.

Release note: None

#### roachprod: rework clusters cache

This commit changes roachprod from using `hosts`-style files in `~/.roachprod/hosts` for caching clusters to using json files in `~/.roachprod/clusters`.

Like before, each cluster has its own file. The main advantage is that we can now store the entire cluster metadata instead of manufacturing it based on one-off parsing.

WARNING: after this change, the information in `~/.roachprod/hosts` will no longer be used. If a local cluster exists, the new `roachprod` version will not know about it. It is recommended to destroy any local cluster before using the new version. A local cluster can also be cleaned up manually using:

```
killall -9 cockroach
rm -rf ~/.roachprod/local
```

Release note: None

#### roachprod: use cloud.Cluster in SyncedCluster

This change stores Cluster metadata directly in SyncedCluster, instead of making copies of various fields.

#### roachprod: store ports in vm.VM

This change adds `SQLPort` and `AdminUIPort` fields to `vm.VM`. This allows us to remove the special hardcoded values for the local cluster.

Having these fields stored in the clusters cache will allow having multiple local clusters, each with their own set of ports.

Release note: None

#### roachprod: support multiple local clusters

This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo".

When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`.

For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable.

Fixes #71945.

Release note: None meh

#### roachprod: list VMs in parallel

This commit speeds up the slowest step of roachprod: listing VMs from all providers. We now list the VMs in parallel across all providers instead of doing it serially.

Release note: None

#### roachprod: fix behavior when mixing GCE projects

Currently roachprod has very poor behavior when used with different projects on the same host. For example:

```
shell1:
GCE_PROJECT=andrei-jepsen roachstress.sh ...   // this will run ~forever

sometime later in shell2:
roachprod sync (on the default project)
```

The sync on the default project removes the cached information for the cluster on `andrei-jepsen`, which causes `roachprod` commands against that cluster (from within the `roachstress.sh` script) to fail.

We fix this by ignoring any cached clusters that reference a project that the provider was not configured for - both when loading clusters into memory and when deleting stale cluster files during `sync`.

As part of the change, we also improve the output of `list` to remove the colon after the cluster name and to include the GCE project:

```
$ roachprod list --gce-project cockroach-ephemeral,andrei-jepsen
Syncing...
Refreshing DNS entries...
glenn-anarest              [aws]                      9  (142h41m39s)
glenn-drive                [aws]                      1  (141h41m39s)
jane-1635868819-01-n1cpu4  [gce:cockroach-ephemeral]  1  (10h41m39s)
lin-ana                    [aws]                      9  (178h41m39s)
local-foo                  [local]                    4  (-)
radu-foo                   [gce:andrei-jepsen]        4  (12h41m39s)
radu-test                  [gce:cockroach-ephemeral]  4  (12h41m39s)
```

Release note: None

72071: sql: fix OIDs in RowDescription in some cases r=yuzefovich a=yuzefovich

Fixes: #71891.

Release note (bug fix): Previously, CockroachDB could not set the `TableOID` and `TableAttributeNumber` attributes of `RowDescription` message of pgwire protocol in some cases (these values would be left as 0), and this is now fixed.

Co-authored-by: Jordan Lewis
Co-authored-by: Marius Posta
Co-authored-by: Radu Berinde
Co-authored-by: Yahor Yuzefovich
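The fix described under "fix behavior when mixing GCE projects" (ignore cached clusters whose project the provider was not configured for) can be sketched in Go as below. The `cachedCluster` type, its JSON field names, and the `usable` function are hypothetical and only illustrate the filtering rule; the real layout of the files in `~/.roachprod/clusters` is not given in the message.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cachedCluster is a hypothetical, trimmed-down shape for one cluster file.
type cachedCluster struct {
	Name     string `json:"name"`
	Provider string `json:"provider"`
	Project  string `json:"project,omitempty"`
}

// usable reports whether a cached cluster should be loaded (or considered
// for stale-file deletion) given the GCE projects configured for this run.
// Clusters from other providers are never filtered by project.
func usable(c cachedCluster, projects map[string]bool) bool {
	if c.Provider != "gce" {
		return true
	}
	return projects[c.Project]
}

func main() {
	// Stand-ins for the contents of three per-cluster JSON files.
	files := []string{
		`{"name":"radu-foo","provider":"gce","project":"andrei-jepsen"}`,
		`{"name":"radu-test","provider":"gce","project":"cockroach-ephemeral"}`,
		`{"name":"local-foo","provider":"local"}`,
	}
	// This invocation is configured only for the default project.
	active := map[string]bool{"cockroach-ephemeral": true}
	for _, raw := range files {
		var c cachedCluster
		if err := json.Unmarshal([]byte(raw), &c); err != nil {
			fmt.Println("skipping unreadable cache entry:", err)
			continue
		}
		fmt.Printf("%s usable=%v\n", c.Name, usable(c, active))
	}
}
```

Applying the same predicate during `sync` cleanup is what keeps a sync on one project from deleting the cache file of a cluster that belongs to another project.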
RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue on Nov 15, 2021
This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo". When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`. For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable. Fixes cockroachdb#71945. Release note: None meh
craig bot pushed a commit that referenced this issue on Nov 17, 2021
72641: release-21.2: roachprod: backport changes from master as of 2021-11-11 r=RaduBerinde a=RaduBerinde

This PR backports all changes involving roachprod as of 2021-11-11. There have been large refactorings which we want to backport, or it will make backporting any future necessary roachtest fixes much harder. We also want new upcoming features around multi-tenancy available for 21.2.

CC @cockroachdb/release

#### roachprod/vm/aws: improve help text for multiple stores

```bash
roachprod create ajwerner-test -n1 --clouds aws \
    --aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
    --aws-ebs-volume='{"VolumeType": "io2", "VolumeSize": 213, "Iops": 321}' \
    --aws-enable-multiple-stores=true
roachprod stage ajwerner-test cockroach
roachprod start ajwerner-test --store-count 2
```

The above commands will create a node with multiple stores and start cockroach on them. Hopefully these minor help changes make that clearer.

Release note: None

#### roachprod: add stageurl command

Sometimes it is useful to be able to download these artifacts directly, for example when trying to bisect a problem. But the format of the URL can take a second to remember.

The stageurl command prints the staging URL of the given application.

I've reorganized some of the code to reduce duplication between the stage and stageurl command. There is still more duplication than I would like. But I figured I would see if this seems useful to others before further refactoring.

Release note: None

#### roachprod: clean up roachprod ssh keys in aws

Many SSH keys created by roachprod are no longer used, and some were created by former employees. This needed to change because it's a security issue that former employees may exploit. This patch adds another step to roachprod-gc cronjob to tag any untagged keys created by roachprod in AWS and delete them if they are unused.

Release note: None

#### roachprod: upgrade Azure Ubuntu image to 20.04

Previously, the Ubuntu 18.04 image in use did not support `systemd-run --same-dir`, which is used by some roachprod scripts. Additionally, GCE and AWS already use Ubuntu 20.04 based images for roachprod.

Updating the base image to Ubuntu 20.04 fixes the issue above and aligns the version with other cloud providers.

Release note: None

#### roachprod: update azure SDK

This is a partial backport of the commit below (only the part that affects roachprod).

metric: Add Alert and Aggregation Rule interface

In this commit, the interfaces for Alert and Aggregation rule interfaces are outlined. These interfaces will be used by a new endpoint which will expose these rules in a YAML format. This endpoint can be used by our end users to configure alerts/monitoring for CockroachDB clusters.

This commit also updates the prometheus dependency in the vendor submodule.

Release note: None

#### roachprod: fix roachprod gc docker build

Previously, the roachprod garbage collector docker image build process was using the `go get` approach to build roachprod. Currently, this method doesn't work, because it doesn't use any pinning, so the build ends up with all kinds of deprecation warnings and failures.

* Use multi-stage docker build in order to separate build and runtime. It also reduces the image size from 1.9G to 700M.
* Build roachprod using the checked out commit SHA.
* Use the Bazel build image we use in CI to build roachprod.
* Use Bazel to build roachprod.
* Added `cloudbuild.yaml` to publish the docker image to GCR and use a beefier instance type.
* Modify the entrypoint script to set the default region, required by the AWS Go SDK library.
* Add `push.sh` to script deployment.

Release note: None

#### roachprod: correct spelling mistakes

Release note: None

#### roachprod: install AWS CLI v2 for GC images

Previously, after regenerating the GC docker images, roachprod stopped listing AWS as an available provider, because Debian ships with AWS CLI v1, but roachprod doesn't support it.

This patch installs AWS CLI v2.

Release note: None

#### roachprod: making roachprod subcommands point to a new library

Previously, roachprod binary interfaced directly with roachprod's functionality and there was no way for another tool to make use of that functionality. This needed to change to create a library that can be used by roachprod binary and also other tools. This patch migrates the subcommands functionality to a new library and makes the binary point to the new library.

Release note: None

#### roachprod: avoid flaky test due to unused functions

Merging #71660 triggered a flaky test due to unused functions. This patch avoids that test by making use of / commenting unused functions.

Release note: None

#### roachprod: minor cleanup for cloud.Cloud

This change fills in some missing comments from `cloud.Cloud` and improves the interface a bit. Some of the related roachprod code is cleaned up as well.

Release note: None

#### roachprod: clean up local cluster metadata

The logic around how the local cluster metadata is loaded and saved is very convoluted. The local provider is using `install.Clusters` and is writing directly to the `.hosts/local` file.

This commit disentangles this logic: it is now up to the main program to call `local.AddCluster()` to inject local cluster information. The main program also provides the implementation for a new `local.VMStorage` interface, allowing the code for saving the hosts file to live where it belongs.

Release note: None

#### roachprod: clean up local cluster deletion

This change moves the code to destroy the local cluster to the local provider. The hosts file is deleted through LocalVMStorage.

Release note: None

#### roachprod: rework clusters cache

This commit changes roachprod from using `hosts`-style files in `~/.roachprod/hosts` for caching clusters to using json files in `~/.roachprod/clusters`.

Like before, each cluster has its own file. The main advantage is that we can now store the entire cluster metadata instead of manufacturing it based on one-off parsing.

WARNING: after this change, the information in `~/.roachprod/hosts` will no longer be used. If a local cluster exists, the new `roachprod` version will not know about it. It is recommended to destroy any local cluster before using the new version. A local cluster can also be cleaned up manually using:

```
killall -9 cockroach
rm -rf ~/.roachprod/local
```

Release note: None

#### roachprod: use cloud.Cluster in SyncedCluster

This change stores Cluster metadata directly in SyncedCluster, instead of making copies of various fields.

#### roachprod: store ports in vm.VM

This change adds `SQLPort` and `AdminUIPort` fields to `vm.VM`. This allows us to remove the special hardcoded values for the local cluster.

Having these fields stored in the clusters cache will allow having multiple local clusters, each with their own set of ports.

Release note: None

#### roachprod: support multiple local clusters

This change adds support for multiple local clusters. Local cluster names must be either "local" or of the form "local-foo".

When the cluster is named "local", the node directories stay in the same place, e.g. `~/local/1`. If the cluster is named "local-foo", node directories are like `~/local/foo-1`.

For local clusters we include the cluster name in the ROACHPROD variable; this is necessary to distinguish between processes of different local clusters. The relevant code is cleaned up to centralize the logic related to the ROACHPROD variable.

Fixes #71945.

Release note: None meh

#### roachprod: list VMs in parallel

This commit speeds up the slowest step of roachprod: listing VMs from all providers. We now list the VMs in parallel across all providers instead of doing it serially.

Release note: None

#### roachprod: fix behavior when mixing GCE projects

Currently roachprod has very poor behavior when used with different projects on the same host. For example:

```
shell1:
GCE_PROJECT=andrei-jepsen roachstress.sh ...   // this will run ~forever

sometime later in shell2:
roachprod sync (on the default project)
```

The sync on the default project removes the cached information for the cluster on `andrei-jepsen`, which causes `roachprod` commands against that cluster (from within the `roachstress.sh` script) to fail.

We fix this by ignoring any cached clusters that reference a project that the provider was not configured for - both when loading clusters into memory and when deleting stale cluster files during `sync`.

As part of the change, we also improve the output of `list` to remove the colon after the cluster name and to include the GCE project:

```
$ roachprod list --gce-project cockroach-ephemeral,andrei-jepsen
Syncing...
Refreshing DNS entries...
glenn-anarest              [aws]                      9  (142h41m39s)
glenn-drive                [aws]                      1  (141h41m39s)
jane-1635868819-01-n1cpu4  [gce:cockroach-ephemeral]  1  (10h41m39s)
lin-ana                    [aws]                      9  (178h41m39s)
local-foo                  [local]                    4  (-)
radu-foo                   [gce:andrei-jepsen]        4  (12h41m39s)
radu-test                  [gce:cockroach-ephemeral]  4  (12h41m39s)
```

Release note: None

#### roachprod: don't remove LOCK file

We use a LOCK file during sync. We create the file, acquire an exclusive lock and at the end remove the file.

The removal of the file will fail if another process was waiting for the lock. Also, there is a race where we could be deleting the file that is in use by another process, and that would allow a third process to create the file again.

To fix these issues, we let the LOCK file be; there is no need to remove it - we are relying on `flock`, not on exclusive file creation.

Release note: None

#### roachprod: fix improperly wrapped errors

Partial backport of this commit:

*: fix improperly wrapped errors

I'm working on a linter that detects errors that are not wrapped correctly, and it discovered these.

Release note: None

#### roachprod: fix `roachprod start` ignoring --binary flag

Merging #71660 introduced a bug where roachprod ignores --binary flag when running `roachprod start`. This patch reverts to the old way of setting config.Binary.

Release note: None

Fixes #72425 #72420 #72373 #72372

#### roachprod: update doc on local clusters

The behavior changed in #71970.

Release note: None

#### pkg/roachprod: allow multiple-stores to be created on GCP

Port an existing flag from the AWS roachprod flags that allows multiple stores to be created. When this flag is enabled, multiple data directories are created and mounted as `/mnt/data{1..N}`.

Standardize the existing ext4 disk creation logic in the GCE setup script to match the AWS functionality. Interleave the existing ZFS setup commands based on the `--filesystem` flag.

Fix a bug introduced in #54986 that will always create multiple data disks, ignoring the value of the flag. This has the effect of never creating a RAID 0 array, which is the intended default behavior.

The ability to create a RAID 0 array on GCE VMs is required for the Pebble write-throughput benchmarks.

Release note: None

#### roachprod: move quiet determination out of the library

Moving the logic of automatically enabling Quiet in non-terminal output.

Release note: None

#### roachprod: clean up use of SyncedCluster

`SyncedCluster` is currently used to pass the cluster name (with optional node selector) and the settings. This is a misuse of the type and complicates things conceptually.

This change separates out the relevant settings into a new struct `ClusterSettings`. All commands now pass the cluster name and the `ClusterSettings` instead of passing a `SyncedCluster`.

Release note: None

Co-authored-by: Andrew Werner
Co-authored-by: Steven Danna
Co-authored-by: Ahmad Abedalqader
Co-authored-by: Rail Aliiev
Co-authored-by: rimadeodhar
Co-authored-by: Radu Berinde
Co-authored-by: Rafi Shamim
Co-authored-by: Tobias Grieger
Co-authored-by: Nick Travers
This issue tracks adding support for multiple local clusters to roachprod. This is necessary for multi-tenant support.