-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: using custom AdminUIPort breaks automated prometheus scraping #117125
Closed
Comments
srosenberg
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-testeng
TestEng Team
labels
Dec 28, 2023
cc @cockroachdb/test-eng |
DarrylWong
added a commit
to DarrylWong/fork
that referenced
this issue
Dec 28, 2023
The change cockroachdb#114097 removed default port assumptions for SQLPort and AdminUIPort, and instead finds open ports to assign. However, Prometheus scraping relies on this hardcoded AdminUIPort as it can't discover the port itself. This change temporarily reverts AdminUIPort back to the hardcoded port. Release note: none Epic: none Informs: cockroachdb#117125
craig bot
pushed a commit
that referenced
this issue
Dec 28, 2023
117141: roachprod: revert AdminUIPort back to DefaultAdminUIPort r=srosenberg a=DarrylWong The change #114097 removed default port assumptions for SQLPort and AdminUIPort, and instead finds open ports to assign. However, Prometheus scraping relies on this hardcoded AdminUIPort as it can't discover the port itself. This change temporarily reverts AdminUIPort back to the hardcoded port. Release note: none Epic: none Informs: #117125 Co-authored-by: DarrylWong <[email protected]>
DarrylWong
added a commit
to DarrylWong/fork
that referenced
this issue
Jan 8, 2024
…tart-node-one This test relies on assumption `adminUIPort = SQLPort + 1` but adminui is temporarily hardcoded as 26258 while SQLPort will be dynamically assigned without the following override. This can be reverted once cockroachdb#117125 is resolved. Release note: None
DarrylWong
added a commit
to DarrylWong/fork
that referenced
this issue
Mar 7, 2024
The change cockroachdb#114097 removed default port assumptions for SQLPort and AdminUIPort, and instead finds open ports to assign. However, Prometheus scraping relies on this hardcoded AdminUIPort as it can't discover the port itself. This change temporarily reverts AdminUIPort back to the hardcoded port. Release note: none Epic: none Informs: cockroachdb#117125
cockroach-dev-inf
pushed a commit
that referenced
this issue
May 6, 2024
In order to support multiple tenants on the same host, a unique, custom port can be assigned for DefaultAdminUIPort. This breaks because the port is hard-coded in prometheus config for scraping. The change is to have a http server running in prometheus server that can dynamically update the configuration when a new cluster is brought up. More details of the solution is explained in the page - https://cockroachlabs.atlassian.net/wiki/spaces/~7120207825326fb5e546c194029506f2c5335e/pages/3458531376/Dynamic+Scrape+Configs+on+Prometheus+for+Roachprod Fixes: #117125 Epic: none
cockroach-dev-inf
pushed a commit
that referenced
this issue
May 7, 2024
In order to support multiple tenants on the same host, a unique, custom port can be assigned for DefaultAdminUIPort. This breaks because the port is hard-coded in prometheus config for scraping. The change is to have a http server running in prometheus server that can dynamically update the configuration when a new cluster is brought up. More details of the solution is explained in the page - https://cockroachlabs.atlassian.net/wiki/spaces/~7120207825326fb5e546c194029506f2c5335e/pages/3458531376/Dynamic+Scrape+Configs+on+Prometheus+for+Roachprod Fixes: #117125 Epic: none
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 7, 2024
In order to support multiple tenants on the same host, a unique, custom port can be assigned for DefaultAdminUIPort. This breaks because the port is hard-coded in prometheus config for scraping. The change is to have a http server running in prometheus server that can dynamically update the configuration when a new cluster is brought up. More details of the solution is explained in the page - https://cockroachlabs.atlassian.net/wiki/spaces/~7120207825326fb5e546c194029506f2c5335e/pages/3458531376/Dynamic+Scrape+Configs+on+Prometheus+for+Roachprod Fixes: cockroachdb#117125 Epic: none
craig bot
pushed a commit
that referenced
this issue
May 9, 2024
123619: roachprod: support for dynamic admin url port r=srosenberg,renatolabs,herkolategan a=nameisbhaskar In order to support multiple tenants on the same host, a unique, custom port can be assigned for DefaultAdminUIPort. This breaks because the port is hard-coded in prometheus config for scraping. The change is to have a http server running in prometheus server that can dynamically update the configuration when a new cluster is brought up. More details of the solution is explained in the page - https://cockroachlabs.atlassian.net/wiki/spaces/~7120207825326fb5e546c194029506f2c5335e/pages/3458531376/Dynamic+Scrape+Configs+on+Prometheus+for+Roachprod Fixes: #117125 Epic: none Co-authored-by: Bhaskarjyoti Bora <[email protected]>
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 13, 2024
A recent change to support dynamic cluster port was introduced. But, that was not considering secured clusters This change supports secure cluster Fixes: cockroachdb#117125 Epic: none
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 13, 2024
A recent change to support dynamic cluster port was introduced. But, that was not considering secured clusters This change supports secure cluster Fixes: cockroachdb#117125 Epic: none
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 13, 2024
A recent change to support dynamic cluster port was introduced. But, that was not considering secured clusters This change supports secure cluster Fixes: cockroachdb#117125 Epic: none
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 14, 2024
A recent change to support dynamic cluster port was introduced. But, that was not considering secured clusters This change supports secure cluster Fixes: cockroachdb#117125 Epic: none
craig bot
pushed a commit
that referenced
this issue
May 14, 2024
124010: roachprod: secure clusters in dynamic admin url port r=DarrylWong,vidit-bhat,renatolabs a=nameisbhaskar A recent change to support dynamic cluster port was introduced. But, that was not considering secured clusters This change supports secure cluster Fixes: #117125 Epic: none Co-authored-by: Bhaskarjyoti Bora <[email protected]>
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 15, 2024
Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: cockroachdb#117125 Epic: none
nameisbhaskar
added a commit
to nameisbhaskar/cockroach
that referenced
this issue
May 16, 2024
Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: cockroachdb#117125 Epic: none
craig bot
pushed a commit
that referenced
this issue
May 16, 2024
124099: roachprod: fetch secrets from cloud store r=srosenberg a=nameisbhaskar Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: #117125 Epic: none Co-authored-by: Bhaskarjyoti Bora <[email protected]>
miyamo2
added a commit
to miyamo2/cockroach
that referenced
this issue
May 17, 2024
mod comment lease: remove outdated NightlyStress call This test config is not used anymore. Release note: None roachtest: increase timeout of acceptance/multitenant-multiregion This has flaked due to timeout in CI. Release note: None sqlerrors: move gloval var errNoZoneConfigApplies into sqlerrors package This commit is a mechnical change that moves the gloval variable `errNoZoneConfigApplies` from package `sql` to package `sqlerrors`, as it will be used by the DSC when we support zone config in it. Informs: cockroachdb#117574 Release note: None scbuildstmt: Rerewrite a few function with better APIs This commit cleaned up a few functions using the preferred .FilterXxx() API to retrieve element from an element set. No functional change is introduced. Informs: cockroachdb#117574 Release note: None roachtest: update dsc job compat to mixed version v241-v242 This patch updates the dsc job compat roachtest to `declarative_schema_changer/job-compatibility-mixed-version-V241-V242`. Epic: CRDB-35306 Informs: cockroachdb#116395 Release note: None admission: introduce elastic io token exhausted duration metric This patch adds a new metric `elastic_io_tokens_exhausted_duration.kv`. This is similar to the existing `admission.granter.io_tokens_exhausted_duration.kv`, but for elastic traffic. The patch also does some code refactoring to make it easier to use both regular and elastic equivalent metrics. Informs cockroachdb#121574. Release note: None sql: expose an ability to request redacted stmt bundles This commit extends the stmt bundle collection infrastructure to support requesting redacted stmt bundles. Previously, this was only possible via an explicit `EXPLAIN ANALYZE (DEBUG, REDACT)` stmt, but this commit adds the support via overloads to `crdb_internal.request_statement_bundle` builtin as well as the server API. This paves the way for requesting the redacted bundles via the DB Console, but it'll be done separately. To support this we need to add another boolean column to `system.statement_diagnostics_requests` table to indicate whether a request is for the redacted bundle or not. This requires adding a migration in which we populate the existing rows with all `false` values. In order to reduce the amount of code churn and simplify this change, this commit repurposes some of the existing code that we introduced the last time we added a migration for plan-gist matching in stmt bundles. In other words, this commit simultenously removes some stale code and handles the new column. Release note: None sql: simplify usage of some enum settings This commit adjusts a few enum settings to use `EnumSetting.String` when defining the session variables to clean up the code a bit. Release note: None execinfra: add a sanity check that DistSQL version is not bumped The last bump was in 23.1, and we won't ever bump DistSQL version in backwards-incompatible going forward. Epic: None Release note: None Updated the AUTHORS file to include Naveen Setlur sqlstats: remove unnecessary StatementFingerprintID construction Previously we were constructing the stmt fingerprint id unnecessarily in 2 functions in sqlstats: - The Next function for the StmtStatsIterator - PopAllStats, which returns the sql stats collected thus far and resets the in-memory sql stats containers. The ID is available in the recorded statement in both of these cases- there's no need to reconstruct this value and doing is needlessly expensive. Epic: none Release note: None pgwire: add a metric for number of pipelined requests Release note (ops change): Added the sql.pgwire.pipeline.count metric, which is a gauge that measures how many wire protocol commands have been received by the server, but have not yet begun processing. This metric will only grow if clients are using the "pipeline mode" of the Postgres wire protocol. roachtest: skip multi-store-remove while debugging This roachtest has failed consistently for the past few days, since being enabled in cockroachdb#123694. Skip while we debug, to avoid noise. Touches: cockroachdb#123989. Release note: None. clusterversion: remove deprecated internal versions Remove deprecated 23.2 gates - these features will always be enabled in any cluster that a 24.2 node is part of. `TODODelete_V23_1` has a lot of uses and will be removed separately. Epic: none Release note: None sql: fix the type of copy_num_retries_per_batch We were incorrectly using the boolean getter for this integer variable. The bug was introduced when this variable was added in 1c1d313. Release note: None sql: fix cost_scans_with_default_col_size This commit fixes an omission where `cost_scans_with_default_col_size` session variable didn't actually use the corresponding cluster setting for its default value. In other words, the cluster setting actually had no effect. The bug has been present since this variable was introduced in af32d9d. Release note: None sql: fix up defaults for a few session variables This commit adjusts global defaults for a few session variables to use the postgres style (so that the global default value and session variable value would be stored alike). Release note: None sql: improve env collection in stmt bundles This commit refactors how we collect session variables and cluster settings into `env.sql` file in the stmt bundle. Previously, we used a hard-coded list of session variables that were always included while ignoring all variables not in the list. If a variable had a value different from the "binary default" (meaning the value that you got right after the cluster startup, before any cluster setting modifications are applied), then it would be in a SET statement that would run on the `bundle recreate`; if the value was the same as "binary default", then the SET statement was commented out. Also, all cluster setting values were included into the file as a comment. This commit makes it so that we include all session variables and cluster settings that differ from their defaults. This allows us to recreate the environment while also reducing the overhead of including everything into `env.sql`. For session variables further clarification is warranted: - all read-only variables have been audited, and most are explicitly excluded. Out of 23 currently present session variables, only 9 were deemed to be possibly useful during investigations, so they are not excluded (one of 9 is `crdb_version` which we print out separately, so it's actually excluded too). - for writable variables, we include it if its value differs from the "binary" default (the one that you get on a fresh cluster) or from the "cluster" default (the one that you get on a fresh session on the current cluster). The latter kind is obtained via the provided `settings.Values` container whereas the former is obtained via a global singleton container. Additionally, this commit adjusts the SET and SET CLUSTER SETTING statements to have single quotes around the setting values that need them. For cluster settings all types work with having single quotes, but for session variables some (like integers) don't work with quotes while others (like strings) need them. This commit adds a map for those that need them as well as a simple test to catch some of the missing ones (the list might be incomplete given that the test exercises the default config). All values are adjusted to fit on a single line (we have some cluster settings that might span multiple lines). It also adjusts `EXPLAIN (OPT, ENV)` to include the list of cluster settings with non-default values. Release note: None roachprod: fetch secrets from cloud store Due to the complexity of fetching the secrets from the secrets manager, the secrets are now maintained in cloud storage. Fixes: cockroachdb#117125 Epic: none roachprod: deploy cockroach Previously, to upgrade a cluster it had to be done manually by running several commands to copy over new binaries, drain and stop cockroach and then running the same start command. This change introduces a `deploy` command that works similar as `stage` but also deploys the new version to the cluster by doing all of the above. It does it node for node to ensure a cluster remains in a healthy state especially if a load balancer is used to manage connections. Epic: None Release Note: None roachprod: add deploy applications explanation Epic: None Release Notes: None roachtest: add cluster settings operations This change adds operations to mutate cluster settings. The values are mutated based on a preset frequency and RNG. For example, if the frequency of a cluster setting operation is set to "30 minutes" it will only change to a new value every 30 minutes, even if the operation has been run multiple times within the same 30-minute window. Running in the same window will just set the same value again. Epic: None Release Note: None roachtest: fix `OperationRequiresNodes` dependency check Epic: None Release Note: None roachprod: cluster settings operation owners Epic: None Release Notes: None roachprod: fix UT to ignore the yaml format currently, the test is looking for the exact match of the yaml. this can fail as the yaml is generated from a map and the values may not be in the same sequence. a better approach is to use a yaml parser, which will be done very soon. Fixes: cockroachdb#124276 Epic: none pgwire: deflake TestPipelineMetric The wrong context was used for the server, which meant that it could get cancelled too early. Release note: None roachprod, roachtest: use same cluster name sanitization Previously, roachtest had it's own function to sanitize cluster names, while roachprod had it's own function to verify cluster names. This change removes both and opts instead to use vm.DNSSafeName. This change also introduces MalformedClusterNameError which gives a hint on what is wrong with the name and tells roachtest not to retry cluster creation. intentresolver: mark the test heavy to prevent OOM Epic: none Release note: none ui: move enum type to avoid circular dependencies When importing the ViewMode enum into a unit test, an error is thrown in localStorage.reducer.ts relating to ViewMode being undefined. This is occurring due to a circular dependency. To fix, ViewMode is moved to a separate types.ts file Release note: None ui: prevent /_status/nodes calls for secondary tenants /_status/nodes is not implemented for secondary tenants, resulting in 501 responses in cloud for serverless tenants. To fix, this api is now gated by a tenant check and will no longer be called by serverless / secondary tenants. Fixes: CC-26900 Epic: None Release note: None cli/doctor: doctor should read from the right jobs table In cockroachdb#97762, we started writing a job's payload (and progress) information to the `system.jobs_info` table. As a result, we had to change the parts of our code that relied on the `system.jobs` table to use `crdb_internal.system_jobs` instead (since that table would inaccurately report that some `payload`s were `NULL`). This change did not occur for the in-memory representation of the jobs table created by debug doctor -- which can result in missing job false-positives. This patch updates debug doctor's representation of the jobs table by referring to `crdb_internal.system_jobs` instead. Epic: none Fixes: cockroachdb#122675 Release note: None doctor: skip validation for dropped descriptors In some cases, dropped descriptors appear in our `system.descriptors` table with dangling job mutations without an associated job. This patch teaches debug doctor examine to skip validation on such dropped descriptors. Epic: none Fixes: cockroachdb#123477 Fixes: cockroachdb#122956 Release note: none backupccl: test that we actually downloaded all data during online restore Release note: none Epic: none roachtest: check downloaded spans in online restore roundtrip test Epic: none Release note: none workflows: set timeouts for GitHub Actions Essential CI jobs I've selected the timeouts according to how long each job seems to run in practice, with a big buffer to allow for load on the remote execution cluster, etc. The longest-running jobs I've given 60 minutes, with less time for the shorter jobs. Epic: CRDB-8308 Release note: None master: Update pkg/testutils/release/cockroach_releases.yaml Update pkg/testutils/release/cockroach_releases.yaml with recent values. Epic: None Release note: None Release justification: test-only updates cdctest: simplify orderValidator.NoteRow code This patch eliminates unnecessary nesting in orderValidator's NoteRow method. Release note: None cdctest: rename MakeCountValidator to NewCountValidator This patch renames `MakeCountValidator` to `NewCountValidator` to reflect that it returns a pointer, not a struct. Release note: None roachtest: rename cdc/sink-chaos to cdc/kafka-chaos and add more logging This patch renames the `cdc/sink-chaos` test to `cdc/kafka-chaos` to more accurately reflect that it is a Kafka-only test. It also adds logging for the Kafka chaos loop iteration number so that we can tell when the Kafka cluster is restarting from the logs. Release note: None roachtest: add order validation to CDC Kafka roachtests This patch adds order validation to CDC Kafka roachtests so that we can build more confidence in our ordering guarantees. It can be enabled for a roachtest either by directly setting the `validateOrder` flag on a `kafkaManager` before creating consumers, or indirectly by setting the `validateOrder` flag on `kafkaFeedArgs` for tests that use `cdcTester`. Release note: None sql: alter primary key hangs if index references exist Previously, ALTER PRIMARY KEY could hang if an index reference existed from either a view or function using the index selection syntax. This was becasue neither the declarative schema changer or legacy schema changer support updating these references. To address this, this patch will prevent ALTER PRIMARY KEY if such references exist to any indexes that will be recreated. Fixes: cockroachdb#123017 Release note (bug fix): ALTER PRIMARY KEY could hang if the table had any indexes which were referred to by views or functions using the force index syntax. roachprod: remove logging from updatePrometheusTargets It's hard to tell what the logging is about when it happens in the context of a roachtest. It's also verbose as it prints one log line for each VM whenever a test starts cockroach. Epic: none Release note: None roachtest: explicitly close monitor reader after closing session Previously the reader did not close, even though the session is closed. This change ensures the reader is closed by closing it after the session has been closed. This seems to have only been a problem with local clusters. Fixes: cockroachdb#116314 Epic: None Release Note: None roachtest: lower max-upgrades in `tpcc/mixed-headroom` Avoids timeouts. Fixes: cockroachdb#124264 Release note: None scripts: update `drtprod` create drt-large Bring the creation in line with what is currently running on cockroach-drt. Uses zone expansion to balance the nodes evenly across regions. Epic: None Release Note: None scripts: update `drtprod` drt-large region In an effort to save cost on regional network transfers the Warsaw Europe Zone will now be switched to Columbus. Epic: None Release Note: None backupccl: deflake TestDataDriven_external_connections_privileges This test already tests the permissions end-to-end by asserting that the user can or cannot complete the expected actions. No need to print the system database as well. Fixes cockroachdb#123379 Release note: None externalconn: implement `SHOW EXTERNAL CONNECTIONS`. This new query displays redacted URIs of external connections along with other valuable information. Epic: CRDB-15001 Fixes: cockroachdb#85905 Release note (sql change): New queries `SHOW EXTERNAL CONNECTIONS` and `SHOW EXTERNAL CONNECTION <connection name>` have been added. These queries display redacted connection URIs and other useful information such as the connection type. Access to these queries is restricted to the owner of the connection or users with `USAGE` privilege. Admin or root users will have unrestricted access to all connections. kvstorage: speed up scan for range descriptors On startup we currently scan through all `/Local/Range/` keys in order to find the range descriptors. This involves going through all `/Local/Range/Transaction` records and tombstones so we end up reading through a lot of data. This change switches to using `SeekGE`: - when we encounter a key that has a suffix before `rdsc` like `/Local/Range/foo/prbe`, we seek directly to the corresponding descriptor key `/Local/Range/foo/rdsc`; - after we decode a descriptor, we seek straight to the next possible range descriptor `/Local/Range/EndKey/rdsc` using the `EndKey` in the descriptor. Note that inside Pebble, seeks to keys that are very close to the current position are optimized to try to iterate through the next few keys before doing a real seek. So this change should not be detrimental even when there aren't a lot of keys to skip. I experimented on a store captured after running kv0 with batch size 2 against a single node. Before this change we step through 353 keys but the iterator has to internally step through 160k tombstones: ``` range descriptor iteration done: 339 keys, 65 range descriptors (by suffix: map[qlpt:81 rdsc:65 txn-:193]); scan stats: stepped 353 times (169.743k internal); seeked 2 times (2 internal)... ``` After, we seek around the transaction records and end up doing just a seek per range: ``` range descriptor iteration done: 65 range descriptors, 0 intents, 0 tombstones +‹(interface (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)), (internal (dir, seek, step): (fwd, 67, 0), (rev, 0, 0)),› +‹(internal-stats: (block-bytes: (total 563KB, cached 26KB, read-time 755.128µs))... ``` Fixes: cockroachdb#109740 kvserver: read lease under mutex when switching lease type A race could occur when a replica queue and post lease application both attempted to switch the lease type. This race would cause the queue to not process the replica because the lease type had already changed. As a result, lease preference violations might not have been quickly resolved by the lease queue. Read the lease under the same mutex used for requesting the lease, when possibly switching the lease type. Resolves: cockroachdb#123998 Release note: None logictest: skip flaky select_for_update test Informs: cockroachdb#124197. Epic: None Release note: None mod comment roachtest: save cluster earlier if debug-always is set When the --debug-always flag is passed, we mark the cluster as saved to let the cluster destroy know not to destroy it. Previously, we only marked a cluster as saved at the end of a test. However, if the caller terminated the test through ctrl c, which also destroys all clusters, it would lead to a race condition where sometimes the cluster would not be saved in time and get destroyed. This change moves the cluster save to immediantly after the cluster is created. Release note: none Fixes: none Epic: none
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Per [1], in order to support multiple tenants on the same host, a unique, custom SQLPort is assigned (instead of
DefaultSQLPort
), similarly forDefaultAdminUIPort
. The latter breaks current integration with Prometheus' service discovery which is configured to scrape onDefaultAdminUIPort
(26258). Per [2], some tests revert to it via an explicit call toSetDefaultAdminUIPort
. Otherwise, a custom AdminUIPort is used, in which case the metrics no longer show up in Grafana. Thus, a number of roachtest failures are now without metrics; e.g., grafana link in [3] yields no data.[1] #111052
[2] #114097
[3] #116992
Jira issue: CRDB-34993
The text was updated successfully, but these errors were encountered: