multitenant: timeseries double counts some metrics #108929

lidorcarmel · 2023-08-17T19:55:01Z

In c2c, source cluster, I see write BW to disk over 1.3GBps, but the GCP console says 0.66GBps. Looks like IOPS have the same issue. If I just look at the system tenant I still see the double BW, but if I look at the app tenant I see the real 0.66GBps number.

Jira issue: CRDB-30702

abarganier · 2023-08-17T20:03:18Z

Thanks for reporting, let me see what I can find. Will report back.

abarganier · 2023-08-17T21:10:39Z

This is reproducible using a multitenant cockroach demo cluster.

Procedure:

cockroach demo
Open DB Console
Open the Metrics page, such as "Hardware".
Observe a value on one of the charts, such as "Host CPU Percent" with the "Tenant" dropdown set to "System" or "All"
Now switch the "Tenant" dropdown to "demoapp" and observe the same chart. Note that its values are 1/2 of when the dropdown is set to "System" or "All".

I believe the problem here exists with the query side. The data in tsdb looks correct to me.

Let's focus on a metric that this occurs with (note: SQL level metrics appear to not be affected). We will choose cr.node.sys.cpu.host.combined.percent-normalized.

The MetricsRecorder records node-level metrics (including our CPU metric) to TSDB segmented by tenant. For example:

$ ./cockroach debug tsdump --certs-dir=/Users/abarganier/.cockroach-demo --format=text --host=localhost:26358
...
cr.node.sys.cpu.host.combined.percent-normalized 1
1692306000000000000 0.31227394648052653
1692306010000000000 0.30643540341695974
1692306020000000000 0.3125720067723935
1692306030000000000 0.3125720067723935
1692306040000000000 0.19457822044461875
cr.node.sys.cpu.host.combined.percent-normalized 1-2
1692306000000000000 0.31227394648052653
1692306010000000000 0.30643540341695974
1692306020000000000 0.3125720067723935
1692306030000000000 0.19457822044461875
1692306040000000000 0.17129056189003986

We can see the metric exists in CRDB twice, one with a source field of 1 (node 1 of the system tenant), and another with a source field of 1-2 (node 1, of the tenant with tenant ID 2, hence 1-2).

Both metrics alone are recorded correctly. However, when you select "All" or "System" from the dropdown on the metrics page, the tsdb query appears to be summing the values together across tenants, which leads to our reported values being 2x what they actually are.

We added the behavior for multitenant timeseries queries in #98077, which was made available in the UI with #92694. There was likely a misunderstanding in the UI code as to how the query API works, or perhaps the query code itself is not intuitive.

We should find the right solution here - does it exist in the query layer? Or in how the UI uses the query endpoint? We will get this scheduled with the team.

Santamaura · 2023-08-29T19:01:40Z

We can see the metric exists in CRDB twice, one with a source field of 1 (node 1 of the system tenant), and another with a source field of 1-2 (node 1, of the tenant with tenant ID 2, hence 1-2).

The ui queries for metrics based on the selection of that dropdown; selecting "All" sets tenant_id=undefined, "System" sets tenant_id=1, "demoapp" sets tenant_id=2 so there's probably somewhere in the server side code that interprets tenant_id=undefined and tenant_id=1 as the same, will look some more

Santamaura · 2023-08-29T21:28:59Z

So it seems like the current server side query behaviour is what I said previously; if it's system tenant or no tenant id provided we return the aggregated ts data across all tenants which I suppose is because in the initial implementation the ts data being viewed didn't need to be as granular but now that it does need to be more granular we should update the code to reflect this. Should have a PR up soon to fix this.

Previously, ts queries would consider providing no tenant id and the system tenant id as the same and would return all the aggregated datapoints. This was likely due to the original implementation considering that the system tenant would always want to view all the aggregated data. This is not the case anymore since the system tenant has the ability to view all the data, system tenant specific data or other tenants data. Therefore this commit adjusts the server query code so that if a system tenant id is provided, it returns data for only the system tenant. Fixes cockroachdb#108929 Release note (bug fix): adjust ts server queries to be able to return system tenant only metrics if tenant id is provided, this will fix an issue where some metrics graphs appear to double count.

109727: ts: update server queries to account for system tenant id r=Santamaura a=Santamaura Previously, ts queries would consider providing no tenant id and the system tenant id as the same and would return all the aggregated datapoints. This was likely due to the original implementation considering that the system tenant would always want to view all the aggregated data. This is not the case anymore since the system tenant has the ability to view all the data, system tenant specific data or other tenants data. Therefore this commit adjusts the server query code so that if a system tenant id is provided, it returns data for only the system tenant. Fixes #108929 Release note (bug fix): adjust ts server queries to be able to return system tenant only metrics if tenant id is provided, this will fix an issue where some metrics graphs appear to double count. Some screenshots after the change: All <img width="1422" alt="Screenshot 2023-08-30 at 10 59 50 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/2ddfb7b8-1980-4b88-9b92-ec2cba5e48f0"> System <img width="1422" alt="Screenshot 2023-08-30 at 11 00 25 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/5e7b18d7-4b9d-48dd-881c-4417bab104b1"> Tenant <img width="1422" alt="Screenshot 2023-08-30 at 11 00 13 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/b02a6683-5277-4fa7-a212-2999db935fd4"> 109793: storage: don't reread value during inline conditional writes r=itsbilal a=nvanbenschoten This commit removes the call to maybeGetValue when performing an inline conditional write. In such cases, we will have already read the value in the call to mvccGetMetadata, so we don't need to do so again. I'm not aware of any workloads that are sensitive to the performance of conditional inline writes and I also suspect that the positioning of the Pebble iterator was avoiding some work during the second seek, so this is mainly just intended to be a code simplification. Epic: None Release note: None Co-authored-by: Alex Santamaura <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>

109042: server,cli: separate storage-level and application-level metrics r=yuzefovich a=knz Prerequisite to #102378. Informs #108929. Epic: CRDB-26691. Context: we have various projects that want to know which metrics belong to the application layer (and thus are instantiated and collected anew in every tenant); and which belong to the storage/KV layer (and thus exist only once per node in the storage layer). Prior to this patch, it was hard to distinguish between them. This patch enhances the situation as follows: - It introduces the concept of "server layering" for metrics, a concept which we had already introduced earlier in the `TestServerInterface`: - "storage" designates the storage layer, and only contains metrics relevant to the storage/kv layer. - "application" designates the application layer, and only contains metrics relevant to the application layer. - "server" is a special (pseudo) layer which contains metrics defined process-wide, and are thus reported both in combined sql/kv nodes and sql-only servers. - It also uses *separate metric registries* to collect the metrics at each layer during server initialization. This is the main component that we were missing for later projects. - It also enhances the `ChartCatalog` API endpoint and introduces a new `cockroach gen metric-list` command to use it and auto-generate a list of metrics with their layer information. Release note (cli change): A new `cockroach gen metric-list` is now available that can generate metadata that describes the various metrics collected by an (idle) server. Note that the list does not include dynamic metric names whose names are generated based on workload. 110624: sql: disable READ COMMITTED syntax by default r=chrisseto,nvanbenschoten a=rafiss fixes #107980 Release note (sql change): The cluster setting sql.txn.read_committed_syntax.enabled was added. It defaults to false. When set to true, the following statements will configure transactions to run under READ COMMITTED isolation, rather than being automatically interpeted as SERIALIZABLE. - BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED - SET TRANSACTION ISOLATION LEVEL READ COMMITTED - SET default_transaction_isolation = 'read committed' - SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED This setting is added since READ COMMITTED transactions are a preview feature, so usage of it is opt-in for v23.2. In a future CockroachDB major version, this setting will change to default to true. Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Rafi Shamim <[email protected]>

Previously, ts queries would consider providing no tenant id and the system tenant id as the same and would return all the aggregated datapoints. This was likely due to the original implementation considering that the system tenant would always want to view all the aggregated data. This is not the case anymore since the system tenant has the ability to view all the data, system tenant specific data or other tenants data. Therefore this commit adjusts the server query code so that if a system tenant id is provided, it returns data for only the system tenant. Fixes #108929 Release note (bug fix): adjust ts server queries to be able to return system tenant only metrics if tenant id is provided, this will fix an issue where some metrics graphs appear to double count.

lidorcarmel added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team labels Aug 17, 2023

abarganier added the T-observability-inf label Aug 17, 2023

blathers-crl bot added the A-observability-inf label Aug 17, 2023

exalate-issue-sync bot added T-observability-inf and removed T-observability-inf T-multitenant Issues owned by the multi-tenant virtual team labels Aug 17, 2023

abarganier added the db-cy-23 label Aug 18, 2023

knz mentioned this issue Aug 18, 2023

server,cli: separate storage-level and application-level metrics #109042

Merged

exalate-issue-sync bot assigned Santamaura Aug 29, 2023

Santamaura mentioned this issue Aug 30, 2023

ts: update server queries to account for system tenant id #109727

Merged

craig bot closed this as completed in 738d312 Aug 31, 2023

blathers-crl bot mentioned this issue Nov 17, 2023

release-23.1: ts: update server queries to account for system tenant id #114662

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multitenant: timeseries double counts some metrics #108929

multitenant: timeseries double counts some metrics #108929

lidorcarmel commented Aug 17, 2023 •

edited by cockroach-jira-scripts

Loading

abarganier commented Aug 17, 2023

abarganier commented Aug 17, 2023 •

edited

Loading

Santamaura commented Aug 29, 2023

Santamaura commented Aug 29, 2023

multitenant: timeseries double counts some metrics #108929

multitenant: timeseries double counts some metrics #108929

Comments

lidorcarmel commented Aug 17, 2023 • edited by cockroach-jira-scripts Loading

abarganier commented Aug 17, 2023

abarganier commented Aug 17, 2023 • edited Loading

Santamaura commented Aug 29, 2023

Santamaura commented Aug 29, 2023

lidorcarmel commented Aug 17, 2023 •

edited by cockroach-jira-scripts

Loading

abarganier commented Aug 17, 2023 •

edited

Loading