Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multitenant: timeseries double counts some metrics #108929

Closed
lidorcarmel opened this issue Aug 17, 2023 · 4 comments · Fixed by #109727
Closed

multitenant: timeseries double counts some metrics #108929

lidorcarmel opened this issue Aug 17, 2023 · 4 comments · Fixed by #109727
Assignees
Labels
A-multitenancy Related to multi-tenancy A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23

Comments

@lidorcarmel
Copy link
Contributor

lidorcarmel commented Aug 17, 2023

In c2c, source cluster, I see write BW to disk over 1.3GBps, but the GCP console says 0.66GBps. Looks like IOPS have the same issue. If I just look at the system tenant I still see the double BW, but if I look at the app tenant I see the real 0.66GBps number.

Screenshot 2023-08-17 at 2 42 25 PM Screenshot 2023-08-17 at 2 40 13 PM

Jira issue: CRDB-30702

@lidorcarmel lidorcarmel added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-multitenancy Related to multi-tenancy T-multitenant Issues owned by the multi-tenant virtual team labels Aug 17, 2023
@abarganier
Copy link
Contributor

Thanks for reporting, let me see what I can find. Will report back.

@abarganier
Copy link
Contributor

abarganier commented Aug 17, 2023

This is reproducible using a multitenant cockroach demo cluster.

Procedure:

  1. cockroach demo
  2. Open DB Console
  3. Open the Metrics page, such as "Hardware".
  4. Observe a value on one of the charts, such as "Host CPU Percent" with the "Tenant" dropdown set to "System" or "All"
  5. Now switch the "Tenant" dropdown to "demoapp" and observe the same chart. Note that its values are 1/2 of when the dropdown is set to "System" or "All".

I believe the problem here exists with the query side. The data in tsdb looks correct to me.

Let's focus on a metric that this occurs with (note: SQL level metrics appear to not be affected). We will choose cr.node.sys.cpu.host.combined.percent-normalized.

The MetricsRecorder records node-level metrics (including our CPU metric) to TSDB segmented by tenant. For example:

$ ./cockroach debug tsdump --certs-dir=/Users/abarganier/.cockroach-demo --format=text --host=localhost:26358
...
cr.node.sys.cpu.host.combined.percent-normalized 1
1692306000000000000 0.31227394648052653
1692306010000000000 0.30643540341695974
1692306020000000000 0.3125720067723935
1692306030000000000 0.3125720067723935
1692306040000000000 0.19457822044461875
cr.node.sys.cpu.host.combined.percent-normalized 1-2
1692306000000000000 0.31227394648052653
1692306010000000000 0.30643540341695974
1692306020000000000 0.3125720067723935
1692306030000000000 0.19457822044461875
1692306040000000000 0.17129056189003986

We can see the metric exists in CRDB twice, one with a source field of 1 (node 1 of the system tenant), and another with a source field of 1-2 (node 1, of the tenant with tenant ID 2, hence 1-2).

Both metrics alone are recorded correctly. However, when you select "All" or "System" from the dropdown on the metrics page, the tsdb query appears to be summing the values together across tenants, which leads to our reported values being 2x what they actually are.

We added the behavior for multitenant timeseries queries in #98077, which was made available in the UI with #92694. There was likely a misunderstanding in the UI code as to how the query API works, or perhaps the query code itself is not intuitive.

We should find the right solution here - does it exist in the query layer? Or in how the UI uses the query endpoint? We will get this scheduled with the team.

@Santamaura
Copy link
Contributor

We can see the metric exists in CRDB twice, one with a source field of 1 (node 1 of the system tenant), and another with a source field of 1-2 (node 1, of the tenant with tenant ID 2, hence 1-2).

The ui queries for metrics based on the selection of that dropdown; selecting "All" sets tenant_id=undefined, "System" sets tenant_id=1, "demoapp" sets tenant_id=2 so there's probably somewhere in the server side code that interprets tenant_id=undefined and tenant_id=1 as the same, will look some more

@Santamaura
Copy link
Contributor

So it seems like the current server side query behaviour is what I said previously; if it's system tenant or no tenant id provided we return the aggregated ts data across all tenants which I suppose is because in the initial implementation the ts data being viewed didn't need to be as granular but now that it does need to be more granular we should update the code to reflect this. Should have a PR up soon to fix this.

Santamaura added a commit to Santamaura/cockroach that referenced this issue Aug 30, 2023
Previously, ts queries would consider providing no
tenant id and the system tenant id as the same and
would return all the aggregated datapoints. This was
likely due to the original implementation considering
that the system tenant would always want to view all
the aggregated data.

This is not the case anymore since the system tenant has
the ability to view all the data, system tenant specific data
or other tenants data. Therefore this commit adjusts the server
query code so that if a system tenant id is provided, it returns
data for only the system tenant.

Fixes cockroachdb#108929

Release note (bug fix): adjust ts server queries
to be able to return system tenant only metrics if tenant
id is provided, this will fix an issue where some metrics
graphs appear to double count.
craig bot pushed a commit that referenced this issue Aug 31, 2023
109727: ts: update server queries to account for system tenant id r=Santamaura a=Santamaura

Previously, ts queries would consider providing no tenant id and the system tenant id as the same and would return all the aggregated datapoints. This was likely due to the original implementation considering that the system tenant would always want to view all the aggregated data.

This is not the case anymore since the system tenant has the ability to view all the data, system tenant specific data or other tenants data. Therefore this commit adjusts the server query code so that if a system tenant id is provided, it returns data for only the system tenant.

Fixes #108929

Release note (bug fix): adjust ts server queries
to be able to return system tenant only metrics if tenant id is provided, this will fix an issue where some metrics graphs appear to double count.

Some screenshots after the change:
All
<img width="1422" alt="Screenshot 2023-08-30 at 10 59 50 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/2ddfb7b8-1980-4b88-9b92-ec2cba5e48f0">

System
<img width="1422" alt="Screenshot 2023-08-30 at 11 00 25 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/5e7b18d7-4b9d-48dd-881c-4417bab104b1">

Tenant
<img width="1422" alt="Screenshot 2023-08-30 at 11 00 13 AM" src="https://github.com/cockroachdb/cockroach/assets/17861665/b02a6683-5277-4fa7-a212-2999db935fd4">



109793: storage: don't reread value during inline conditional writes r=itsbilal a=nvanbenschoten

This commit removes the call to maybeGetValue when performing an inline conditional write. In such cases, we will have already read the value in the call to mvccGetMetadata, so we don't need to do so again.

I'm not aware of any workloads that are sensitive to the performance of conditional inline writes and I also suspect that the positioning of the Pebble iterator was avoiding some work during the second seek, so this is mainly just intended to be a code simplification.

Epic: None
Release note: None

Co-authored-by: Alex Santamaura <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in 738d312 Aug 31, 2023
craig bot pushed a commit that referenced this issue Sep 14, 2023
109042: server,cli: separate storage-level and application-level metrics r=yuzefovich a=knz

Prerequisite to #102378. 
Informs #108929.
Epic:  CRDB-26691.

Context: we have various projects that want to know which metrics
belong to the application layer (and thus are instantiated and
collected anew in every tenant); and which belong to the storage/KV
layer (and thus exist only once per node in the storage layer).

Prior to this patch, it was hard to distinguish between them.

This patch enhances the situation as follows:

- It introduces the concept of "server layering" for metrics, a
  concept which we had already introduced earlier in the
  `TestServerInterface`:

  - "storage" designates the storage layer, and only contains
    metrics relevant to the storage/kv layer.

  - "application" designates the application layer, and only
    contains metrics relevant to the application layer.

  - "server" is a special (pseudo) layer which contains metrics
    defined process-wide, and are thus reported both in combined
    sql/kv nodes and sql-only servers.

- It also uses *separate metric registries* to collect the metrics
  at each layer during server initialization.

  This is the main component that we were missing for later projects.

- It also enhances the `ChartCatalog` API endpoint and introduces
  a new `cockroach gen metric-list` command to use it and
  auto-generate a list of metrics with their layer information.

Release note (cli change): A new `cockroach gen metric-list` is now
available that can generate metadata that describes the various
metrics collected by an (idle) server. Note that the list does
not include dynamic metric names whose names are generated
based on workload.

110624: sql: disable READ COMMITTED syntax by default r=chrisseto,nvanbenschoten a=rafiss

fixes #107980

Release note (sql change): The cluster setting
sql.txn.read_committed_syntax.enabled was added. It defaults to false. When set to true, the following statements will configure transactions to run under READ COMMITTED isolation, rather than being automatically interpeted as SERIALIZABLE.
- BEGIN TRANSACTION ISOLATION LEVEL READ COMMITTED
- SET TRANSACTION ISOLATION LEVEL READ COMMITTED
- SET default_transaction_isolation = 'read committed'
- SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ COMMITTED

This setting is added since READ COMMITTED transactions are a preview
feature, so usage of it is opt-in for v23.2. In a future CockroachDB major
version, this setting will change to default to true.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Nov 17, 2023
Previously, ts queries would consider providing no
tenant id and the system tenant id as the same and
would return all the aggregated datapoints. This was
likely due to the original implementation considering
that the system tenant would always want to view all
the aggregated data.

This is not the case anymore since the system tenant has
the ability to view all the data, system tenant specific data
or other tenants data. Therefore this commit adjusts the server
query code so that if a system tenant id is provided, it returns
data for only the system tenant.

Fixes #108929

Release note (bug fix): adjust ts server queries
to be able to return system tenant only metrics if tenant
id is provided, this will fix an issue where some metrics
graphs appear to double count.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-multitenancy Related to multi-tenancy A-observability-inf C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. db-cy-23
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants