
stmtdiagnostics: implement range feed on system.statement_diagnostics_requests to reduce latency #47893

Open
tbg opened this issue Apr 22, 2020 · 19 comments · Fixed by #107555
Labels
A-multitenancy (Related to multi-tenancy) · T-multitenant (Issues owned by the multi-tenant virtual team)

Comments

@tbg
Member

tbg commented Apr 22, 2020

We should implement a range feed on the system.statement_diagnostics_requests table in order to remove the polling that currently happens every 10s. See this comment for some pointers.
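
For reference, the registry currently discovers new requests via a timer-driven poll of the table. A minimal sketch of that pattern is below; the 10s interval comes from this issue, while the pollRequests helper and the logging are hypothetical stand-ins rather than the actual registry code.

package stmtdiagsketch

import (
	"context"
	"log"
	"time"
)

// pollLoop sketches the polling a rangefeed would replace: every interval the
// registry re-reads system.statement_diagnostics_requests, so a freshly issued
// request can wait up to a full interval before any node notices it.
func pollLoop(ctx context.Context, interval time.Duration, pollRequests func(context.Context) error) {
	ticker := time.NewTicker(interval) // currently on the order of 10s
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := pollRequests(ctx); err != nil {
				log.Printf("polling statement diagnostics requests: %v", err)
			}
		}
	}
}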

Jira issue: CRDB-4382

Epic CRDB-18185

@tbg added the A-multitenancy (Related to multi-tenancy) label Apr 22, 2020
@tbg changed the title from "stmtdiagnostics: needs rework for multi-tenancy phase 2" to "stmtdiagnostics: needs rework for multi-tenancy" Apr 22, 2020
@RaduBerinde
Member

@andreimatei and @ajwerner discussed this quite a bit during the initial implementation. The gossip solution is seen as temporary indeed.

@ajwerner
Contributor

Yep, this is on my list.

@ajwerner
Contributor

Worst case, we disable the gossip without anything better initially, and clients will be exposed to a bit of extra latency when requesting statement diagnostics.

@tbg
Member Author

tbg commented Sep 4, 2020

@ajwerner were you planning on touching this (I assume not anytime soon) and is this good enough for now as-is? We are nominally considering this a blocker still but it doesn't seem to be.

@tbg
Member Author

tbg commented Sep 4, 2020

As you touch this issue, please also put it in the appropriate project. I would put it in SQL-Execution but I don't know if they own this.

@ajwerner
Contributor

ajwerner commented Sep 4, 2020

Do we need to ensure that the gossip isn't called or does it just no-op? If it no-ops then we're good and don't need to do anything. We're not exposing the statements page for tenants right now anyway, and even if we were, they poll periodically, so it's just higher latency.

@tbg
Member Author

tbg commented Sep 8, 2020

From my reading of the stmtdiag code, it will return an error (unsupported w/ multi-tenancy), which is surfaced to the requester of the stmt diag report. This is fine, so no need to do anything right now.
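
(Illustration only: a minimal sketch of that kind of guard. The notifyNodes helper, the boolean flag, and the error text are hypothetical stand-ins for the real stmtdiag/gossip code.)

package stmtdiagsketch

import "errors"

// errUnsupportedWithMultiTenancy stands in for the error described above; the
// real code constructs it differently, but the effect is the same: the caller
// requesting a statement diagnostics report sees the error.
var errUnsupportedWithMultiTenancy = errors.New(
	"unsupported with multi-tenancy; see issue #47893")

// notifyNodes is a hypothetical stand-in for the gossip-based notification of
// a new diagnostics request. Without gossip (as in a tenant pod) it fails, and
// other nodes only learn about the request via the periodic poll.
func notifyNodes(gossipAvailable bool, requestID int64) error {
	if !gossipAvailable {
		return errUnsupportedWithMultiTenancy
	}
	// ... gossip requestID so other nodes pick it up without waiting for a poll ...
	return nil
}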

@RaduBerinde self-assigned this Feb 10, 2021
@RaduBerinde
Member

To fix this, we can use a range feed on the diagnostic requests table.

@irfansharif
Contributor

To fix this, we can use a range feed on the diagnostic requests table.

We have libraries now to make this pretty plug-n-play:

// Watcher is used to implement a consistent cache over spans of KV data
// on top of a RangeFeed. Note that while rangefeeds offer events as they
// happen at low latency, a consistent snapshot cannot be constructed until
// resolved timestamp checkpoints have been received for the relevant spans.
// The watcher internally buffers events until the complete span has been
// resolved to some timestamp, at which point the events up to that timestamp
// are published as an update. This will take on the order of the
// kv.closed_timestamp.target_duration cluster setting (default 3s).
//
// If the buffer overflows (as dictated by the buffer limit the Watcher is
// instantiated with), the old rangefeed is wound down and a new one
// re-established. The client interacts with data from the RangeFeed in two
// ways, firstly, by translating raw KVs into kvbuffer.Events, and by handling
// a batch of such events when either the initial scan completes or the
// frontier changes. The OnUpdateCallback which is handed a batch of events,
// called an Update, is informed whether the batch of events corresponds to a
// complete or incremental update.
//
// It's expected to be Start-ed once. Start internally invokes Run in a retry
// loop.
//
// The expectation is that the caller will use a mutex to update an underlying
// data structure.
//
// NOTE (for emphasis): Update events after the initial scan are published at a
// delay corresponding to kv.closed_timestamp.target_duration (default 3s).
// Users seeking to leverage the Updates which arrive with that delay but also
// react to the row-level events as they arrive can hijack the translateEvent
// function to trigger some non-blocking action.
type Watcher struct {

This one, for example, maintains such a feed (in the tenant pod, over a tenant system table):

// Cache caches a set of KVs in a set of spans using a rangefeed. The
// cache provides a consistent snapshot when available, but the snapshot
// may be stale.
type Cache struct {
w *rangefeedcache.Watcher
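
To make this concrete, here is a rough sketch of how the stmtdiagnostics registry could consume such a watcher. The requestEvent and registry types and the onUpdate callback below are hypothetical stand-ins; the real wiring would go through rangefeedcache.Watcher's constructor and its OnUpdateCallback, whose exact signatures aren't reproduced here.

package stmtdiagsketch

import (
	"context"
	"sync"
)

// requestEvent is a hypothetical decoded row from
// system.statement_diagnostics_requests.
type requestEvent struct {
	ID                   int64
	StatementFingerprint string
	Completed            bool
}

// registry mirrors the relevant piece of the stmtdiagnostics registry: an
// in-memory view of outstanding requests, updated from rangefeed events
// instead of a periodic poll.
type registry struct {
	mu       sync.Mutex
	requests map[int64]requestEvent
}

// onUpdate is the kind of callback a rangefeedcache-style watcher invokes with
// a batch of events once the watched span has resolved to some timestamp, so
// incremental updates arrive within a few seconds instead of up to 10s later.
func (r *registry) onUpdate(ctx context.Context, events []requestEvent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, ev := range events {
		if ev.Completed {
			delete(r.requests, ev.ID)
			continue
		}
		r.requests[ev.ID] = ev
	}
}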

@yuzefovich
Member

@rafiss could you take a look at this comment, please? I don't understand why this issue has the "skipped test" label.

@rafiss
Collaborator

rafiss commented Jul 25, 2023

I added the label because I see this in the code:

skip.WithIssue(t, 47893, "tenant clusters do not support SQL features used by this test")

@rafiss
Collaborator

rafiss commented Jul 25, 2023

I don't know if that code is referencing the correct issue. If not, feel free to create a separate issue for tracking the skipped test.

@yuzefovich
Member

I see, thanks. This issue was originally about supporting stmt diagnostics in secondary tenants, but #83547 added that support with some caveats, so this issue was repurposed to be about optimizing the stmt diagnostics feature. I'll remove that skip.

@rafiss linked a pull request Jul 26, 2023 that will close this issue
craig bot pushed a commit that referenced this issue Jul 26, 2023
107493: ui,build: push cluster-ui assets into external folder during watch mode r=nathanstilwell a=sjbarag

Previously, watch mode builds of cluster-ui (e.g. 'dev ui watch' or 'pnpm build:watch') would emit files only to pkg/ui/workspaces/cluster-ui/dist. Using that output in a watch task of a private repo required setting up symlinks via a 'make' task[1]. Unfortunately, that symlink made it far too easy for the node module resolution algorithm in that private repo to follow the symlink back to cockroach.git, which gave that project access to the modules in pkg/ui/node_modules/ and pkg/ui/workspaces/cluster-ui/node_modules. This resulted in webpack finding multiple copies of react-router (which expects to be a singleton), typescript finding multiple incompatible versions of react, etc.

Unfortunately, webpack doesn't support multiple output directories natively. Add a custom webpack plugin that copies emitted files to an arbitrary number of output directories.

[1] pnpm link doesn't work due to some package-name aliasing we've got
    going on there.

Release note: None
Epic: none

107555: sql: remove stale skip in TestTelemetry r=yuzefovich a=yuzefovich

This commit unskips multiple telemetry tests that were skipped for no good reason (they were referencing an unrelated issue). This uncovered some bugs in the new schema changer telemetry reporting, where `_index` was duplicated in the feature counter for inverted indexes. Also, the `index` telemetry test contained an invalid statement, which has now been removed.

The only file that is still skipped is `sql-stats` where the output doesn't match the expectations, and I'm not sure whether the test is stale or something is broken, so a separate issue was filed.

Addresses: #47893.
Epic: None.

Release note: None

107597: builtins: force production values in TestSerialNormalizationWithUniqueUnorderedID r=yuzefovich a=yuzefovich

We've observed that if the `batch-bytes-limit` value is set too low, then the "key counts" query in this test takes much longer (on my laptop it was 60s for a particular random seed vs 2.4s with production values), so this commit forces some production values.

Fixes: #106829.

Release note: None

Co-authored-by: Sean Barag <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
craig bot closed this as completed in #107555 Jul 26, 2023
@yuzefovich
Member

The issue about implementing the rangefeed to reduce latency is still present.

@yuzefovich reopened this Jul 26, 2023
@yuzefovich
Member

@maryliag looks like you removed this from the cluster observability project, but I think it'd be up to your team to address this issue (i.e. implementing the range feed as Irfan suggested here). Do you agree? I'll update the issue description accordingly.
