
insights: ingester #85350

Merged 1 commit, Aug 11, 2022

Conversation

@matthewtodd (Contributor) commented Jul 29, 2022

Closes #81021.

Here we begin observing statements and transactions asynchronously, to
avoid slowing down the hot SQL execution path as much as possible.

Release note: None

@cockroach-teamcity (Member)

This change is Reviewable

@matthewtodd requested a review from a team, July 29, 2022 21:14
@j82w (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w and @matthewtodd)


pkg/sql/sqlstats/insights/ingester.go line 33 at r1 (raw file):

type concurrentBufferIngester struct {
	guard struct {

What is the reason for using the ConcurrentBufferGuard instead of just a normal Channel? Channels are thread safe so I'm trying to understand what benefits it gives.


pkg/sql/sqlstats/insights/ingester.go line 39 at r1 (raw file):

	sink     chan *block
	delegate Registry

Would it be better to have a channel per registry? That way, if one is running slow or its buffer is full, the ingester could skip that input for that registry without impacting the other registries.


pkg/sql/sqlstats/insights/ingester.go line 92 at r1 (raw file):

	sessionID clusterunique.ID, statement *Statement,
) {
	i.guard.AtomicWrite(func(writerIdx int64) {

Should there be a check for whether the buffer is full, logging a message instead of blocking, in case the registry is running too slowly?

@matthewtodd (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @j82w)


pkg/sql/sqlstats/insights/ingester.go line 33 at r1 (raw file):

Previously, j82w wrote…

What is the reason for using the ConcurrentBufferGuard instead of just a normal Channel? Channels are thread safe so I'm trying to understand what benefits it gives.

Good question! A channel would be way simpler, but we're trying to be super fast here since we're on the SQL execution path. I ran some benchmarks, and the ConcurrentBufferGuard approach is about 7.5 times faster than just using a channel. (140ns/op vs. 1100ns/op on my laptop.)

There's a lot more goroutine coordination in the channel approach; ConcurrentBufferGuard is built around an atomic int and a read mutex, which is far lighter-weight.
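
To make the comparison concrete, here is a minimal, self-contained sketch of the atomic-index write pattern described above. It is not the actual ConcurrentBufferGuard implementation; the names, buffer size, and drop-on-overflow behavior are made up for illustration.

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const bufferSize = 8192

type event struct{ sessionID, fingerprint string }

type guardedBuffer struct {
	mu      sync.RWMutex // writers share the read lock; flushing takes the write lock
	nextIdx int64        // next free slot, reserved with a single atomic add
	events  [bufferSize]event
	sink    chan []event // filled blocks are handed off here for async processing
}

// record is the hot-path write: one atomic add to claim a slot and one array
// store, with no channel send and no goroutine handoff.
func (b *guardedBuffer) record(e event) {
	b.mu.RLock()
	idx := atomic.AddInt64(&b.nextIdx, 1) - 1
	if idx < bufferSize {
		b.events[idx] = e
	}
	b.mu.RUnlock()
	// The writer that claims the last slot flushes. In this sketch, writes that
	// arrive while the buffer is full are simply dropped; the real guard is
	// more careful than that.
	if idx == bufferSize-1 {
		b.flush()
	}
}

// flush briefly takes the write lock, copies the filled slots onto the sink
// channel, and resets the buffer; this heavier work is amortized over
// bufferSize hot-path writes.
func (b *guardedBuffer) flush() {
	b.mu.Lock()
	defer b.mu.Unlock()
	n := atomic.LoadInt64(&b.nextIdx)
	if n == 0 {
		return
	}
	if n > bufferSize {
		n = bufferSize
	}
	out := make([]event, n)
	copy(out, b.events[:n])
	b.sink <- out
	atomic.StoreInt64(&b.nextIdx, 0)
}

func main() {
	b := &guardedBuffer{sink: make(chan []event, 1)}
	b.record(event{sessionID: "s1", fingerprint: "SELECT _"})
	b.flush() // normally triggered by filling the buffer
	fmt.Println("flushed", len(<-b.sink), "event(s)")
}

The per-observation cost in this sketch is an atomic add plus an array store under a shared read lock, while the channel send happens only once per filled buffer, which is the sort of amortization the benchmark comparison above is getting at.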


pkg/sql/sqlstats/insights/ingester.go line 39 at r1 (raw file):

Previously, j82w wrote…

Would it be better to have a channel per registry? That way, if one is running slow or its buffer is full, the ingester could skip that input for that registry without impacting the other registries.

There is only one Registry instance. I have been hoping that will be sufficient, since keeping the inner implementation "global" and single-threaded makes it easy to reason about and unit test. The hope is for the ingester to handle all the async / parallel work, feeding observations into it.

I think you're right, though, that there's perhaps more to understand and design for in how this thing behaves when it's overwhelmed. At the moment, some micro-benchmarking here shows performance holding steady at 140ns/op with 1K concurrent sessions hammering the ingester, getting slightly slower at 10K concurrent sessions, so I feel confident enough to subject this thing to the roachtests as our next step.
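
As a companion to the sketch above, here is a rough illustration of the consumer side being described, again not the PR's actual code: a single goroutine drains blocks from the sink and feeds them to the lone Registry, keeping everything downstream single-threaded. The Registry interface and its ObserveStatement method are assumptions made for this example.

package main

import (
	"context"
	"fmt"
	"time"
)

type Statement struct{ Fingerprint string }

// Registry stands in for the single insights registry; the real interface
// lives in pkg/sql/sqlstats/insights and is assumed here, not quoted.
type Registry interface {
	ObserveStatement(sessionID string, stmt *Statement)
}

type event struct {
	sessionID string
	statement *Statement
}

type block [8192]event

type ingester struct {
	sink     chan *block
	delegate Registry
}

// Start launches the drain loop. Everything past the sink runs on this one
// goroutine, which is what keeps the Registry easy to reason about and to
// unit test.
func (i *ingester) Start(ctx context.Context) {
	go func() {
		for {
			select {
			case b := <-i.sink:
				for _, e := range b {
					if e.statement == nil {
						break // unused tail of a partially filled block
					}
					i.delegate.ObserveStatement(e.sessionID, e.statement)
				}
			case <-ctx.Done():
				return
			}
		}
	}()
}

// printingRegistry is a stub delegate used only to demonstrate the flow.
type printingRegistry struct{}

func (printingRegistry) ObserveStatement(sessionID string, stmt *Statement) {
	fmt.Println(sessionID, stmt.Fingerprint)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ing := &ingester{sink: make(chan *block, 4), delegate: printingRegistry{}}
	ing.Start(ctx)

	var b block
	b[0] = event{sessionID: "s1", statement: &Statement{Fingerprint: "SELECT _"}}
	ing.sink <- &b

	time.Sleep(10 * time.Millisecond) // crude wait so the demo's drain loop can run
}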


pkg/sql/sqlstats/insights/ingester.go line 92 at r1 (raw file):

Previously, j82w wrote…

Should there be a check for whether the buffer is full, logging a message instead of blocking, in case the registry is running too slowly?

At this level, I think we can let the ConcurrentBufferGuard handle that for us. While it will lock to flush the buffer into the channel, I think we're okay with amortizing that cost. And, again, I think the micro-benchmarks here do give us some reasonable first-order confidence that the Registry has more than enough performance capacity to keep up.

But, yes, I think some safety releases probably make sense for us to experiment with next week.
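
For a sense of what such a safety valve might look like, here is a hypothetical sketch (not something this PR implements): if the sink channel is full because the consumer has fallen behind, the block is dropped and counted rather than blocking the SQL execution path.

package main

import (
	"fmt"
	"sync/atomic"
)

type block [8192]string // stand-in for a filled block of observations

type ingester struct {
	sink    chan *block
	dropped int64 // blocks discarded because the consumer fell behind
}

// offer hands a full block to the consumer if there is room and otherwise
// drops it, so a slow Registry can never stall writers. A rate-limited log
// message could be emitted alongside the counter.
func (i *ingester) offer(b *block) {
	select {
	case i.sink <- b:
	default:
		atomic.AddInt64(&i.dropped, 1)
	}
}

func main() {
	ing := &ingester{sink: make(chan *block, 1)}
	ing.offer(&block{}) // fits in the buffered channel
	ing.offer(&block{}) // channel is full, so this block is dropped
	fmt.Println("dropped blocks:", atomic.LoadInt64(&ing.dropped))
}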

@matthewtodd marked this pull request as ready for review, August 11, 2022 14:44
@matthewtodd requested a review from a team, August 11, 2022 14:44
@j82w (Contributor) left a comment

:lgtm:

Reviewed 3 of 5 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @matthewtodd)

@matthewtodd (Contributor, Author)

bors r+

@craig (craig bot) commented Aug 11, 2022

Build succeeded:

@craig (craig bot) merged commit 8e3ee57 into cockroachdb:master, Aug 11, 2022
@matthewtodd deleted the insights-ingester branch, August 11, 2022 18:35
@renatolabs (Contributor) commented Aug 11, 2022

This change introduced a data race on TestConsumeJoinToken (#85988). I'm going to create a PR shortly to skip that test under race for the time being.

Edit: PR to skip the test: #85989.

@msbutler (Collaborator) commented Aug 11, 2022

This PR also introduces a consistent data race in my PR, which was ready to merge until I rebased onto this. Should we revert this PR, as it seems to be exposing the codebase to data races? My data race stack also points to #83080.

@matthewtodd (Contributor, Author)

I've just sent #85994 to disable this problematic codepath until I can get to the bottom of it. I'm sorry for the disruption!

@msbutler (Collaborator)

thanks!

craig bot pushed a commit that referenced this pull request Aug 11, 2022
85994: insights: remove the ingester for now r=matthewtodd a=matthewtodd

In #85350 we introduced a data race that's affecting many branches in
CI. Until we can get to the bottom of it, probably in #83080, let's just
remove the offending codepath.

Release note: None

Co-authored-by: Matthew Todd <[email protected]>
Successfully merging this pull request may close these issues:

outliers: move processing off the hot path