Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

obs: export metrics about Go GC Assist work #118875

Merged
merged 1 commit into from
Feb 21, 2024
Merged

Conversation

lyang24
Copy link
Contributor

@lyang24 lyang24 commented Feb 7, 2024

This commit introduced functions to extract exposed metrics in go runtime
metrics api. The runtime metrics is sampled along in SampleEnvironment call
every 10 seconds. New metric GcAssistNS is added as an estimate to amount
of effort of user go routines assist in gc activities in nanoseconds.

Fixes: #88178

Relase note: None

Copy link

blathers-crl bot commented Feb 7, 2024

Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

I have added a few people who may be able to assist in reviewing:

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added O-community Originated from the community X-blathers-triaged blathers was able to find an owner labels Feb 7, 2024
@blathers-crl blathers-crl bot requested a review from nvanbenschoten February 7, 2024 04:05
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link

blathers-crl bot commented Feb 8, 2024

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@lyang24 lyang24 marked this pull request as ready for review February 8, 2024 06:47
@lyang24 lyang24 requested a review from a team February 8, 2024 06:47
@lyang24 lyang24 requested a review from a team as a code owner February 8, 2024 06:47
@lyang24 lyang24 requested review from abarganier and removed request for a team February 8, 2024 06:47
@lyang24 lyang24 marked this pull request as draft February 8, 2024 20:59
@lyang24
Copy link
Contributor Author

lyang24 commented Feb 8, 2024

checking out the test failures

Copy link

blathers-crl bot commented Feb 11, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)


pkg/server/status/runtime.go line 344 at r1 (raw file):

// in the metricSamples field.
func (grm *GoRuntimeSampler) sampleRuntimeMetrics() {
	metrics.Read(grm.metricSamples)

The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.

I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.

https://github.com/lyang24/cockroach/blob/3716ff39c4054d2dd7315897d0d077dc018b6d87/pkg/server/tenant.go#L1278-L1280

https://github.com/golang/go/blob/4a7f3ac8eb4381ea62caa1741eeeec28363245b4/src/runtime/metrics/sample.go#L27C1-L44C52


pkg/server/status/runtime.go line 357 at r1 (raw file):

	metricSamples := make([]metrics.Sample, len(metricNames))
	metricIndexes := make(map[string]int, len(metricNames))
	for i, n := range metricNames {

One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier

Copy link
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)


pkg/server/status/runtime.go line 704 at r1 (raw file):

	gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur
	runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines()
	gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)

The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?

Copy link

blathers-crl bot commented Feb 11, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link

blathers-crl bot commented Feb 12, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • Please ensure your git commit message contains a release note.
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Copy link

blathers-crl bot commented Feb 12, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

Thank you for updating your pull request.

My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@lyang24 lyang24 marked this pull request as ready for review February 12, 2024 17:31
@lyang24
Copy link
Contributor Author

lyang24 commented Feb 12, 2024

cannot figure out the experimental test failures, seems like a directory on the github action host is missing

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 5 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)


-- commits line 6 at r2:
GcAssistSecond


pkg/server/status/runtime.go line 344 at r1 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

The alternative here is to always construct the sample array here and read from metrics and then return a map of metric name to metric values. It may be inefficient (it should be ok since we are running this sampling loop every 10s) this approach guarantees 100% correctness.

I am a little concerned with RuntimeStatSampler maybe shared in tenants.go, not clear if the current approach could run into a data race.

https://github.com/lyang24/cockroach/blob/3716ff39c4054d2dd7315897d0d077dc018b6d87/pkg/server/tenant.go#L1278-L1280

https://github.com/golang/go/blob/4a7f3ac8eb4381ea62caa1741eeeec28363245b4/src/runtime/metrics/sample.go#L27C1-L44C52

RuntimeStatSampler can be shared across tenants, but only so that each of those tenants has access to the metric fields. What you have here seems appropriate.


pkg/server/status/runtime.go line 357 at r1 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

One idea is to construct sample array that contains all available metrics. In future if we want to get additional metrics it would be easier

I think what you have here makes sense, because we don't want to pay for metrics in the metrics.Read call that we aren't using.


pkg/server/status/runtime.go line 704 at r1 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

The GcAssist metrics is exposed as float64 in seconds, should we do a conversion to nano seconds here?

I think that would make sense, to more closely mirror metrics like sys.gc.pause.ns.


pkg/server/status/runtime.go line 98 at r2 (raw file):

		Unit:        metric.Unit_PERCENT,
	}
	metaGCAssitSeconds = metric.Metadata{

"Assist"


pkg/server/status/runtime.go line 99 at r2 (raw file):

	}
	metaGCAssitSeconds = metric.Metadata{
		Name:        "sys.gc.assit.seconds",

"assist"


pkg/server/status/runtime.go line 100 at r2 (raw file):

	metaGCAssitSeconds = metric.Metadata{
		Name:        "sys.gc.assit.seconds",
		Help:        "Estimated total CPU time goroutines spent to assist the GC process",

"user goroutines"


pkg/server/status/runtime.go line 303 at r2 (raw file):

var getCgoMemStats func(context.Context) (uint, uint, error)

// Estimated total CPU time goroutines spent performing GC tasks to assist the GC and prevent it from falling behind the

nit: wrap at 80 chars.


pkg/server/status/runtime.go line 306 at r2 (raw file):

// application. This metric is an overestimate, and not directly comparable to system CPU time measurements. Compare
// only with other /cpu/classes metrics.
const GcAssist = "/cpu/classes/gc/mark/assist:cpu-seconds"

This name is a little too broad to be in the global namespace of this package. And it also doesn't need to be exported. How about runtimeMetricGCAssist?


pkg/server/status/runtime.go line 314 at r2 (raw file):

	// The collection of metrics we want to sample.
	metricSamples []metrics.Sample
	// The mapping to find metric slot in runtimeSamples by name.

s/runtimeSamples/metricSamples/


pkg/server/status/runtime.go line 318 at r2 (raw file):

}

// getIndex find postion of metrics in the sample array by name.

"finds the position"


pkg/server/status/runtime.go line 327 at r2 (raw file):

}

// getFloat64 get the sampled value by metrics name as getFloat64.

"gets"


pkg/server/status/runtime.go line 327 at r2 (raw file):

}

// getFloat64 get the sampled value by metrics name as getFloat64.

s/getFloat64/as a float64/


pkg/server/status/runtime.go line 337 at r2 (raw file):

// in the metricSamples field.
func (grm *GoRuntimeSampler) sampleRuntimeMetrics() {
	metrics.Read(grm.metricSamples)

Have you taken a look at how long this call takes?


pkg/server/status/runtime.go line 485 at r2 (raw file):

		log.Ops.Errorf(ctx, "could not get initial network stat counters: %v", err)
	}
	runtimeMetrics := []string{GcAssist}

Let's move this to a global, right up below GcAssist.


pkg/server/status/runtime.go line 587 at r2 (raw file):

// The CGoMemStats should be provided via GetCGoMemStats().
func (rsr *RuntimeStatSampler) SampleEnvironment(
	ctx context.Context, ms *GoMemStats, cs *CGoMemStats,

Have we taken a look at which runtime metrics we'll need to use to eliminate the use of runtime.ReadMemStats, in order to address #88178 (comment)? Do all of those metrics exist in runtime/metrics?


pkg/server/status/runtime.go line 697 at r2 (raw file):

	gcPauseRatio := float64(uint64(gc.PauseTotal)-rsr.last.gcPauseTime) / dur
	runnableSum := goschedstats.CumulativeNormalizedRunnableGoroutines()
	gcAssitSeconds := rsr.goRuntimeSampler.getFloat64(GcAssist)

"assist" here and everywhere else 😃

Copy link
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the detailed review. I will take one more look tmr and send it for another look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)


pkg/server/status/runtime.go line 337 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have you taken a look at how long this call takes?

added benchmark in the comment


pkg/server/status/runtime.go line 587 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have we taken a look at which runtime metrics we'll need to use to eliminate the use of runtime.ReadMemStats, in order to address #88178 (comment)? Do all of those metrics exist in runtime/metrics?

yes I think i have found the metrics that have the same semantics as those fields used in memstats but i haven't yet verified the go runtime impl to see if they have exactly the same impl.

This is the mapping of memstats usage to runtime metrics in my mind
GoAllocBytes: ms.HeapAlloc => /gc/heap/allocs:bytes
HeapFragmentBytes: ms.HeapInuse - ms.HeapAlloc => /memory/classes/heap/unused:bytes
HeapReservedBytes: ms.HeapIdle - ms.HeapReleased => /memory/classes/heap/free:bytes
HeapReleasedBytes: ms.HeapReleased => /memory/classes/heap/released:bytes
MemStackSysBytes: ms.StackSys => /memory/classes/heap/stacks:bytes
goTotal := ms.Sys - ms.HeapReleased => /memory/classes/total:bytes

Copy link
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)


pkg/util/log/eventpb/health_events.proto line 69 at r3 (raw file):

  // Estimated total CPU time user goroutines spent performing GC tasks to
  // assist the GC. Expressed in nanoseconds.
  uint64 gc_assist_ns = 20 [(gogoproto.customname) = "GCAssistNs", (gogoproto.jsontag) = ",omitempty"];

does it make sense to add this metrics to health_events

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @lyang24)


-- commits line 6 at r3:
"capture represent"


pkg/server/status/runtime.go line 337 at r2 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

added benchmark in the comment

Thanks for doing that!


pkg/util/log/eventpb/health_events.proto line 69 at r3 (raw file):

Previously, lyang24 (Lanqing Yang) wrote…

does it make sense to add this metrics to health_events

I don't think it needs to be added here. gc_pause_ns is not part of this either. It also looks like there are some compatibility concerns we'll need to contend with if we add the metric to health_events — see doc.go. So I'm 👍 on removing it from the PR.

This commit introduced functions to extract exposed metrics in go runtime
metrics api. The runtime metrics is sampled along in SampleEnvironment call
every 10 seconds. New metric GcAssistNS is added as an estimate to amount
of effort of user go routines assist in gc activities in nanoseconds.

Fixes: cockroachdb#88178

Release note: None
Copy link
Contributor Author

@lyang24 lyang24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! I will replace go memstats with runtime metrics in the following pr

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @nvanbenschoten)


-- commits line 6 at r3:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"capture represent"

done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @abarganier)

@lyang24
Copy link
Contributor Author

lyang24 commented Feb 21, 2024

bors r+

@craig
Copy link
Contributor

craig bot commented Feb 21, 2024

Build succeeded:

@craig craig bot merged commit 481fa5c into cockroachdb:master Feb 21, 2024
8 of 23 checks passed
@nvanbenschoten
Copy link
Member

@lyang24 do you want to open a follow-up PR to add this new metric to the "Runtime" dashboard in the UI, right below the "GC Pause Time" graph?

lyang24 added a commit to lyang24/cockroach that referenced this pull request Feb 22, 2024
This commit adds a line chart contains gc assist duration on the runtime page.
The goal is to present the estimated time user go routines spend on assisting
gc tasks.

Informs: cockroachdb#118875

Release note: None
craig bot pushed a commit that referenced this pull request Mar 4, 2024
119506: ui: add charts on gc assist metric r=lyang24 a=lyang24

This commit adds a line chart contains gc assist duration on the runtime page. The goal is to present the estimated time user go routines spend on assisting gc tasks.

Informs: #118875

Release note: None

gc assist on crdb single node with kv workload gcttl=5s for database in (kv, system)
![image](https://github.com/cockroachdb/cockroach/assets/20375035/b88882d0-8390-4fe4-8343-eb2a7497bc9f)


119877: roachtest: make pg_regress weekly and its failures non-blocking r=yuzefovich a=yuzefovich

This should reduce the amount of toil for us until we iron out most of the differences with postgres.

Epic: None

Release note: None

Co-authored-by: lyang24 <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
nicktrav pushed a commit to nicktrav/cockroach that referenced this pull request Sep 5, 2024
This commit adds a line chart contains gc assist duration on the runtime page.
The goal is to present the estimated time user go routines spend on assisting
gc tasks.

Informs: cockroachdb#118875

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-community Originated from the community X-blathers-triaged blathers was able to find an owner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

obs: export metrics about Go GC Assist work
3 participants