Emit GC metrics at flush time #144
Conversation
One nit.
Besides that, my concern isn't that it'd block flush times; it's more that it'd impact metric collection times, since this performs a full dump of all goroutines. On the globalstats boxes this isn't a huge problem, but since Veneur submits metrics to itself (running in the same process), we could impact Veneur's ability to collect metrics about itself.
That said, it'd be pretty easy to test that out in QA and see if it'd be a problem.
Also, just in case you forgot, we aren't collecting this in Datadog, but we do have access to runtime debugging information on all globalstats boxes:
https://:8200/debug/pprof/heap?debug=1
flusher.go (outdated diff)
@@ -274,6 +275,14 @@ func (s *Server) reportGlobalMetricsFlushCounts(ms metricsSummary) {
// flushRemote breaks up the final metrics into chunks
// (to avoid hitting the size cap) and POSTs them to the remote API
func (s *Server) flushRemote(finalMetrics []samplers.DDMetric) {
	mem := new(runtime.MemStats)
nit: &runtime.MemStats{}
After reading around (I was curious how you knew that!), I learned that currently reading these stats triggers a stop-the-world collection. Thankfully this was noticed, and there's a patch in the current 1.9 work that brings this down to ~25µs, which is 🎉. The downside is that measuring this as it stands will cause the data to change: since the call itself causes a full GC, we'll artificially influence the heap. Bleh. I'll skip this, but wanted it as a baseline for a patch I'll send tomorrow. Oh well! I'll rely on the benchmark output. :)
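(For context, a benchmark of the kind alluded to here could be as small as the sketch below. This is illustrative only; the package name is arbitrary and this is not necessarily the benchmark the author ran.)

    package veneur_test

    import (
        "runtime"
        "testing"
    )

    // BenchmarkReadMemStats measures the per-call cost of runtime.ReadMemStats,
    // the call whose stop-the-world pause the Go 1.9 work mentioned above
    // brings down to roughly 25µs. Run with: go test -bench=ReadMemStats
    func BenchmarkReadMemStats(b *testing.B) {
        var mem runtime.MemStats
        for i := 0; i < b.N; i++ {
            runtime.ReadMemStats(&mem)
        }
    }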
Reopening this so it's here when 1.9 ships in ~August…
Branch updated from 6afefc3 to 85b2430.
Gerald Rule: Copy Observability on Veneur and Unilog pull requests. cc @stripe/observability
This got merged elsewhere.
Summary
Emit GC metrics at each remote metrics flush.
Adds the following metrics (a rough sketch of how they are read follows the list):

- veneur.mem.heap_alloc_bytes, tracking the size of the allocated heap in bytes
- veneur.gc.number, tracking the number of GC runs
- veneur.gc.pause_total_ns, tracking the total duration of GC pauses
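All three values presumably come from runtime.MemStats (the diff above allocates one). A minimal sketch of how they could be read and emitted as gauges follows; this is illustrative only: the reportGCMetrics helper, the DogStatsD client from github.com/DataDog/datadog-go/statsd, the metric types, and the 127.0.0.1:8126 address are assumptions, while the actual change lives in flushRemote in flusher.go.

    package main

    import (
        "log"
        "runtime"

        "github.com/DataDog/datadog-go/statsd"
    )

    // reportGCMetrics reads the runtime memory statistics once and emits the
    // three metrics listed above. The emission path inside Veneur's own
    // flushRemote may differ; this only shows where the values come from.
    func reportGCMetrics(client *statsd.Client) {
        mem := &runtime.MemStats{}
        // Per the review discussion, the Go 1.9 work brings the
        // stop-the-world pause for this call down to ~25µs, so calling it
        // once per flush is cheap.
        runtime.ReadMemStats(mem)

        client.Gauge("mem.heap_alloc_bytes", float64(mem.HeapAlloc), nil, 1.0)
        client.Gauge("gc.number", float64(mem.NumGC), nil, 1.0)
        client.Gauge("gc.pause_total_ns", float64(mem.PauseTotalNs), nil, 1.0)
    }

    func main() {
        // Placeholder address for a local DogStatsD/Veneur listener.
        client, err := statsd.New("127.0.0.1:8126")
        if err != nil {
            log.Fatal(err)
        }
        client.Namespace = "veneur."
        reportGCMetrics(client)
    }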
Motivation
Now that we're using 1.9 RCs for our deploys, it's safe to merge this. In past versions Go would do a Stop-The-World collection when we asked for these stats. 1.9 fixes that.
We're adding this because, before now, it was difficult to measure the impact that changes to Veneur's code had on production machines.
With these metrics we can determine whether changes such as increasing ringbuffer sizes for traces or using a new HLL implementation in #190 have an impact on either the amount of memory allocated or the amount of time spent GCing.
These metrics will be invaluable as Veneur grows!
Test plan
Just metric emission, so should be fine.
Rollout/monitoring/revert plan
Deploy to globals and monitor. Eventually merge to the fleet and let it naturally work its way out.
r? @aditya-stripe