Add Reset Transformation #2794
Conversation
@@ -2,6 +2,9 @@
mock_*.go linguist-generated
*_mock.go linguist-generated

# genny files
Eventually I want to fix how we generate genny files to include DO NOT EDIT, which should automatically be excluded from GitHub diffs. For now this quick solution works.
@@ -1,5 +1,12 @@
#!/bin/bash
source "${GOPATH}/src/github.com/m3db/m3/.ci/variables.sh"

if [ -z "$GOPATH" ]; then
Since GOPATH is deprecated, this seems like a reasonable solution.
@@ -468,10 +468,13 @@ func (e *GenericElem) processValueWithAggregationLock(
	discardNaNValues = e.opts.DiscardNaNAggregatedValues()
)
for aggTypeIdx, aggType := range e.aggTypes {
	toFlush := make([]transformation.Datapoint, 0)
Can we place this slice somewhere on the element or in the locked aggregation so we can reuse it each flush? Ideally we don't need to allocate this on every flush for every single series. Alternatively, can we just create two local variables, toFlushDefault and toFlushSecondary?
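A minimal, self-contained sketch of the reuse idea being suggested here, with a hypothetical stand-in type and field (elem and toFlush are illustrative names, not from this PR):

package main

import "fmt"

// Datapoint mirrors the shape of transformation.Datapoint in the diff
// above (illustrative only).
type Datapoint struct {
	TimeNanos int64
	Value     float64
}

// elem is a hypothetical stand-in for GenericElem: it keeps one scratch
// slice that every flush reuses instead of allocating a fresh one per
// series per flush.
type elem struct {
	toFlush []Datapoint // reused across flushes; guard with the element lock in real code
}

func (e *elem) flush(dps ...Datapoint) {
	e.toFlush = e.toFlush[:0] // truncate but keep the backing array
	e.toFlush = append(e.toFlush, dps...)
	fmt.Printf("flushed %d datapoints (cap %d)\n", len(e.toFlush), cap(e.toFlush))
}

func main() {
	e := &elem{}
	e.flush(Datapoint{TimeNanos: 1, Value: 10})
	e.flush(Datapoint{TimeNanos: 2, Value: 11}) // reuses the array from the first flush
}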
// aggregated counters.
func transformReset() UnaryMultiOutputTransform {
	return UnaryMultiOutputTransformFn(func(dp Datapoint) (Datapoint, []Datapoint) {
		return dp, []Datapoint{{Value: 0, TimeNanos: dp.TimeNanos + int64(time.Second)}}
Given this always returns a fixed number of datapoints, perhaps we just special-case the return type to be just one datapoint for now?
Concerned about this slice being heap allocated; we can check by running go build -gcflags="-m" . and seeing what comes up for this file and whether the slice escapes to the heap or not.
I changed it to return a single datapoint.
However, I'm still allocating a two-element slice when processing the flush; it just made the code easier. From using go build -gcflags="-m",
the slice is not escaping. However, the previous code was also not escaping with
toFlush := make([]transformation.Datapoint, 0)
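For reference, a self-contained sketch of the single-datapoint shape discussed above; the function signature here is a simplified stand-in, not the exact UnaryMultiOutputTransform interface that was merged:

package main

import (
	"fmt"
	"time"
)

// Datapoint mirrors the shape in the diff above (illustrative only).
type Datapoint struct {
	TimeNanos int64
	Value     float64
}

// transformReset returns the input datapoint unchanged plus a single
// extra zero datapoint one second later. Returning a value instead of a
// slice means the extra datapoint never needs to escape to the heap.
func transformReset() func(dp Datapoint) (Datapoint, Datapoint) {
	return func(dp Datapoint) (Datapoint, Datapoint) {
		return dp, Datapoint{Value: 0, TimeNanos: dp.TimeNanos + int64(time.Second)}
	}
}

func main() {
	reset := transformReset()
	this, other := reset(Datapoint{TimeNanos: time.Now().UnixNano(), Value: 42})
	fmt.Println(this.Value, other.Value) // 42 0
}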
…all-reset-agg-counters
this, other := reset.Evaluate(dp)
require.Equal(t, dp, this)
require.Equal(t, Datapoint{Value: 0, TimeNanos: now.Add(time.Second).UnixNano()}, other)
nit: Remove empty line?
LGTM
…all-reset-agg-counters
Codecov Report

@@            Coverage Diff             @@
##           master   #2794       +/-   ##
==========================================
- Coverage    72.1%   32.4%     -39.8%
==========================================
  Files        1115     731       -384
  Lines      100626   76036     -24590
==========================================
- Hits        72624   24654     -47970
- Misses      23058   48909     +25851
+ Partials     4944    2473      -2471
==========================================
Flags with carried forward coverage won't be shown.
* master:
  Support dynamic namespace resolution for embedded coordinators (#2815)
  [dbnode] Add index regexp DFA compilation cache to avoid allocating DFAs for same expressions (#2814)
  [dbnode] Set default cache on retrieve to false, prepare testing with cache none (#2813)
  [tools] Output annotations as base64 (#2743)
  Add Reset Transformation (#2794)
  [large-tiles] Fix for a reverse index when querying downsampled namespace (#2808)
  [dbnode] Evict info files cache before index bootstrap in peers bootstrapper (#2802)
  [dbnode] Bump default filesystem persist rate limit (#2806)
  [coordinator] Add metrics and configurability of downsample and ingest writer worker pool (#2797)
  [dbnode] Fix concurrency granularity of seekerManager.UpdateOpenLease (#2790)
  [dbnode] Use correct import path for atomic library (#2801)
  Regenerate carbon automapping rules on namespaces changes (#2793)
  Enforce matching retention and index block size (#2783)
  [query] Add ClusterNamespacesWatcher and use to generate downsample automapper rules (#2782)
  [query][dbnode] Wire up etcd-backed Clusters implementation in coordinator (#2758)
  [query] Fix quantile() argument not being passed through (#2780)

# Conflicts:
#	src/query/parser/promql/matchers.go
* master:
  [query] Improve precision for variance and stddev of equal values (#2799)
  Support dynamic namespace resolution for embedded coordinators (#2815)
  [dbnode] Add index regexp DFA compilation cache to avoid allocating DFAs for same expressions (#2814)
  [dbnode] Set default cache on retrieve to false, prepare testing with cache none (#2813)
  [tools] Output annotations as base64 (#2743)
  Add Reset Transformation (#2794)
  [large-tiles] Fix for a reverse index when querying downsampled namespace (#2808)
  [dbnode] Evict info files cache before index bootstrap in peers bootstrapper (#2802)
  [dbnode] Bump default filesystem persist rate limit (#2806)
  [coordinator] Add metrics and configurability of downsample and ingest writer worker pool (#2797)
  [dbnode] Fix concurrency granularity of seekerManager.UpdateOpenLease (#2790)
  [dbnode] Use correct import path for atomic library (#2801)
  Regenerate carbon automapping rules on namespaces changes (#2793)
  Enforce matching retention and index block size (#2783)
  [query] Add ClusterNamespacesWatcher and use to generate downsample automapper rules (#2782)
  [query][dbnode] Wire up etcd-backed Clusters implementation in coordinator (#2758)
  [query] Fix quantile() argument not being passed through (#2780)
  [query] Add "median" aggregation to Graphite aggregate() function (#2774)
  [r2] Ensure KeepOriginal is propagated when adding/reviving rules (#2796)
  [dbnode] Update default db read limits (#2784)
What this PR does / why we need it:
This transformation resets an aggregated Prometheus counter by emitting
a zero. It fixes a display issue when an M3Aggregator failover occurs.
When the Add transformation is used to generate a cumulative counter for
Prometheus, the values in the leader and follower differ, since they
started counting at different times. When a failover happens, the value
might decrease, and when a cumulative counter decreases, PromQL assumes
the counter reset. This leads to strange display issues, since the counter
did not actually reset. By explicitly resetting the counter each resolution
period instead of accumulating it, PromQL can correctly graph the rate
across failovers. The downside is an extra 0 datapoint per resolution
period. The storage cost is more than just the extra 0, because the extra
0 is stored 1 second after the actual datapoint; this degrades the
timestamp encoding, since the timestamps are no longer at a fixed interval.
In practice we see a 3x increase in storage for these aggregated counters.
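To make the failover issue concrete, here is a toy, self-contained Go sketch of why the leader's and follower's cumulative counters diverge; all numbers are invented for illustration:

package main

import "fmt"

// Toy illustration of the failover problem described above: leader and
// follower see the same per-interval increments, but the follower started
// counting later, so its cumulative total is smaller.
func main() {
	increments := []float64{5, 3, 7, 2}

	var leader, follower float64
	for i, inc := range increments {
		leader += inc
		if i >= 1 { // follower came up one interval late
			follower += inc
		}
	}
	fmt.Println("leader:", leader, "follower:", follower) // leader: 17 follower: 12

	// On failover the emitted series drops from 17 to 12. PromQL treats any
	// decrease in a cumulative counter as a reset, so rate() is miscomputed.
	// Emitting an explicit zero after each aggregated value avoids this:
	// each datapoint then stands alone, and a failover causes no false reset.
}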
Special notes for your reviewer:
Does this PR introduce a user-facing and/or backwards incompatible change?:
None
Does this PR require updating code package or user-facing documentation?:
None