TSDB: Fix some edge cases when OOO is enabled (prometheus#14710)
 Fix some edge cases when OOO is enabled

Signed-off-by: Vanshikav123 <[email protected]>
Signed-off-by: Vanshika <[email protected]>
Signed-off-by: Jesus Vazquez <[email protected]>
Co-authored-by: Jesus Vazquez <[email protected]>
Vanshikav123 and jesusvazquez authored Oct 23, 2024
1 parent 7c7116f commit cccbe72
Showing 15 changed files with 388 additions and 10 deletions.
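The core of the change is a per-appender option, `DiscardOutOfOrder`, that callers set before appending staleness markers or rule results. A minimal sketch of the intended effect, assuming a toy single-series appender: `toyAppender`, its `oooWindow` field, and `errOutOfOrder` are hypothetical names for illustration, not the TSDB head appender. Only the `AppendOptions` shape follows the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// AppendOptions mirrors the shape of storage.AppendOptions as used in this diff.
type AppendOptions struct {
	DiscardOutOfOrder bool
}

// errOutOfOrder is a hypothetical stand-in for storage.ErrOutOfOrderSample.
var errOutOfOrder = errors.New("out of order sample")

// toyAppender is a hypothetical single-series appender, not the real TSDB one.
type toyAppender struct {
	maxT      int64 // newest timestamp accepted so far
	oooWindow int64 // how far behind maxT a late sample may be and still be accepted
	opts      *AppendOptions
}

// SetOptions stores the options used by subsequent Append calls; nil clears them.
func (a *toyAppender) SetOptions(opts *AppendOptions) { a.opts = opts }

// Append accepts in-order samples, and late samples inside the window
// unless DiscardOutOfOrder is set.
func (a *toyAppender) Append(t int64) error {
	if t > a.maxT {
		a.maxT = t
		return nil
	}
	if a.opts != nil && a.opts.DiscardOutOfOrder {
		return errOutOfOrder
	}
	if a.maxT-t <= a.oooWindow {
		return nil
	}
	return errOutOfOrder
}

func main() {
	app := &toyAppender{oooWindow: 100}
	fmt.Println(app.Append(1000)) // in-order sample
	fmt.Println(app.Append(950))  // late, but inside the window
	app.SetOptions(&AppendOptions{DiscardOutOfOrder: true})
	fmt.Println(app.Append(960)) // late, rejected because of the option
}
```

In this sketch the option only narrows what is accepted: a late sample that the window would normally admit is turned back into an out-of-order error, which is the case the scrape loop in the diff already treats as expected.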
92 changes: 89 additions & 3 deletions CHANGELOG.md
@@ -3,8 +3,9 @@
## unreleased

* [CHANGE] Scraping: Remove implicit fallback to the Prometheus text format in case of invalid/missing Content-Type and fail the scrape instead. Add ability to specify a `fallback_scrape_protocol` in the scrape config. #15136
* [BUGFIX] PromQL: Fix stddev+stdvar aggregations to always ignore native histograms. #14941
* [BUGFIX] PromQL: Fix stddev+stdvar aggregations to treat Infinity consistently. #14941
* [ENHANCEMENT] Scraping, rules: handle targets reappearing, or rules moving group, when out-of-order is enabled. #14710
- [BUGFIX] PromQL: Fix stddev+stdvar aggregations to always ignore native histograms. #14941
- [BUGFIX] PromQL: Fix stddev+stdvar aggregations to treat Infinity consistently. #14941

## 3.0.0-beta.1 / 2024-10-09

@@ -20,7 +21,6 @@
* [ENHANCEMENT] PromQL: Introduce exponential interpolation for native histograms. #14677
* [ENHANCEMENT] TSDB: Add support for ingestion of out-of-order native histogram samples. #14850, #14546
* [ENHANCEMENT] Alerts: remove metrics for removed Alertmanagers. #13909
* [ENHANCEMENT] Scraping: support Created-Timestamp feature on native histograms. #14694
* [ENHANCEMENT] Kubernetes SD: Support sidecar containers in endpoint discovery. #14929
* [ENHANCEMENT] Consul SD: Support catalog filters. #11224
* [PERF] TSDB: Parallelize deletion of postings after head compaction. #14975
@@ -41,6 +41,10 @@ Release 3.0.0-beta.0 includes new features such as a brand new UI and UTF-8 supp

As is traditional with a beta release, we do **not** recommend users install 3.0.0-beta on critical production systems, but we do want everyone to test it out and find bugs.

<<<<<<< HEAD
<<<<<<< HEAD
=======
>>>>>>> b10c3696c (Revert "updated changelog")
* [CHANGE] UI: The old web UI has been replaced by a completely new one that is less cluttered and adds a few new features (PromLens-style tree view, better metrics explorer, "Explain" tab). However, it is still missing some features of the old UI (notably, exemplar display and heatmaps). To switch back to the old UI, you can use the feature flag `--enable-feature=old-ui` for the time being. #14872
* [CHANGE] PromQL: Range selectors and the lookback delta are now left-open, i.e. a sample coinciding with the lower time limit is excluded rather than included. #13904
* [CHANGE] Kubernetes SD: Remove support for `discovery.k8s.io/v1beta1` API version of EndpointSlice. This version is no longer served as of Kubernetes v1.25. #14365
@@ -52,6 +56,7 @@ As is traditional with a beta release, we do **not** recommend users install 3.0
* [CHANGE] Remove deprecated `remote-write-receiver`,`promql-at-modifier`, and `promql-negative-offset` feature flags. #13456, #14526
* [CHANGE] Remove deprecated `storage.tsdb.allow-overlapping-blocks`, `alertmanager.timeout`, and `storage.tsdb.retention` flags. #14640, #14643
* [ENHANCEMENT] Move AM discovery page from "Monitoring status" to "Server status". #14875
<<<<<<< HEAD
* [FEATURE] Support config reload automatically - feature flag `auto-reload-config`. #14769
* [BUGFIX] Scrape: Do not override target parameter labels with config params. #11029

@@ -85,6 +90,87 @@ As is traditional with a beta release, we do **not** recommend users install 3.0
* [BUGFIX] Remote-Write: Return 4xx not 5xx when timeseries has duplicate label. #14716
* [BUGFIX] Experimental Native Histograms: many fixes for incorrect results, panics, warnings. #14513, #14575, #14598, #14609, #14611, #14771, #14821
* [BUGFIX] TSDB: Only count unknown record types in `record_decode_failures_total` metric. #14042
=======
- [CHANGE] UI: The old web UI has been replaced by a completely new one that is less cluttered and adds a few new features (PromLens-style tree view, better metrics explorer, "Explain" tab). However, it is still missing some features of the old UI (notably, exemplar display and heatmaps). To switch back to the old UI, you can use the feature flag `--enable-feature=old-ui` for the time being. #14872
- [CHANGE] PromQL: Range selectors and the lookback delta are now left-open, i.e. a sample coinciding with the lower time limit is excluded rather than included. #13904
- [CHANGE] Kubernetes SD: Remove support for `discovery.k8s.io/v1beta1` API version of EndpointSlice. This version is no longer served as of Kubernetes v1.25. #14365
- [CHANGE] Kubernetes SD: Remove support for `networking.k8s.io/v1beta1` API version of Ingress. This version is no longer served as of Kubernetes v1.22. #14365
- [CHANGE] UTF-8: Enable UTF-8 support by default. Prometheus now allows all UTF-8 characters in metric and label names. The corresponding `utf8-name` feature flag has been removed. #14705
- [CHANGE] Console: Remove example files for the console feature. Users can continue using the console feature by supplying their own JavaScript and templates. #14807
- [CHANGE] SD: Enable the new service discovery manager by default. This SD manager does not restart unchanged discoveries upon reloading. This makes reloads faster and reduces pressure on service discoveries' sources. The corresponding `new-service-discovery-manager` feature flag has been removed. #14770
- [CHANGE] Agent mode has been promoted to stable. The feature flag `agent` has been removed. To run Prometheus in Agent mode, use the new `--agent` cmdline arg instead. #14747
- [CHANGE] Remove deprecated `remote-write-receiver`,`promql-at-modifier`, and `promql-negative-offset` feature flags. #13456, #14526
- [CHANGE] Remove deprecated `storage.tsdb.allow-overlapping-blocks`, `alertmanager.timeout`, and `storage.tsdb.retention` flags. #14640, #14643
- [ENHANCEMENT] Move AM discovery page from "Monitoring status" to "Server status". #14875
- [BUGFIX] Scrape: Do not override target parameter labels with config params. #11029

## 2.55.0-rc.0 / 2024-09-20

- [FEATURE] Support UTF-8 characters in label names - feature flag `utf8-names`. #14482, #14880, #14736, #14727
- [FEATURE] Support config reload automatically - feature flag `auto-reload-config`. #14769
- [FEATURE] Scraping: Add the ability to set custom `http_headers` in config. #14817
- [FEATURE] Scraping: Support feature flag `created-timestamp-zero-ingestion` in OpenMetrics. #14356, #14815
- [FEATURE] Scraping: `scrape_failure_log_file` option to log failures to a file. #14734
- [FEATURE] OTLP receiver: Optional promotion of resource attributes to series labels. #14200
- [FEATURE] Remote-Write: Support Google Cloud Monitoring authorization. #14346
- [FEATURE] Promtool: `tsdb create-blocks` new option to add labels. #14403
- [FEATURE] Promtool: `promtool test` adds `--junit` flag to format results. #14506
- [ENHANCEMENT] OTLP receiver: Warn on exponential histograms with zero count and non-zero sum. #14706
- [ENHANCEMENT] OTLP receiver: Interrupt translation on context cancellation/timeout. #14612
- [ENHANCEMENT] Remote Read client: Enable streaming remote read if the server supports it. #11379
- [ENHANCEMENT] Remote-Write: Don't reshard if we haven't successfully sent a sample since last update. #14450
- [ENHANCEMENT] PromQL: Delay deletion of `__name__` label to the end of the query evaluation. This is **experimental** and enabled under the feature-flag `promql-delayed-name-removal`. #14477
- [ENHANCEMENT] PromQL: Experimental `sort_by_label` and `sort_by_label_desc` sort by all labels when label is equal. #14655
- [ENHANCEMENT] PromQL: Clarify error message logged when Go runtime panic occurs during query evaluation. #14621
- [ENHANCEMENT] PromQL: Use Kahan summation for better accuracy in `avg` and `avg_over_time`. #14413
- [ENHANCEMENT] Tracing: Improve PromQL tracing, including showing the operation performed for aggregates, operators, and calls. #14816
- [ENHANCEMENT] API: Support multiple listening addresses. #14665
- [ENHANCEMENT] TSDB: Backward compatibility with upcoming index v3. #14934
- [PERF] TSDB: Query in-order and out-of-order series together. #14354, #14693, #14714, #14831, #14874, #14948
- [PERF] TSDB: Streamline reading of overlapping out-of-order head chunks. #14729
- [BUGFIX] SD: Fix dropping targets (with feature flag `new-service-discovery-manager`). #13147
- [BUGFIX] SD: Stop storing stale targets (with feature flag `new-service-discovery-manager`). #13622
- [BUGFIX] Scraping: exemplars could be dropped in protobuf scraping. #14810
- [BUGFIX] Remote-Write: fix metadata sending for experimental Remote-Write V2. #14766
- [BUGFIX] Remote-Write: Return 4xx not 5xx when timeseries has duplicate label. #14716
- [BUGFIX] Experimental Native Histograms: many fixes for incorrect results, panics, warnings. #14513, #14575, #14598, #14609, #14611, #14771, #14821
- [BUGFIX] TSDB: Only count unknown record types in `record_decode_failures_total` metric. #14042
>>>>>>> 58173ab1e (updated changelog)
=======
* [BUGFIX] Scrape: Do not override target parameter labels with config params. #11029

## 2.55.0-rc.0 / 2024-09-20

* [FEATURE] Support UTF-8 characters in label names - feature flag `utf8-names`. #14482, #14880, #14736, #14727
* [FEATURE] Support config reload automatically - feature flag `auto-reload-config`. #14769
* [FEATURE] Scraping: Add the ability to set custom `http_headers` in config. #14817
* [FEATURE] Scraping: Support feature flag `created-timestamp-zero-ingestion` in OpenMetrics. #14356, #14815
* [FEATURE] Scraping: `scrape_failure_log_file` option to log failures to a file. #14734
* [FEATURE] OTLP receiver: Optional promotion of resource attributes to series labels. #14200
* [FEATURE] Remote-Write: Support Google Cloud Monitoring authorization. #14346
* [FEATURE] Promtool: `tsdb create-blocks` new option to add labels. #14403
* [FEATURE] Promtool: `promtool test` adds `--junit` flag to format results. #14506
* [ENHANCEMENT] OTLP receiver: Warn on exponential histograms with zero count and non-zero sum. #14706
* [ENHANCEMENT] OTLP receiver: Interrupt translation on context cancellation/timeout. #14612
* [ENHANCEMENT] Remote Read client: Enable streaming remote read if the server supports it. #11379
* [ENHANCEMENT] Remote-Write: Don't reshard if we haven't successfully sent a sample since last update. #14450
* [ENHANCEMENT] PromQL: Delay deletion of `__name__` label to the end of the query evaluation. This is **experimental** and enabled under the feature-flag `promql-delayed-name-removal`. #14477
* [ENHANCEMENT] PromQL: Experimental `sort_by_label` and `sort_by_label_desc` sort by all labels when label is equal. #14655
* [ENHANCEMENT] PromQL: Clarify error message logged when Go runtime panic occurs during query evaluation. #14621
* [ENHANCEMENT] PromQL: Use Kahan summation for better accuracy in `avg` and `avg_over_time`. #14413
* [ENHANCEMENT] Tracing: Improve PromQL tracing, including showing the operation performed for aggregates, operators, and calls. #14816
* [ENHANCEMENT] API: Support multiple listening addresses. #14665
* [ENHANCEMENT] TSDB: Backward compatibility with upcoming index v3. #14934
* [PERF] TSDB: Query in-order and out-of-order series together. #14354, #14693, #14714, #14831, #14874, #14948
* [PERF] TSDB: Streamline reading of overlapping out-of-order head chunks. #14729
* [BUGFIX] SD: Fix dropping targets (with feature flag `new-service-discovery-manager`). #13147
* [BUGFIX] SD: Stop storing stale targets (with feature flag `new-service-discovery-manager`). #13622
* [BUGFIX] Scraping: exemplars could be dropped in protobuf scraping. #14810
* [BUGFIX] Remote-Write: fix metadata sending for experimental Remote-Write V2. #14766
* [BUGFIX] Remote-Write: Return 4xx not 5xx when timeseries has duplicate label. #14716
* [BUGFIX] Experimental Native Histograms: many fixes for incorrect results, panics, warnings. #14513, #14575, #14598, #14609, #14611, #14771, #14821
* [BUGFIX] TSDB: Only count unknown record types in `record_decode_failures_total` metric. #14042
>>>>>>> b10c3696c (Revert "updated changelog")
## 2.54.1 / 2024-08-27

3 changes: 3 additions & 0 deletions cmd/prometheus/main.go
@@ -1639,6 +1639,9 @@ func (s *readyStorage) Appender(ctx context.Context) storage.Appender {

type notReadyAppender struct{}

// SetOptions does nothing in this appender implementation.
func (n notReadyAppender) SetOptions(opts *storage.AppendOptions) {}

func (n notReadyAppender) Append(ref storage.SeriesRef, l labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	return 0, tsdb.ErrNotReady
}
5 changes: 5 additions & 0 deletions rules/fixtures/rules1.yaml
@@ -0,0 +1,5 @@
groups:
  - name: test_1
    rules:
      - record: test_2
        expr: vector(2)
4 changes: 4 additions & 0 deletions rules/group.go
@@ -75,6 +75,7 @@ type Group struct {

	// concurrencyController controls the rules evaluation concurrency.
	concurrencyController RuleConcurrencyController
	appOpts               *storage.AppendOptions
}

// GroupEvalIterationFunc is used to implement and extend rule group
@@ -145,6 +146,7 @@ func NewGroup(o GroupOptions) *Group {
		metrics:               metrics,
		evalIterationFunc:     evalIterationFunc,
		concurrencyController: concurrencyController,
		appOpts:               &storage.AppendOptions{DiscardOutOfOrder: true},
	}
}

@@ -564,6 +566,7 @@ func (g *Group) Eval(ctx context.Context, ts time.Time) {
	if s.H != nil {
		_, err = app.AppendHistogram(0, s.Metric, s.T, nil, s.H)
	} else {
		app.SetOptions(g.appOpts)
		_, err = app.Append(0, s.Metric, s.T, s.F)
	}

@@ -660,6 +663,7 @@ func (g *Group) cleanupStaleSeries(ctx context.Context, ts time.Time) {
		return
	}
	app := g.opts.Appendable.Appender(ctx)
	app.SetOptions(g.appOpts)
	queryOffset := g.QueryOffset()
	for _, s := range g.staleSeries {
		// Rule that produced series no longer configured, mark it stale.
47 changes: 47 additions & 0 deletions rules/manager_test.go
@@ -1195,6 +1195,53 @@ func countStaleNaN(t *testing.T, st storage.Storage) int {
	return c
}

func TestRuleMovedBetweenGroups(t *testing.T) {
	if testing.Short() {
		t.Skip("skipping test in short mode.")
	}

	storage := teststorage.New(t, 600000)
	defer storage.Close()
	opts := promql.EngineOpts{
		Logger:     nil,
		Reg:        nil,
		MaxSamples: 10,
		Timeout:    10 * time.Second,
	}
	engine := promql.NewEngine(opts)
	ruleManager := NewManager(&ManagerOptions{
		Appendable: storage,
		Queryable:  storage,
		QueryFunc:  EngineQueryFunc(engine, storage),
		Context:    context.Background(),
		Logger:     promslog.NewNopLogger(),
	})
	var stopped bool
	ruleManager.start()
	defer func() {
		if !stopped {
			ruleManager.Stop()
		}
	}()

	rule2 := "fixtures/rules2.yaml"
	rule1 := "fixtures/rules1.yaml"

	// Load initial configuration of rules2
	require.NoError(t, ruleManager.Update(1*time.Second, []string{rule2}, labels.EmptyLabels(), "", nil))

	// Wait for rule to be evaluated
	time.Sleep(3 * time.Second)

	// Reload configuration of rules1
	require.NoError(t, ruleManager.Update(1*time.Second, []string{rule1}, labels.EmptyLabels(), "", nil))

	// Wait for rule to be evaluated in new location and potential staleness marker
	time.Sleep(3 * time.Second)

	require.Equal(t, 0, countStaleNaN(t, storage)) // Not expecting any stale markers.
}
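The test above counts staleness markers, which are a specific NaN bit pattern rather than any NaN. A minimal sketch of that distinction, assuming the constant below matches `value.StaleNaN` from the Prometheus `model/value` package; the helper names `staleMarker`, `isStaleNaN`, and `countStale` are hypothetical and are not the `countStaleNaN` helper from the test file.

```go
package main

import (
	"fmt"
	"math"
)

// staleNaNBits is assumed to equal value.StaleNaN in Prometheus's model/value package.
const staleNaNBits uint64 = 0x7ff0000000000002

// staleMarker builds the float64 value that marks a series as stale.
func staleMarker() float64 {
	return math.Float64frombits(staleNaNBits)
}

// isStaleNaN compares bit patterns, because math.IsNaN cannot tell a
// staleness marker apart from an ordinary NaN sample.
func isStaleNaN(v float64) bool {
	return math.Float64bits(v) == staleNaNBits
}

// countStale returns how many of the given sample values are staleness markers.
func countStale(samples []float64) int {
	c := 0
	for _, v := range samples {
		if isStaleNaN(v) {
			c++
		}
	}
	return c
}

func main() {
	samples := []float64{2, math.NaN(), staleMarker()}
	fmt.Println("stale markers:", countStale(samples))
}
```

An ordinary `math.NaN()` in the input is not counted, so a check like this only fires when a marker was actually written.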

func TestGroupHasAlertingRules(t *testing.T) {
	tests := []struct {
		group *Group
4 changes: 4 additions & 0 deletions scrape/helpers_test.go
@@ -43,6 +43,8 @@ func (a nopAppendable) Appender(_ context.Context) storage.Appender {

type nopAppender struct{}

func (a nopAppender) SetOptions(opts *storage.AppendOptions) {}

func (a nopAppender) Append(storage.SeriesRef, labels.Labels, int64, float64) (storage.SeriesRef, error) {
	return 0, nil
}
@@ -114,6 +116,8 @@ type collectResultAppender struct {
	pendingMetadata []metadata.Metadata
}

func (a *collectResultAppender) SetOptions(opts *storage.AppendOptions) {}

func (a *collectResultAppender) Append(ref storage.SeriesRef, lset labels.Labels, t int64, v float64) (storage.SeriesRef, error) {
	a.mtx.Lock()
	defer a.mtx.Unlock()
4 changes: 3 additions & 1 deletion scrape/scrape.go
@@ -1864,7 +1864,9 @@ loop:
	if err == nil {
		sl.cache.forEachStale(func(lset labels.Labels) bool {
			// Series no longer exposed, mark it stale.
			app.SetOptions(&storage.AppendOptions{DiscardOutOfOrder: true})
			_, err = app.Append(0, lset, defTime, math.Float64frombits(value.StaleNaN))
			app.SetOptions(nil)
			switch {
			case errors.Is(err, storage.ErrOutOfOrderSample), errors.Is(err, storage.ErrDuplicateSampleForTimestamp):
				// Do not count these in logging, as this is expected if a target
@@ -1970,7 +1972,7 @@ func (sl *scrapeLoop) report(app storage.Appender, start time.Time, duration tim

func (sl *scrapeLoop) reportStale(app storage.Appender, start time.Time) (err error) {
	ts := timestamp.FromTime(start)

	app.SetOptions(&storage.AppendOptions{DiscardOutOfOrder: true})
	stale := math.Float64frombits(value.StaleNaN)
	b := labels.NewBuilder(labels.EmptyLabels())
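The first scrape.go hunk sets the option just before the staleness append and clears it with `SetOptions(nil)` right after. One way to keep that pairing together is a small scoped helper; this is a sketch only, and `withDiscardOOO`, `optionSetter`, and `recordingAppender` are hypothetical names that do not appear in the commit.

```go
package main

import "fmt"

// AppendOptions mirrors the shape of storage.AppendOptions as used in this diff.
type AppendOptions struct {
	DiscardOutOfOrder bool
}

// optionSetter is the hypothetical subset of the appender interface needed here.
type optionSetter interface {
	SetOptions(opts *AppendOptions)
}

// withDiscardOOO is a hypothetical helper: it enables DiscardOutOfOrder for
// the duration of fn and always resets the options to nil afterwards.
func withDiscardOOO(app optionSetter, fn func() error) error {
	app.SetOptions(&AppendOptions{DiscardOutOfOrder: true})
	defer app.SetOptions(nil)
	return fn()
}

// recordingAppender records, for each append, whether the option was active.
type recordingAppender struct {
	opts *AppendOptions
	seen []bool
}

func (r *recordingAppender) SetOptions(opts *AppendOptions) { r.opts = opts }

func (r *recordingAppender) Append() error {
	r.seen = append(r.seen, r.opts != nil && r.opts.DiscardOutOfOrder)
	return nil
}

func main() {
	app := &recordingAppender{}
	_ = withDiscardOOO(app, app.Append) // staleness-marker style append
	_ = app.Append()                    // ordinary append afterwards
	fmt.Println(app.seen)
}
```

The `defer` resets the options even when `fn` returns an error, so later appends on the same appender are not affected by the option.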
