[PLAT-104961] Upgrade thanos to main and v0.35.0 #26

Merged
merged 67 commits into from
Apr 5, 2024

Conversation

jnyi
Collaborator

@jnyi jnyi commented Mar 31, 2024

See https://github.com/databricks/universe/pull/536629

Will keep writer in older version until we figure out thanos-io#7248

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@jnyi jnyi changed the title [PLAT-104961] Test thanos latest main branch [PLAT-104961][DO NOT MERGE] Test thanos latest main branch Apr 1, 2024
@jnyi jnyi changed the title [PLAT-104961][DO NOT MERGE] Test thanos latest main branch [PLAT-104961] Upgrade thanos to main and v0.35.0 Apr 4, 2024
@jnyi jnyi requested review from hczhu-db and christopherzli April 4, 2024 18:01
jacobbaungard and others added 25 commits April 4, 2024 11:22
Forced tracing was always set to true, even if the checkbox in the UI
to enable tracing was not actually checked.

Signed-off-by: Jacob Baungard Hansen <[email protected]>
Update Prometheus version to include
prometheus/prometheus#13242 which is important
for me - it unblocks further postings work.

Signed-off-by: Giedrius Statkevičius <[email protected]>
…os-io#7043)

* Make RetryError and HaltError able to be fetched for root cause

Signed-off-by: Alex Le <[email protected]>

* Added unit test

Signed-off-by: Alex Le <[email protected]>

* fix lint

Signed-off-by: Alex Le <[email protected]>

* fixed IsRetryError and IsHaltError functions

Signed-off-by: Alex Le <[email protected]>

---------

Signed-off-by: Alex Le <[email protected]>
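A minimal sketch of the idea behind the commit above, assuming the usual Go approach of giving a wrapper error an `Unwrap` method so that `errors.As` and `errors.Is` can walk the chain to the root cause. The names `retryError`, `isRetryError`, and `errBucketUnavailable` are hypothetical and are not the Thanos types.

```go
package main

import (
	"errors"
	"fmt"
)

// retryError is a hypothetical wrapper that marks an error as retriable.
// It only illustrates the pattern; it is not the Thanos implementation.
type retryError struct {
	err error
}

func (e retryError) Error() string { return "retriable: " + e.err.Error() }

// Unwrap exposes the wrapped error so errors.Is and errors.As can reach the root cause.
func (e retryError) Unwrap() error { return e.err }

// isRetryError reports whether any error in the chain is a retryError.
func isRetryError(err error) bool {
	var re retryError
	return errors.As(err, &re)
}

// errBucketUnavailable is a hypothetical root cause used for the demonstration.
var errBucketUnavailable = errors.New("bucket unavailable")

func main() {
	// Wrap the root cause as retriable, then add one more layer of context.
	err := fmt.Errorf("compaction failed: %w", retryError{err: errBucketUnavailable})

	fmt.Println("retriable:", isRetryError(err))
	fmt.Println("root cause reachable:", errors.Is(err, errBucketUnavailable))
}
```

Because the check uses `errors.As` rather than a direct type assertion, it still succeeds when callers add their own context with `%w`.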
* CI: Ensure static react-app is checked in

With this commit the CI system should fail if changes to the react-app
have been made without checking in the changes.

Signed-off-by: Jacob Baungard Hansen <[email protected]>

* Add `react-app` as dependency `check-react-app`

To ensure the react-app is rebuilt before checking for changes.

Signed-off-by: Jacob Baungard Hansen <[email protected]>

---------

Signed-off-by: Jacob Baungard Hansen <[email protected]>
Use the new TSDB flag to disable overlapping compaction to fix OOO
samples handling in the Receive component.

Signed-off-by: Giedrius Statkevičius <[email protected]>
…hanos-io#6898)

* [wip] First checkpoint

Signed-off-by: Douglas Camata <[email protected]>

* [wip] Second checkpoint

All tests passing, unit and e2e.

Signed-off-by: Douglas Camata <[email protected]>

* Small random refactors

Signed-off-by: Douglas Camata <[email protected]>

* Add some useful trace tags

Signed-off-by: Douglas Camata <[email protected]>

* Concurrent and traced local writes

Signed-off-by: Douglas Camata <[email protected]>

* Improve variable names in remote writes

Signed-off-by: Douglas Camata <[email protected]>

* Rename `newFanoutForward` function

Signed-off-by: Douglas Camata <[email protected]>

* More refactors

Signed-off-by: Douglas Camata <[email protected]>

* Fix linting issue

Signed-off-by: Douglas Camata <[email protected]>

* Add a quorum test with sloppy quorum

Signed-off-by: Douglas Camata <[email protected]>

* [wip] Try to make retries work

Signed-off-by: Douglas Camata <[email protected]>

* [wip] Checkpoint: wait group still hanging

Signed-off-by: Douglas Camata <[email protected]>

* Some refactors

Signed-off-by: Douglas Camata <[email protected]>

* Add some commented code so I don't lose it

Signed-off-by: Douglas Camata <[email protected]>

* Adapt tests

Signed-off-by: Douglas Camata <[email protected]>

* Remove sloppy quorum code

Signed-off-by: Douglas Camata <[email protected]>

* Move some code around

Signed-off-by: Douglas Camata <[email protected]>

* Remove even more leftover of sloppy quorum

Signed-off-by: Douglas Camata <[email protected]>

* Extract a type to hold function params

Signed-off-by: Douglas Camata <[email protected]>

* Remove unused struct field

Signed-off-by: Douglas Camata <[email protected]>

* Remove useless variable

Signed-off-by: Douglas Camata <[email protected]>

* Remove type that wasn't used enough

Signed-off-by: Douglas Camata <[email protected]>

* Delete function to tighten up max buffered responses

Signed-off-by: Douglas Camata <[email protected]>

* Add comments to some functions

Signed-off-by: Douglas Camata <[email protected]>

* Fix peer up check

Signed-off-by: Douglas Camata <[email protected]>

* Fix size of replication tracking slices

Signed-off-by: Douglas Camata <[email protected]>

* Rename context

Signed-off-by: Douglas Camata <[email protected]>

* Don't do local writes concurrently

Signed-off-by: Douglas Camata <[email protected]>

* Remove extra error logging

Signed-off-by: Douglas Camata <[email protected]>

* Fix syntax after merge

Signed-off-by: Douglas Camata <[email protected]>

* Add missing methods to peersContainer

Signed-off-by: Douglas Camata <[email protected]>

* Fix handler test

Signed-off-by: Douglas Camata <[email protected]>

* Reset peers state on hashring changes

Signed-off-by: Douglas Camata <[email protected]>

* Handle PR comment regarding waitgroup

Signed-off-by: Douglas Camata <[email protected]>

* Set span tags to help debug

Signed-off-by: Douglas Camata <[email protected]>

* Fix concurrency issue

We close the request as soon as quorum is reached and leave a few goroutines running to finish replication and do cleanups.

This means that the context from the HTTP request is cancelled... which ends up also cancelling the pending replication requests.

Signed-off-by: Douglas Camata <[email protected]>

* Fix request ID middleware

Signed-off-by: Douglas Camata <[email protected]>

* Fix `distributeTimeseriesToReplicas` comment

Signed-off-by: Douglas Camata <[email protected]>

* Extract var with 1-indexed replication index

Signed-off-by: Douglas Camata <[email protected]>

* Rename methods in peersContainer interface

Signed-off-by: Douglas Camata <[email protected]>

* Make peerGroup `getConnection` check if peers are up

Signed-off-by: Douglas Camata <[email protected]>

* Remove yet one more not useful log

Signed-off-by: Douglas Camata <[email protected]>

* Remove logger from `h.sendWrites`

Signed-off-by: Douglas Camata <[email protected]>

---------

Signed-off-by: Douglas Camata <[email protected]>
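The "Fix concurrency issue" entry above describes background replication being cancelled together with the HTTP request context. The sketch below shows one common way to avoid that in Go 1.21 or later, using `context.WithoutCancel` plus a separate timeout. This is an assumption for illustration and not necessarily how the upstream change solved it; `replicate`, `detached`, and `runDemo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// replicate is a hypothetical stand-in for a pending replication request.
// It returns ctx.Err() if the context is cancelled before the work completes.
func replicate(ctx context.Context, work time.Duration) error {
	select {
	case <-time.After(work):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// detached returns a context that keeps the parent's values (for example trace
// metadata) but ignores its cancellation. It has its own timeout so background
// work cannot run forever.
func detached(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
	return context.WithTimeout(context.WithoutCancel(parent), timeout)
}

// runDemo cancels the request context early and returns the result of
// replicating with the request context and with the detached context.
func runDemo() (withRequestCtx, withDetachedCtx error) {
	reqCtx, cancelReq := context.WithCancel(context.Background())

	bgCtx, cancelBg := detached(reqCtx, time.Second)
	defer cancelBg()

	// Simulate the HTTP request finishing as soon as quorum is reached.
	cancelReq()

	return replicate(reqCtx, 10*time.Millisecond), replicate(bgCtx, 10*time.Millisecond)
}

func main() {
	reqErr, bgErr := runDemo()
	fmt.Println("request context:", reqErr)
	fmt.Println("detached context:", bgErr)
}
```

With the request context the replication reports the cancellation, while the detached context lets it finish. The explicit timeout matters because a detached context would otherwise have no upper bound.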
1. The `replace` directive in go.mod pins grpc to 1.45.0 because of weaveworks/common#239, but that version has known vulnerabilities. To fix CVE-2023-44478, grpc needs to be upgraded to 1.57.2.
2. To upgrade grpc, weaveworks/common also needs to be upgraded; otherwise the build fails.

Signed-off-by: hanyuting8 <[email protected]>
* Add basic acceptance tests for proxy store
* Fix bug where invalid requests got ignored because of partial response
  strategy

Signed-off-by: Michael Hoffmann <[email protected]>
* fix lazy postings with zero length

Signed-off-by: Ben Ye <[email protected]>

* changelog

Signed-off-by: Ben Ye <[email protected]>

* unit tests

Signed-off-by: Ben Ye <[email protected]>

* fix doc

Signed-off-by: Ben Ye <[email protected]>

---------

Signed-off-by: Ben Ye <[email protected]>
If the requested label is an external label and we have series matchers
we should only return results if the series matchers actually match a
series.

Signed-off-by: Michael Hoffmann <[email protected]>
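The rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: matchers are equality-only and are checked against series labels alone, and the types and function names (`labels`, `matches`, `externalLabelValues`) are hypothetical rather than the Thanos store API.

```go
package main

import "fmt"

// labels is a simplified label set: name -> value.
type labels map[string]string

// matches reports whether every equality matcher is satisfied by the series.
func matches(series, matchers labels) bool {
	for name, want := range matchers {
		if series[name] != want {
			return false
		}
	}
	return true
}

// externalLabelValues returns the value of an external label, but when series
// matchers are given it does so only if at least one series matches them.
func externalLabelValues(name string, ext labels, series []labels, matchers labels) []string {
	v, ok := ext[name]
	if !ok {
		return nil // Not an external label; a real store would consult its index here.
	}
	if len(matchers) == 0 {
		return []string{v}
	}
	for _, s := range series {
		if matches(s, matchers) {
			return []string{v}
		}
	}
	return nil // Matchers were given but nothing matched: return no values.
}

func main() {
	ext := labels{"region": "eu-west"}
	series := []labels{{"__name__": "up", "job": "node"}}

	fmt.Println(externalLabelValues("region", ext, series, nil))
	fmt.Println(externalLabelValues("region", ext, series, labels{"job": "node"}))
	fmt.Println(externalLabelValues("region", ext, series, labels{"job": "missing"}))
}
```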
…-io#7087)

Receiver hangs waiting for the HTTP Handler to shut down if an error occurs
before the Handler is initialized. This might happen, for example, if the hashring
is too small for a given replication factor.

Signed-off-by: Mikhail Nozdrachev <[email protected]>
* Update prometheus/prometheus

This commit updates prometheus/prometheus to latest main (60b6266e).

Signed-off-by: Filip Petkovski <[email protected]>

* Fix file discovery

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
Fix bug introduced in thanos-io#6898: we
were RLock()ing twice. This leads to a deadlock in some situations.

Signed-off-by: Giedrius Statkevičius <[email protected]>
markPeerUnavailable was always taking a lock and in one case we were
calling it with a lock already taken. Fix this.

Signed-off-by: Giedrius Statkevičius <[email protected]>
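Both locking fixes above come from the same property: Go's `sync.RWMutex` is not reentrant, so taking `RLock` twice in one goroutine can deadlock once a writer is waiting, and calling a method that locks while the lock is already held blocks forever. A common way to avoid this, sketched below with hypothetical names (`peerGroup`, `markUnavailableLocked`, `resetAndMark`), is to split each operation into a public method that locks once and a `...Locked` helper that assumes the lock is held. This is an illustration of the pattern, not the actual Thanos change.

```go
package main

import (
	"fmt"
	"sync"
)

// peerGroup is a hypothetical, simplified container; it is not the Thanos type.
type peerGroup struct {
	mu          sync.RWMutex
	unavailable map[string]bool
}

func newPeerGroup() *peerGroup {
	return &peerGroup{unavailable: make(map[string]bool)}
}

// markUnavailable is the public entry point: it takes the write lock exactly once.
func (p *peerGroup) markUnavailable(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.markUnavailableLocked(addr)
}

// markUnavailableLocked assumes the caller already holds p.mu for writing.
func (p *peerGroup) markUnavailableLocked(addr string) {
	p.unavailable[addr] = true
}

// resetAndMark performs several updates under one lock and calls the Locked
// variant, so the mutex is never acquired recursively.
func (p *peerGroup) resetAndMark(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.unavailable = make(map[string]bool)
	p.markUnavailableLocked(addr)
}

// isUp takes a single read lock and does not call other locking methods.
func (p *peerGroup) isUp(addr string) bool {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return !p.unavailable[addr]
}

func main() {
	p := newPeerGroup()
	p.markUnavailable("receive-0")
	p.resetAndMark("receive-1")
	fmt.Println("receive-0 up:", p.isUp("receive-0"))
	fmt.Println("receive-1 up:", p.isUp("receive-1"))
}
```

The naming convention makes the locking requirement visible at each call site, which is what makes double-lock mistakes easier to spot in review.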
The Prometheus Helm chart has been a community-maintained chart for a few
years. As a result, the old example pointed to an outdated chart and the
provided example values no longer work.

This updates the documentation.

Signed-off-by: Mario Constanti <[email protected]>
)

* Adding new method on bucketed bytes to expose used

Signed-off-by: Pedro Tanaka <[email protected]>

* Removing interface, using RWMutex

Signed-off-by: Pedro Tanaka <[email protected]>

---------

Signed-off-by: Pedro Tanaka <[email protected]>
munir131 and others added 21 commits April 4, 2024 11:29
This PR bumps the version of google.golang.org/protobuf to v1.33.0 to fix a
potential vulnerability in the protojson.Unmarshal function [1] that can
occur when unmarshaling a message with a protobuf value.

Even if the function isn't used directly in Thanos it would be safer to
just bump it directly.

[1] https://pkg.go.dev/vuln/GO-2024-2611

Signed-off-by: Daniel Mellado <[email protected]>
…r logo

fix: add anchor tag to all images
Signed-off-by: Payal17122000 <[email protected]>
Do not turn off Ruler if resolving fails. We can still (try to) evaluate
rules even if Alertmanager is not available.

Signed-off-by: Giedrius Statkevičius <[email protected]>
With this commit we only show the tenant-ui box when enforcement of
tenancy is on, as it is not needed otherwise.

Signed-off-by: Jacob Baungard Hansen <[email protected]>
We have detected a problem in the chunk series merger where it will
panic in case it encounters native histogram chunks.
I am using thanos as a library for a project and wanted to use the
penalty function to dedup blocks from Prometheus instances.

Signed-off-by: Pedro Tanaka <[email protected]>
Signed-off-by: Helia Barroso <[email protected]>
Co-authored-by: Helia Barroso <[email protected]>
* Add support for TSDB selector in querier

This PR allows using the query distributed mode against a set of multi-tenant receivers
as described in https://github.com/thanos-io/thanos/blob/main/docs/proposals-done/202301-distributed-query-execution.md#distributed-execution-against-receive-components.

The feature is enabled by a selector.relabel-config flag in the Query component
which allows it to select a subset of TSDBs to query based on their external labels.

Signed-off-by: Filip Petkovski <[email protected]>

* Add CHANGELOG entry and fix docs

Signed-off-by: Filip Petkovski <[email protected]>

* Fix tests

Signed-off-by: Filip Petkovski <[email protected]>

* Add comments

Signed-off-by: Filip Petkovski <[email protected]>

* Add test case for MatchersForLabelSets

Signed-off-by: Filip Petkovski <[email protected]>

* Fix failing test

Signed-off-by: Filip Petkovski <[email protected]>

* Use an unbuffered channel

Signed-off-by: Filip Petkovski <[email protected]>

* Change flag description

Signed-off-by: Filip Petkovski <[email protected]>

* Remove parameter from ServerAsClient

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
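To illustrate what selecting a subset of TSDBs by external labels means, here is a small sketch of a single "keep" rule applied to external label sets. It is a deliberately reduced stand-in written with the standard library only: real relabel configurations support more actions and multiple source labels, and `keepRule`, `newKeepRule`, and `selectTSDBs` are hypothetical names, not the Thanos or Prometheus API.

```go
package main

import (
	"fmt"
	"regexp"
)

// keepRule is a simplified stand-in for a relabel rule with action "keep".
type keepRule struct {
	sourceLabel string
	regex       *regexp.Regexp
}

// newKeepRule anchors the pattern so it must match the whole label value,
// which is how Prometheus relabeling treats its regexes.
func newKeepRule(label, pattern string) keepRule {
	return keepRule{
		sourceLabel: label,
		regex:       regexp.MustCompile("^(?:" + pattern + ")$"),
	}
}

// selectTSDBs returns the external label sets that the rule keeps.
func selectTSDBs(tsdbs []map[string]string, rule keepRule) []map[string]string {
	var kept []map[string]string
	for _, ext := range tsdbs {
		if rule.regex.MatchString(ext[rule.sourceLabel]) {
			kept = append(kept, ext)
		}
	}
	return kept
}

func main() {
	// Each map is the external label set of one tenant TSDB (hypothetical values).
	tsdbs := []map[string]string{
		{"tenant_id": "team-a"},
		{"tenant_id": "team-b"},
		{"tenant_id": "team-ab"},
	}
	rule := newKeepRule("tenant_id", "team-a|team-b")

	for _, ext := range selectTSDBs(tsdbs, rule) {
		fmt.Println("querying TSDB with external labels:", ext)
	}
}
```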
* Update thanos-io/promql-engine

This commit updates the promql-engine module to latest main and modifies
the remote engine to account for the breaking change.

Signed-off-by: Filip Petkovski <[email protected]>

* Fix lint

Signed-off-by: Filip Petkovski <[email protected]>

---------

Signed-off-by: Filip Petkovski <[email protected]>
* add username cfg to rueidis client

Signed-off-by: Thibault Mange <[email protected]>

* update changelog

Signed-off-by: Thibault Mange <[email protected]>

---------

Signed-off-by: Thibault Mange <[email protected]>
* feat(ui): added BlockSizeStats calculation to blocks page

A block can have a list of contained files set in `.thanos.files`.
If the `files` array is set, all referenced files with `size_bytes` set are counted:
- sum of all `chunk/*` file sizes
- size of index file
- total size (sum of both)

Shows statistics about the selected block in the block details view:
- Total size of block
- Size of index (and percentage of total)
- Size of all chunks (and percentage of total)
- Daily growth, based on total size and block duration

Output is humanized up to Pebibytes and fixed to two decimal places;
raw bytes are accessible through mouse over / title text.

Signed-off-by: Markus Möslinger <[email protected]>

* feat(ui): added aggregated BlockSizeStats to blocks row title

Added total size of all blocks from a source to the row title, beneath the source name.

The shown total size is humanized up to pebibytes and fixed to two decimal places;
raw bytes value is accessible through mouse over / title text.

The shown value will refresh with selected compaction levels, but doesn't take block filter into account.

I thought about showing daily growth as well, but just summing all milliseconds of all blocks doesn't work with overlapping blocks / multiple resolutions.

Signed-off-by: Markus Möslinger <[email protected]>

* chore(docs): added UI block size PR to CHANGELOG.md

Signed-off-by: Markus Möslinger <[email protected]>

* chore(ui): removed comments

Automatic code formatting duplicated some comments near import statements.

Signed-off-by: Markus Möslinger <[email protected]>

---------

Signed-off-by: Markus Möslinger <[email protected]>
…-io#7220)

* fix lazy expanded postings cache and bug of non equal matcher with non existent values

Signed-off-by: Ben Ye <[email protected]>

* test case for remove keys noop

Signed-off-by: Ben Ye <[email protected]>

* add promqlsmith fuzz test

Signed-off-by: Ben Ye <[email protected]>

* update

Signed-off-by: Ben Ye <[email protected]>

* changelog

Signed-off-by: Ben Ye <[email protected]>

* fix go mod

Signed-off-by: Ben Ye <[email protected]>

* rename test

Signed-off-by: Ben Ye <[email protected]>

* fix series request timestamp

Signed-off-by: Ben Ye <[email protected]>

* skip e2e test

Signed-off-by: Ben Ye <[email protected]>

* handle non lazy expanded case

Signed-off-by: Ben Ye <[email protected]>

* update comment

Signed-off-by: Ben Ye <[email protected]>

---------

Signed-off-by: Ben Ye <[email protected]>
* bump Prometheus version to include new label matcher regex value optimization

Signed-off-by: Ben Ye <[email protected]>

* update

Signed-off-by: Ben Ye <[email protected]>

* fix again

Signed-off-by: Ben Ye <[email protected]>

* include latest fix

Signed-off-by: Ben Ye <[email protected]>

* update go mod

Signed-off-by: Ben Ye <[email protected]>

* fix explain test

Signed-off-by: Ben Ye <[email protected]>

* fix test again

Signed-off-by: Ben Ye <[email protected]>

* update again

Signed-off-by: Ben Ye <[email protected]>

* update

Signed-off-by: Ben Ye <[email protected]>

* fix tests so far

Signed-off-by: Ben Ye <[email protected]>

* fix compactor tests

Signed-off-by: Ben Ye <[email protected]>

* use own out of order chunk index

Signed-off-by: Ben Ye <[email protected]>

---------

Signed-off-by: Ben Ye <[email protected]>
@jnyi jnyi force-pushed the pull-latest-main branch from 40fee2c to fa9882c Compare April 4, 2024 18:40
@jnyi jnyi force-pushed the pull-latest-main branch from 127e32d to 2b3c102 Compare April 4, 2024 20:41
Signed-off-by: Yi Jin <[email protected]>
@jnyi jnyi merged commit 995b2b5 into databricks:db_main Apr 5, 2024
12 checks passed
@jnyi jnyi deleted the pull-latest-main branch June 1, 2024 04:19
@jnyi jnyi restored the pull-latest-main branch June 1, 2024 04:19
@jnyi jnyi deleted the pull-latest-main branch June 1, 2024 04:20