Skip to content

Commit

Permalink
Merge branch 'master' of https://github.com/improbable-eng/thanos int…
Browse files Browse the repository at this point in the history
…o add_store_status_metric

# Conflicts:
#	pkg/query/storeset.go
  • Loading branch information
mreichardt95 committed Apr 9, 2019
2 parents 619fda7 + c0bfe35 commit ee001d9
Show file tree
Hide file tree
Showing 86 changed files with 4,223 additions and 1,428 deletions.
4 changes: 2 additions & 2 deletions .circleci/config.yml
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ jobs:
publish_master:
docker:
# Available from https://hub.docker.com/r/circleci/golang/
- image: circleci/golang:1.10
- image: circleci/golang:1.12
working_directory: /go/src/github.com/improbable-eng/thanos
steps:
- checkout
Expand All @@ -76,7 +76,7 @@ jobs:
publish_release:
docker:
# Available from https://hub.docker.com/r/circleci/golang/
- image: circleci/golang:1.10
- image: circleci/golang:1.12
working_directory: /go/src/github.com/improbable-eng/thanos
steps:
- checkout
Expand Down
Empty file added .dep-finished
Empty file.
10 changes: 5 additions & 5 deletions .promu.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,11 +6,11 @@ build:
path: ./cmd/thanos
flags: -a -tags netgo
ldflags: |
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Version={{.Version}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Revision={{.Revision}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.Branch={{.Branch}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.BuildUser={{user}}@{{host}}
-X {{repoPath}}/vendor/github.com/prometheus/common/version.BuildDate={{date "20060102-15:04:05"}}
-X github.com/prometheus/common/version.Version={{.Version}}
-X github.com/prometheus/common/version.Revision={{.Revision}}
-X github.com/prometheus/common/version.Branch={{.Branch}}
-X github.com/prometheus/common/version.BuildUser={{user}}@{{host}}
-X github.com/prometheus/common/version.BuildDate={{date "20060102-15:04:05"}}
crossbuild:
platforms:
- linux/amd64
Expand Down
46 changes: 40 additions & 6 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,14 +10,48 @@ NOTE: As semantic versioning states all 0.y.z releases can contain breaking chan
We use *breaking* word for marking changes that are not backward compatible (relates only to v0.y.z releases.)

## Unreleased

### Added
- [#811](https://github.com/improbable-eng/thanos/pull/811) Remote write receiver
- [#798](https://github.com/improbable-eng/thanos/pull/798) Ability to limit the maximum concurrent about of Series() calls in Thanos Store and the maximum amount of samples.
- [#910](https://github.com/improbable-eng/thanos/pull/910) Query's stores UI page is now sorted by type and old DNS or File SD stores are removed after 5 minutes (configurable via the new `--store.unhealthy-timeout=5m` flag).
- [#900](https://github.com/improbable-eng/thanos/pull/900) Query: New `thanos_store_up` metric indicates status for each node configured.

New options:

* `--store.grpc.series-sample-limit` limits the amount of samples that might be retrieved on a single Series() call. By default it is 0. Consider enabling it by setting it to more than 0 if you are running on limited resources.
* `--store.grpc.series-max-concurrency` limits the number of concurrent Series() calls in Thanos Store. By default it is 20. Considering making it lower or bigger depending on the scale of your deployment.

New metrics:
* `thanos_bucket_store_queries_dropped_total` shows how many queries were dropped due to the samples limit;
* `thanos_bucket_store_queries_concurrent_max` is a constant metric which shows how many Series() calls can concurrently be executed by Thanos Store;
* `thanos_bucket_store_queries_in_flight` shows how many queries are currently "in flight" i.e. they are being executed;
* `thanos_bucket_store_gate_duration_seconds` shows how many seconds it took for queries to pass through the gate in both cases - when that fails and when it does not.

New tracing span:
* `store_query_gate_ismyturn` shows how long it took for a query to pass (or not) through the gate.

:warning: **WARNING** :warning: #798 adds a new default limit to Thanos Store: `--store.grpc.series-max-concurrency`. Most likely you will want to make it the same as `--query.max-concurrent` on Thanos Query.

- [#970](https://github.com/improbable-eng/thanos/pull/970) Added `PartialResponseStrategy` field for `RuleGroups` for `Ruler`.

### Changed
- [#970](https://github.com/improbable-eng/thanos/pull/970) Deprecated partial_response_disabled proto field. Added partial_response_strategy instead. Both in gRPC and Query API.
- [#970](https://github.com/improbable-eng/thanos/pull/970) No `PartialResponseStrategy` field for `RuleGroups` by default means `abort` strategy (old PartialResponse disabled) as this is recommended option for Rules and alerts.

### Fixed
- [#921](https://github.com/improbable-eng/thanos/pull/921) `thanos_objstore_bucket_last_successful_upload_time` now does not appear when no blocks have been uploaded so far
- [#966](https://github.com/improbable-eng/thanos/pull/966) Bucket: verify no longer warns about overlapping blocks, that overlap `0s`

## [v0.3.2](https://github.com/improbable-eng/thanos/releases/tag/v0.3.2) - 2019.03.04

### Added
- [#851](https://github.com/improbable-eng/thanos/pull/851) New read API endpoint for api/v1/rules and api/v1/alerts.
- [#873](https://github.com/improbable-eng/thanos/pull/873) Store: fix set index cache LRU.
- [#873](https://github.com/improbable-eng/thanos/pull/873) Store: fix set index cache LRU

:warning: **WARNING** :warning: #873 fix fixes actual handling of `index-cache-size`. Handling of limit for this cache was
broken so it was unbounded all the time. From this release actual value matters and is extremely low by default. To "revert"
the old behaviour (no boundary), use a large enough value.

### Fixed
- [#833](https://github.com/improbable-eng/thanos/issues/833) Store Gateway matcher regression for intersecting with empty posting.
Expand Down Expand Up @@ -47,14 +81,14 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- [#649](https://github.com/improbable-eng/thanos/issues/649) - Fixed store label values api to add also external label values.
- [#396](https://github.com/improbable-eng/thanos/issues/396) - Fixed sidecar logic for proxying series that has more than 2^16 samples from Prometheus.
- [#732](https://github.com/improbable-eng/thanos/pull/732) - Fixed S3 authentication sequence. You can see new sequence enumerated [here](https://github.com/improbable-eng/thanos/blob/master/docs/storage.md#aws-s3-configuration)
- [#745](https://github.com/improbable-eng/thanos/pull/745) - Fixed race conditions and edge cases for Thanos Querier fanout logic.
- [#745](https://github.com/improbable-eng/thanos/pull/745) - Fixed race conditions and edge cases for Thanos Querier fanout logic.
- [#651](https://github.com/improbable-eng/thanos/issues/651) - Fixed index cache when asked buffer size is bigger than cache max size.

### Changed

- [#529](https://github.com/improbable-eng/thanos/pull/529) Massive improvement for compactor. Downsampling memory consumption was reduce to only store labels and single chunks per each series.
- Qurerier UI: Store page now shows the store APIs per component type.
- Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lot's of things has changed. See details [here #704](https://github.com/improbable-eng/thanos/pull/704) Known changes that affects us:
- Prometheus and TSDB deps are now up to date with ~2.7.0 Prometheus version. Lot's of things has changed. See details [here #704](https://github.com/improbable-eng/thanos/pull/704) Known changes that affects us:
- prometheus/prometheus/discovery/file
- [ENHANCEMENT] Discovery: Improve performance of previously slow updates of changes of targets. #4526
- [BUGFIX] Wait for service discovery to stop before exiting #4508 ??
Expand All @@ -75,11 +109,11 @@ We use *breaking* word for marking changes that are not backward compatible (rel
- S3 provider:
- Added `put_user_metadata` option to config.
- Added `insecure_skip_verify` option to config.

### Deprecated

- Tests against Prometheus below v2.2.1. This does not mean *lack* of support for those. Only that we don't tests the compatibility anymore. See [#758](https://github.com/improbable-eng/thanos/issues/758) for details.

## [v0.2.1](https://github.com/improbable-eng/thanos/releases/tag/v0.2.1) - 2018.12.27

### Added
Expand Down
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@ The philosophy of Thanos and our community is borrowing much from UNIX philosoph
* Write components that work together
* e.g. blocks should be stored in native prometheus format
* Make it easy to read, write, and, run components
* e.g. reduce complexity in system design and implementation
* e.g. reduce complexity in system design and implementation

## Adding New Features / Components

Expand Down
21 changes: 17 additions & 4 deletions MAINTAINERS.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,22 @@
# Core Maintainers of this repository

| Name | Email | GitHub |
|---------------------|-----------------------|------------------------------------------|
| Bartłomiej Płotka | [email protected] | [@bwplotka](https://github.com/bwplotka) |
| Dominic Green | [email protected] | [@domgreen](https://github.com/domgreen) |
| Name | Email | Slack | GitHub |
|-----------------------|------------------------|--------------------------|------------------------------------------------------------|
| Bartłomiej Płotka | [email protected] | `@bwplotka` | [@bwplotka](https://github.com/bwplotka) |
| Dominic Green | [email protected] | `@domgreen` | [@domgreen](https://github.com/domgreen) |
| Frederic Branczyk | [email protected] | `@brancz` | [@brancz](https://github.com/brancz) |
| Giedrius Statkevičius | [email protected] | `@Giedrius Statkevičius` | [@GiedriusS](https://github.com/GiedriusS) |
| Povilas Versockas | [email protected] | `@povilasv` | [@povilasv](https://github.com/povilasv)

We are bunch of people from different companies with various interests and skills.
We are from different parts of Europe: Germany, Lithuania, Poland and UK.
We have something in common though: We all share the love for OpenSource, Golang, Prometheus, :coffee: and Observability topics.

As either Software Developers or SRE (or both!) we've chosen to maintain (mostly in our free time) Thanos, the de facto way to scale awesome [Prometheus](https://prometheus.io) project.

Feel free to contact us (preferably on Slack) anytime for feedback, questions or :beers:/:coffee:/:tea:.

Especially feedback, please share if you have ideas what we can do better!

## Storage plugins maintainers

Expand Down
10 changes: 5 additions & 5 deletions Makefile
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
PREFIX ?= $(shell pwd)
FILES ?= $(shell find . -type f -name '*.go' -not -path "./vendor/*")
DIRECTORIES ?= $(shell find . -path './*' -prune -type d -not -path "./vendor")
DOCKER_IMAGE_NAME ?= thanos
DOCKER_IMAGE_TAG ?= $(subst /,-,$(shell git rev-parse --abbrev-ref HEAD))-$(shell date +%Y-%m-%d)-$(shell git rev-parse --short HEAD)
# $GOPATH/bin might not be in $PATH, so we can't assume `which` would give use
Expand Down Expand Up @@ -118,7 +118,7 @@ docs: $(EMBEDMD) build
.PHONY: check-docs
check-docs: $(EMBEDMD) $(LICHE) build
@EMBEDMD_BIN="$(EMBEDMD)" scripts/genflagdocs.sh check
@$(LICHE) --recursive docs --document-root .
@$(LICHE) --recursive docs --exclude "cloud.tencent.com" --document-root .

# errcheck performs static analysis and returns error if any of the errors is not checked.
.PHONY: errcheck
Expand All @@ -132,7 +132,7 @@ errcheck: $(ERRCHECK)
.PHONY: format
format: $(GOIMPORTS)
@echo ">> formatting code"
@$(GOIMPORTS) -w $(FILES)
@$(GOIMPORTS) -w $(DIRECTORIES)

# proto generates golang files from Thanos proto files.
.PHONY: proto
Expand Down Expand Up @@ -183,8 +183,8 @@ go-mod-tidy: check-git check-bzr
@go mod tidy

.PHONY: check-go-mod
check-go-mod: go-mod-tidy
@git diff --exit-code go.mod go.sum > /dev/null || echo >&2 "go.mod and/or go.sum have uncommited changes. See CONTRIBUTING.md."
check-go-mod:
@go mod verify

# tooling deps. TODO(bwplotka): Pin them all to certain version!
.PHONY: check-git
Expand Down
12 changes: 6 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,13 +8,13 @@

## Overview

Thanos is a set of components that can be composed into a highly available metric
system with unlimited storage capacity, which can be added seamlessly on top of existing
Prometheus deployments.
Thanos is a set of components that can be composed into a highly available metric
system with unlimited storage capacity, which can be added seamlessly on top of existing
Prometheus deployments.

Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric
Thanos leverages the Prometheus 2.0 storage format to cost-efficiently store historical metric
data in any object storage while retaining fast query latencies. Additionally, it provides
a global query view across all Prometheus installations and can merge data from Prometheus
a global query view across all Prometheus installations and can merge data from Prometheus
HA pairs on the fly.

Concretely the aims of the project are:
Expand Down Expand Up @@ -65,7 +65,7 @@ Contributions are very welcome! See our [CONTRIBUTING.md](CONTRIBUTING.md) for m

## Community

Thanos is an open source project and we welcome new contributers and members
Thanos is an open source project and we welcome new contributers and members
of the community. Here are ways to get in touch with the community:

* Slack: [#thanos](https://join.slack.com/t/improbable-eng/shared_invite/enQtMzQ1ODcyMzQ5MjM4LWY5ZWZmNGM2ODc5MmViNmQ3ZTA3ZTY3NzQwOTBlMTkzZmIxZTIxODk0OWU3YjZhNWVlNDU3MDlkZGViZjhkMjc)
Expand Down
4 changes: 2 additions & 2 deletions benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,14 +38,14 @@ We started a prometheus instance with thanos sidecar, scraping 50 targets every
* 0.022 average query time (min: 0.018s, max: 0.073s)
* *0.812 queries per second*

This shows an added latency of *85-150%* when using thanos query. This is more or less expected, as network operations will have to be done twice. We took a profile of thanos query while under load, finding that about a third of the time is being spent evaluating the promql queries. We are looking into including newer versions of the prometheus libraries into thanos that include optimisations to this component.
This shows an added latency of *85-150%* when using thanos query. This is more or less expected, as network operations will have to be done twice. We took a profile of thanos query while under load, finding that about a third of the time is being spent evaluating the promql queries. We are looking into including newer versions of the prometheus libraries into thanos that include optimisations to this component.

Although we have not tested federated prometheus in the same controlled environment, in theory it should incur a similar overhead, as we will still be performing two network hops.

### Store performance
To test the store component, we generated 1 year of simulated metrics (100 timeseries taking random values every 15s, a total of 210 million samples). We were able to run heavy queries to touch all 210 million of these samples, e.g. a sum over 100 timeseries takes about 34.6 seconds. Smaller queries, for example fetching 1 year of samples from a single timeseries, were able to run in about 500 milliseconds.

When enabling downsampling over these timeseries, we were able to reduce query times by over 90%.
When enabling downsampling over these timeseries, we were able to reduce query times by over 90%.

### Ingestion
To try to find the limits of a single thanos-query service, we spun up a number prometheus instances, each scraping 10 metric-producing endpoints every second. We attached a thanos-query endpoint in front of these scrapers, and ran queries that would touch fetch a most recent metric from each of them. Each metric producing endpoint would serve 100 metrics, taking random values, and the query would fetch the most recent value from each of these metrics:
Expand Down
8 changes: 7 additions & 1 deletion cmd/thanos/compact.go
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,9 @@ func registerCompact(m map[string]setupFunc, app *kingpin.Application, name stri

haltOnError := cmd.Flag("debug.halt-on-error", "Halt the process if a critical compaction error is detected.").
Hidden().Default("true").Bool()
acceptMalformedIndex := cmd.Flag("debug.accept-malformed-index",
"Compaction index verification will ignore out of order label names.").
Hidden().Default("false").Bool()

httpAddr := regHTTPAddrFlag(cmd)

Expand Down Expand Up @@ -102,6 +105,7 @@ func registerCompact(m map[string]setupFunc, app *kingpin.Application, name stri
objStoreConfig,
time.Duration(*syncDelay),
*haltOnError,
*acceptMalformedIndex,
*wait,
map[compact.ResolutionLevel]time.Duration{
compact.ResolutionLevelRaw: time.Duration(*retentionRaw),
Expand All @@ -125,6 +129,7 @@ func runCompact(
objStoreConfig *pathOrContent,
syncDelay time.Duration,
haltOnError bool,
acceptMalformedIndex bool,
wait bool,
retentionByResolution map[compact.ResolutionLevel]time.Duration,
component string,
Expand Down Expand Up @@ -162,7 +167,8 @@ func runCompact(
}
}()

sy, err := compact.NewSyncer(logger, reg, bkt, syncDelay, blockSyncConcurrency)
sy, err := compact.NewSyncer(logger, reg, bkt, syncDelay,
blockSyncConcurrency, acceptMalformedIndex)
if err != nil {
return errors.Wrap(err, "create syncer")
}
Expand Down
26 changes: 19 additions & 7 deletions cmd/thanos/flags.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,25 +17,37 @@ import (
"gopkg.in/alecthomas/kingpin.v2"
)

func regCommonServerFlags(cmd *kingpin.CmdClause) (
func regGRPCFlags(cmd *kingpin.CmdClause) (
grpcBindAddr *string,
httpBindAddr *string,
grpcTLSSrvCert *string,
grpcTLSSrvKey *string,
grpcTLSSrvClientCA *string,
peerFunc func(log.Logger, *prometheus.Registry, bool, string, bool) (cluster.Peer, error)) {

) {
grpcBindAddr = cmd.Flag("grpc-address", "Listen ip:port address for gRPC endpoints (StoreAPI). Make sure this address is routable from other components if you use gossip, 'grpc-advertise-address' is empty and you require cross-node connection.").
Default("0.0.0.0:10901").String()

grpcAdvertiseAddr := cmd.Flag("grpc-advertise-address", "Explicit (external) host:port address to advertise for gRPC StoreAPI in gossip cluster. If empty, 'grpc-address' will be used.").
String()

grpcTLSSrvCert = cmd.Flag("grpc-server-tls-cert", "TLS Certificate for gRPC server, leave blank to disable TLS").Default("").String()
grpcTLSSrvKey = cmd.Flag("grpc-server-tls-key", "TLS Key for the gRPC server, leave blank to disable TLS").Default("").String()
grpcTLSSrvClientCA = cmd.Flag("grpc-server-tls-client-ca", "TLS CA to verify clients against. If no client CA is specified, there is no client verification on server side. (tls.NoClientCert)").Default("").String()

return grpcBindAddr,
grpcTLSSrvCert,
grpcTLSSrvKey,
grpcTLSSrvClientCA
}

func regCommonServerFlags(cmd *kingpin.CmdClause) (
grpcBindAddr *string,
httpBindAddr *string,
grpcTLSSrvCert *string,
grpcTLSSrvKey *string,
grpcTLSSrvClientCA *string,
peerFunc func(log.Logger, *prometheus.Registry, bool, string, bool) (cluster.Peer, error)) {

httpBindAddr = regHTTPAddrFlag(cmd)
grpcBindAddr, grpcTLSSrvCert, grpcTLSSrvKey, grpcTLSSrvClientCA = regGRPCFlags(cmd)
grpcAdvertiseAddr := cmd.Flag("grpc-advertise-address", "Explicit (external) host:port address to advertise for gRPC StoreAPI in gossip cluster. If empty, 'grpc-address' will be used.").
String()

clusterBindAddr := cmd.Flag("cluster.address", "Listen ip:port address for gossip cluster.").
Default("0.0.0.0:10900").String()
Expand Down
1 change: 1 addition & 0 deletions cmd/thanos/main.go
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ func main() {
registerCompact(cmds, app, "compact")
registerBucket(cmds, app, "bucket")
registerDownsample(cmds, app, "downsample")
registerReceive(cmds, app, "receive")

cmd, err := app.Parse(os.Args[1:])
if err != nil {
Expand Down
Loading

0 comments on commit ee001d9

Please sign in to comment.