prometheus: Implement sharding mechanism #3241
Conversation
If everyone is happy with this approach and design I'll happily add design and user docs for this.
Wow! This would probably help us quite a bit! I'm wondering how a single statefulset will work with node drains and pod disruption budgets - is it possible that a whole shard will go down simultaneously?
Pod spread and PDB are great points, I need to think about those a bit more. It wouldn't be difficult to extract things into separate statefulsets. The logic is roughly the same (maybe even easier as it wouldn't be ordinal based).
Neat!
pkg/prometheus/promcfg.go (outdated)
{Key: "modulus", Value: shards}, | ||
{Key: "action", Value: "hashmod"}, | ||
}, yaml.MapSlice{ | ||
{Key: "source_labels", Value: []string{"__hash"}}, |
(nit) __tmp_hash instead of __hash, as per the Prometheus documentation on relabeling?
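For context, here is a minimal, self-contained sketch of the pair of relabeling rules the snippet above is building, using __tmp_hash as suggested; the function name and the way the shard number is matched are illustrative assumptions, not necessarily the PR's final code.

package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

// shardingRelabelRules sketches the two relabel rules under discussion:
// hash __address__ modulo the number of shards into __tmp_hash, then keep
// only the targets whose hash equals this instance's shard number.
// (Illustrative only; the real generator lives in pkg/prometheus/promcfg.go.)
func shardingRelabelRules(shards, shard int32) []yaml.MapSlice {
	return []yaml.MapSlice{
		{
			{Key: "source_labels", Value: []string{"__address__"}},
			{Key: "target_label", Value: "__tmp_hash"},
			{Key: "modulus", Value: shards},
			{Key: "action", Value: "hashmod"},
		},
		{
			{Key: "source_labels", Value: []string{"__tmp_hash"}},
			{Key: "regex", Value: fmt.Sprintf("%d", shard)},
			{Key: "action", Value: "keep"},
		},
	}
}

func main() {
	out, _ := yaml.Marshal(shardingRelabelRules(2, 0))
	fmt.Println(string(out))
}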
I'm going to try and explore ways to do this with multiple statefulsets, as with a single statefulset we can't in a meaningful way perform pod spreading and PodDisruptionBudget. Thanks to @vsliouniaev for pointing this out. Will put this in WIP until I have something that can be reviewed.
force-pushed from 7c23a3b to 051dbfe
@brancz We currently shard with stock Prometheus, and want to move to the operator. Following this ticket, please let me know if I can help with testing anything.
force-pushed from 99b4e65 to ef1afc2
Gave this another attempt. Now each statefulset represents a shard and replicas just sets the replicas in each of those shards. There are e2e tests and they all pass, so I think this is ready for first rounds of reviews! :)
Can you rebase, thanks!
force-pushed from a249f1d to eaeebe3
Seems strange, either Travis or GitHub failed, can you push again to retrigger, thanks!
I think this is a legit failure:
Yeah this is definitely a legit failure, I need to just find time to finish this up :)
force-pushed from 9f21065 to 5da804c
Documentation/api.md (outdated)
@@ -493,7 +493,8 @@ PrometheusSpec is a specification of the desired behavior of the Prometheus clus
| image | Image if specified has precedence over baseImage, tag and sha combinations. Specifying the version is still necessary to ensure the Prometheus Operator knows what version of Prometheus is being configured. | *string | false |
| baseImage | Base image to use for a Prometheus deployment. Deprecated: use 'image' instead | string | false |
| imagePullSecrets | An optional list of references to secrets in the same namespace to use for pulling prometheus and alertmanager images from registries see http://kubernetes.io/docs/user-guide/images#specifying-imagepullsecrets-on-a-pod | [][v1.LocalObjectReference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#localobjectreference-v1-core) | false |
- | replicas | Number of instances to deploy for a Prometheus deployment. | *int32 | false |
+ | replicas | Number of replicas of each shard to deploy for a Prometheus deployment. Number of replicas multiplied by shards is the total number of Pods created. | *int32 | false |
+ | shards | Number of shards to distribute targets onto. Number of replicas multiplied by shards is the total number of Pods created. Note that scaling down shards will not reshard data onto remaining instances, it must be manually moved. Increasing shards will not reshard data either but it will continue to be available from the same instances. To query globally use Thanos sidecar and Thanos querier or remote write data to a central location. | *int32 | false |
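For illustration, a tiny sketch of the arithmetic described in these two rows — the total number of Pods is shards multiplied by replicas; treating a nil field as 1 here is an assumption for the example, not necessarily the operator's exact defaulting:

package main

import "fmt"

// expectedPodCount illustrates the documented relationship between the two
// fields: the operator creates shards * replicas Pods in total.
func expectedPodCount(shards, replicas *int32) int32 {
	s, r := int32(1), int32(1) // assumed defaults when the fields are unset
	if shards != nil {
		s = *shards
	}
	if replicas != nil {
		r = *replicas
	}
	return s * r
}

func main() {
	shards, replicas := int32(2), int32(2)
	fmt.Println(expectedPodCount(&shards, &replicas)) // prints 4
}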
Is it worth mentioning that we use the __address__ label as the static shard key?
yes makes sense.
Do you mind rebasing, thanks!
@brancz @s-urbaniak Besides rebasing, is there any other work needed to move this PR forward? I'm very interested in trying out this feature 😄
force-pushed from 2f3aecb to 5510299
This should be ready for review again. The rebase was a bit messy, so I will also review this myself (but unit and e2e tests are passing so that gives me some degree of confidence :) ).
force-pushed from 7d30c78 to bb9755e
I did a round of fixes, I think this should be good for everyone to review :)
I also updated the PR description to reflect the latest implementation state.
Amazing work, mainly nits and questions from my side, nothing blocking! 👏
pkg/prometheus/promcfg.go (outdated)
@@ -1137,6 +1153,19 @@ func getLimit(user uint64, enforced *uint64) uint64 {
	return user
}

func addressSharding(relabelings []yaml.MapSlice, shards int32) []yaml.MapSlice {
Nit: Seems like a confusing function name to me, had to look at the function to understand what it did, maybe something more self-descriptive like injectSharding or generateSharding, wdyt?
Would a unit test be good here for this function to cover some possible edge cases, wdyt?
Agreed, something with "generate" in the function name would make more sense.
There are not really any edge cases here though that aren't covered elsewhere already, as this is really only text templating.
Renaming to generateAddressShardingRelabelingRules
)

func expectedStatefulSetShardNames(
	p *monitoringv1.Prometheus,
Nit: Do we need this format, it only has one var, we can have it all in one line?
I tend to prefer this, as parameters tend to be ever-growing. No strong opinion though.
pkg/prometheus/statefulset.go (outdated)
@@ -871,6 +902,14 @@ func prefixedName(name string) string {
	return fmt.Sprintf("prometheus-%s", name)
}

func prometheusName(name string, shard int32) string {
Nit: Could we add this function closer to the one where it's used, or vice versa, we only seem to use it in expectedStatefulSetShardNames?
Makes sense, will do. I'll also rename it to prometheusNameByShard, which makes it a bit clearer what it does.
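For illustration, a minimal sketch of how these two helpers could fit together, with the signatures simplified to plain strings; the only behaviour confirmed in this thread is that shard 0 keeps the pre-sharding name, so the "-shard-N" suffix used for higher shards is an assumption:

package main

import "fmt"

// prometheusNameByShard sketches a per-shard naming helper: shard 0 keeps the
// pre-sharding StatefulSet name for backward compatibility, while the suffix
// for higher shards is an illustrative assumption.
func prometheusNameByShard(name string, shard int32) string {
	base := fmt.Sprintf("prometheus-%s", name)
	if shard == 0 {
		return base
	}
	return fmt.Sprintf("%s-shard-%d", base, shard)
}

// expectedStatefulSetShardNames returns one StatefulSet name per shard.
func expectedStatefulSetShardNames(name string, shards int32) []string {
	names := make([]string, 0, shards)
	for i := int32(0); i < shards; i++ {
		names = append(names, prometheusNameByShard(name, i))
	}
	return names
}

func main() {
	fmt.Println(expectedStatefulSetShardNames("k8s", 2))
	// [prometheus-k8s prometheus-k8s-shard-1]
}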
	t.Fatal(err)
}

err = wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
We timeout after 1 second?
rather after 1 minute no?
and poll once per second
The function signature is:
func Poll(interval, timeout time.Duration, condition ConditionFunc) error {
So we're checking once every second for 1 minute.
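For reference, a short runnable example of the apimachinery Poll being discussed — interval first, then timeout — assuming k8s.io/apimachinery is on the module path; the condition is a stand-in for the real e2e check:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	start := time.Now()

	// First argument is the poll interval, second is the timeout:
	// the condition runs once per second for up to one minute.
	err := wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
		// Stand-in condition: succeed once three seconds have elapsed.
		return time.Since(start) > 3*time.Second, nil
	})
	fmt.Println(err) // <nil> once the condition returned true
}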
For me GitHub said it's using:
func (f *Framework) Poll(timeout, pollInterval time.Duration, pollFunc func() (bool, error)) error
(prometheus-operator/test/framework/helpers.go, line 140 in a6f0c9a)
It seems like GitHub mistook it, we are indeed using Poll from "k8s.io/apimachinery/pkg/util/wait" and not the one from our helpers.
Side note: Any reason for this? If we don't find our own helper useful and we just want to go with the apimachinery one, it might be nice to open an issue to get rid of the custom one?
Looks like framework.Poll is actually unused. How about we remove it in a follow up?
LGTM modulo @lilic's review comments, thank you! 🎉
lgtm
🎉
"To run Prometheus in a highly available manner, two (or more) instances need to be running with the same configuration, that means they scrape the same targets, which in turn means they will have the same data in memory and on disk, which in turn means they are answering requests the same way. In reality this is not entirely true, as the scrape cycles can be slightly different, and therefore the recorded data can be slightly different. This means that single requests can differ slightly. What all of the above means for Prometheus is that there is a problem when a single Prometheus instance is not able to scrape the entire infrastructure anymore. This is where Prometheus' sharding feature comes into play. It divides the targets Prometheus scrapes into multiple groups, small enough for a single Prometheus instance to scrape. If possible functional sharding is recommended. What is meant by functional sharding is that all instances of Service A are being scraped by Prometheus A" https://github.com/prometheus-operator/prometheus-operator/blob/\ 02a5bac9610299372e9f77cbbe0c824ce636795b/Documentation/high-availability.md#prometheus Not much docs on enabling sharding besides this issue prometheus-operator/prometheus-operator#3130 (comment) and the PR prometheus-operator/prometheus-operator#3241 Signed-off-by: Simão Reis <[email protected]>
This implements the long overdue sharding mechanism that allows for easily sharding a Prometheus cluster.
This is fully backward compatible, but slightly changes the significance of the
replicas
field in the spec. Instead of being blindly copied from the Prometheus spec to the StatefulSet spec, now the replicas in the StatefulSet are calculated by:# of shards * # of replicas
, essentially meaning that.spec.replicas
is now "how many replicas of each shard to create".This is still fully contained in one StatefulSet. An instance knows it's shard by looking at its ordinal within the statefulset. An example of a 2 replicas and 2 shards configuration would result in the following pods:prometheus-0 -> shard 0prometheus-1 -> shard 0prometheus-2 -> shard 1prometheus-3 -> shard 1Edit Nov 2:
Edit Nov 2: This creates a statefulset per shard, the "0" shard defaulting to the naming used before sharding was introduced, to make everything backward compatible. An example of a 2-replica and 2-shard configuration would result in the following pods:
This and various other configurations are present in the unit tests.
As noted in the comments, this does not take care of any resharding work, it's only about sharding the scrape work.
Closes #3130 #2590
@lilic @pgier @simonpasquier @s-urbaniak @paulfantom @metalmatze