
prometheus: Implement sharding mechanism #3241

Merged · 1 commit · Nov 2, 2020

Conversation

brancz
Contributor

@brancz brancz commented May 25, 2020

This implements the long overdue sharding mechanism that allows for easily sharding a Prometheus cluster.

This is fully backward compatible, but it slightly changes the significance of the replicas field in the spec. Instead of being copied verbatim from the Prometheus spec to the StatefulSet spec, the StatefulSet replicas are now calculated as # of shards * # of replicas, meaning that .spec.replicas is now "how many replicas of each shard to create".

This is still fully contained in one StatefulSet. An instance knows its shard by looking at its ordinal within the StatefulSet. A configuration with 2 replicas and 2 shards would result in the following pods:

prometheus-0 -> shard 0
prometheus-1 -> shard 0
prometheus-2 -> shard 1
prometheus-3 -> shard 1

Edit Nov 2:

This creates a StatefulSet per shard, with the "0" shard keeping the naming that existed before sharding was introduced, so that everything stays backward compatible. A configuration with 2 replicas and 2 shards results in the following pods:

  • prometheus-0 -> shard 0, replica 0
  • prometheus-1 -> shard 0, replica 1
  • prometheus-shard-1-0 -> shard 1, replica 0
  • prometheus-shard-1-1 -> shard 1, replica 1

This and various other configurations are present in the unit tests.
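
To make the 2 shards x 2 replicas example concrete, here is a minimal sketch (not the merged code) of a corresponding Prometheus object in Go, assuming the monitoringv1 types with the Replicas and Shards fields described in the API docs; the object name and namespace are made up for illustration.

package main

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// A Prometheus object with 2 shards and 2 replicas per shard. The operator
	// reconciles this into one StatefulSet per shard, i.e. 2 * 2 = 4 pods total.
	p := &monitoringv1.Prometheus{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "example",    // hypothetical name
			Namespace: "monitoring", // hypothetical namespace
		},
		Spec: monitoringv1.PrometheusSpec{
			Replicas: int32Ptr(2), // replicas per shard
			Shards:   int32Ptr(2), // shards to distribute scrape targets onto
		},
	}
	fmt.Printf("expected pods: %d\n", *p.Spec.Shards * *p.Spec.Replicas)
}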

As noted in the comments, this does not take care of any resharding work; it only shards the scrape work.

Closes #3130 #2590

@lilic @pgier @simonpasquier @s-urbaniak @paulfantom @metalmatze

@brancz
Contributor Author

brancz commented May 25, 2020

If everyone is happy with this approach and design I'll happily add design and user docs for this.

@vsliouniaev
Contributor

Wow! This would probably help us quite a bit!

I'm wondering how a single statefulset will work with node drains and pod disruption budgets - is it possible that a whole shard will go down simultaneously?

@brancz
Contributor Author

brancz commented May 25, 2020

Pod spread and PDBs are great points; I need to think about those a bit more. It wouldn't be difficult to extract things into separate StatefulSets. The logic is roughly the same (maybe even easier, as it wouldn't be ordinal-based). For shards == 1 we would do exactly what we do today, and only for shards > 1 would we create more.

Contributor

@simonpasquier simonpasquier left a comment

Neat!

{Key: "modulus", Value: shards},
{Key: "action", Value: "hashmod"},
}, yaml.MapSlice{
{Key: "source_labels", Value: []string{"__hash"}},
Contributor

(nit) __tmp_hash instead of __hash as per Prometheus documentation on relabeling?
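
For context, the generated relabeling could look roughly like the following sketch, using the __tmp_hash temporary label suggested in this nit. The function name is made up, and the keep regex assumes the shard number gets substituted per shard (an implementation detail not visible in this hunk), so treat it as an illustration of the hashmod approach rather than the exact generated config. It assumes "fmt" and yaml "gopkg.in/yaml.v2" are imported as in the surrounding file.

// shardingRelabelingSketch returns relabeling rules that hash the target's
// __address__ label into [0, shards) and keep only targets whose hash equals
// this instance's shard.
func shardingRelabelingSketch(shards, shard int32) []yaml.MapSlice {
	return []yaml.MapSlice{
		{
			// Hash the target address into one of `shards` buckets.
			{Key: "source_labels", Value: []string{"__address__"}},
			{Key: "target_label", Value: "__tmp_hash"},
			{Key: "modulus", Value: shards},
			{Key: "action", Value: "hashmod"},
		},
		{
			// Keep only targets whose hash matches this shard.
			{Key: "source_labels", Value: []string{"__tmp_hash"}},
			{Key: "regex", Value: fmt.Sprintf("%d", shard)},
			{Key: "action", Value: "keep"},
		},
	}
}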

@brancz
Contributor Author

brancz commented May 26, 2020

I'm going to explore ways to do this with multiple StatefulSets, as with a single StatefulSet we can't do pod spreading or apply a PodDisruptionBudget in a meaningful way. Thanks to @vsliouniaev for pointing this out. I will put this in WIP until I have something that can be reviewed.

@brancz brancz changed the title prometheus: Implement sharding mechanism WIP: prometheus: Implement sharding mechanism May 26, 2020
@brancz brancz force-pushed the sharding branch 2 times, most recently from 7c23a3b to 051dbfe Compare May 26, 2020 17:50
@temujin9

temujin9 commented Jun 1, 2020

@brancz We currently shard with stock Prometheus, and want to move to the operator. Following this ticket, please let me know if I can help with testing anything.

@brancz brancz force-pushed the sharding branch 2 times, most recently from 99b4e65 to ef1afc2 Compare June 15, 2020 08:09
@brancz brancz changed the title WIP: prometheus: Implement sharding mechanism prometheus: Implement sharding mechanism Jun 15, 2020
@brancz
Contributor Author

brancz commented Jun 15, 2020

Gave this another attempt. Now each statefulset represents a shard and replicas just sets the replicas in each of those shards. There are e2e tests and they all pass, so I think this is ready for first rounds of reviews! :)

@lilic
Contributor

lilic commented Jun 17, 2020

Can you rebase, thanks!

@brancz brancz force-pushed the sharding branch 3 times, most recently from a249f1d to eaeebe3 Compare July 3, 2020 14:42
@lilic
Contributor

lilic commented Jul 10, 2020

continuous-integration/travis-ci Expected — Waiting for status to be reported

Seems strange; either Travis or GitHub failed. Can you push again to retrigger? Thanks!

@s-urbaniak
Contributor

I think this is a legit failure:

 --- FAIL: TestAllNS/y/ShardingProvisioning

@brancz
Contributor Author

brancz commented Aug 4, 2020

Yeah this is definitely a legit failure, I need to just find time to finish this up :)

@brancz brancz requested a review from a team as a code owner August 13, 2020 08:03
@brancz brancz requested review from s-urbaniak and removed request for a team August 13, 2020 08:03
@brancz brancz force-pushed the sharding branch 3 times, most recently from 9f21065 to 5da804c Compare August 13, 2020 09:03
@@ -493,7 +493,8 @@ PrometheusSpec is a specification of the desired behavior of the Prometheus clus
| image | Image if specified has precedence over baseImage, tag and sha combinations. Specifying the version is still necessary to ensure the Prometheus Operator knows what version of Prometheus is being configured. | *string | false |
| baseImage | Base image to use for a Prometheus deployment. Deprecated: use 'image' instead | string | false |
| imagePullSecrets | An optional list of references to secrets in the same namespace to use for pulling prometheus and alertmanager images from registries see http://kubernetes.io/docs/user-guide/images#specifying-imagepullsecrets-on-a-pod | [][v1.LocalObjectReference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#localobjectreference-v1-core) | false |
| replicas | Number of instances to deploy for a Prometheus deployment. | *int32 | false |
| replicas | Number of replicas of each shard to deploy for a Prometheus deployment. Number of replicas multiplied by shards is the total number of Pods created. | *int32 | false |
| shards | Number of shards to distribute targets onto. Number of replicas multiplied by shards is the total number of Pods created. Note that scaling down shards will not reshard data onto remaining instances, it must be manually moved. Increasing shards will not reshard data either but it will continue to be available from the same instances. To query globally use Thanos sidecar and Thanos querier or remote write data to a central location. | *int32 | false |
Contributor

is it worth mentioning that we use __address__ label as the static shard key?

Contributor Author

yes makes sense.

@brancz
Contributor Author

brancz commented Sep 2, 2020

  1. Choosing __address__ statically was simply a matter of not exposing configuration until there is a very good reason to do so. There is no practical reason why any two targets should have the same address (with the very rare exception of multiple metrics endpoints on the same listener, but even that should disappear in the distribution). I'm happy to reconsider this if we have a sound use case, but wanted to start with the least amount of configuration necessary.
  2. Agreed, some documentation is deserved here. I would think a user guide would be best, as the API docs already explain the feature, or do you think it would make more sense as part of the design doc?

Contributor

@lilic lilic left a comment

Do you mind rebasing, thanks!

Documentation/api.md (resolved)
@mcavoyk

mcavoyk commented Oct 28, 2020

@brancz @s-urbaniak Besides rebasing, is there any other work needed to move this PR forward? I'm very interested in trying out this feature 😄

@brancz brancz force-pushed the sharding branch 5 times, most recently from 2f3aecb to 5510299 Compare November 1, 2020 10:45
@brancz
Contributor Author

brancz commented Nov 1, 2020

This should be ready for review again. The rebase was a bit messy, so I will also review this myself (but unit and e2e tests are passing so that gives me some degree of confidence :) ).

@brancz brancz force-pushed the sharding branch 2 times, most recently from 7d30c78 to bb9755e Compare November 1, 2020 16:48
@brancz
Contributor Author

brancz commented Nov 1, 2020

I did a round of fixes, I think this should be good for everyone to review :)

@brancz
Contributor Author

brancz commented Nov 2, 2020

I also updated the PR description to reflect the latest implementation state.

Contributor

@lilic lilic left a comment

Amazing work, mainly nits and questions from my side, nothing blocking! 👏

pkg/prometheus/operator.go (resolved, outdated)
pkg/prometheus/operator_test.go (resolved)
@@ -1137,6 +1153,19 @@ func getLimit(user uint64, enforced *uint64) uint64 {
return user
}

func addressSharding(relabelings []yaml.MapSlice, shards int32) []yaml.MapSlice {
Contributor

Nit: Seems like a confusing function name to me, had to look at the function to understand what it did, maybe something more self-descriptive like injectSharding or generateSharding wdyt?

Would a unit test be good here for this function to cover some possible edge cases, wdyt?

Contributor Author

Agreed, something with "generate" in the function name would make more sense.

Contributor Author

There are not really any edge cases here though that aren't covered elsewhere already, as this is really only text templating.

Contributor Author

Renaming to generateAddressShardingRelabelingRules

)

func expectedStatefulSetShardNames(
p *monitoringv1.Prometheus,
Contributor

Nit: Do we need this format, it only has one var, we can have it all in one line?

Contributor Author

I tend to prefer this, as parameters tend to be ever-growing. No strong opinion though.

@@ -871,6 +902,14 @@ func prefixedName(name string) string {
return fmt.Sprintf("prometheus-%s", name)
}

func prometheusName(name string, shard int32) string {
Contributor

Nit: Could we add this function closer to the one where it's used, or vice versa, we only seem to use it in expectedStatefulSetShardNames?

Contributor Author

Makes sense, will do. I'll also rename it to prometheusNameByShard, which makes it a bit clearer what it does.
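
For reference, a sketch of how the renamed helper and its caller could fit together is below. The "-shard-<n>" suffix and the defaulting of a nil Shards field to 1 are assumptions based on the pod names and backward-compatibility notes in the PR description, not a copy of the merged code.

package prometheus

import (
	"fmt"

	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
)

// prefixedName returns the base name, as shown in the hunk above.
func prefixedName(name string) string {
	return fmt.Sprintf("prometheus-%s", name)
}

// prometheusNameByShard returns the StatefulSet name for a given shard.
// Shard 0 keeps the pre-sharding name so existing deployments stay
// backward compatible; other shards get a "-shard-<n>" suffix.
func prometheusNameByShard(name string, shard int32) string {
	base := prefixedName(name)
	if shard == 0 {
		return base
	}
	return fmt.Sprintf("%s-shard-%d", base, shard)
}

// expectedStatefulSetShardNames lists the StatefulSet names expected for a
// Prometheus object, one entry per shard.
func expectedStatefulSetShardNames(
	p *monitoringv1.Prometheus,
) []string {
	shards := int32(1)
	if p.Spec.Shards != nil && *p.Spec.Shards > 1 {
		shards = *p.Spec.Shards
	}
	names := make([]string, 0, shards)
	for i := int32(0); i < shards; i++ {
		names = append(names, prometheusNameByShard(p.Name, i))
	}
	return names
}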

t.Fatal(err)
}

err = wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
Contributor

We timeout after 1 second?

Contributor

rather after 1 minute no?

Contributor

and poll once per second

Contributor Author

The function signature is:

func Poll(interval, timeout time.Duration, condition ConditionFunc) error {

So we're checking once every second for 1 minute.
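
For illustration, the apimachinery helper is used roughly like this; the condition body here is a placeholder, while the real e2e test checks the sharded Prometheus setup.

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// checkCondition is a hypothetical stand-in for the real e2e check.
func checkCondition() (bool, error) { return true, nil }

func main() {
	// Poll the condition once per second, giving up after one minute.
	err := wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
		return checkCondition()
	})
	if err != nil {
		fmt.Println("condition not met within the timeout:", err)
	}
}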

Contributor

For me GitHub said it's using:

func (f *Framework) Poll(timeout, pollInterval time.Duration, pollFunc func() (bool, error)) error

Contributor

@lilic lilic Nov 2, 2020

It seems like GitHub got that wrong; we are indeed using Poll from "k8s.io/apimachinery/pkg/util/wait" and not the one from our helpers.

Side note: Any reason for this? If we don't find our own helper useful and just want to go with the apimachinery one, it might be nice to open an issue to get rid of the custom one.

Contributor Author

Looks like framework.Poll is actually unused. How about we remove it in a follow up?

test/e2e/prometheus_test.go (resolved, outdated)
test/e2e/prometheus_test.go (resolved)
test/framework/service.go (resolved, outdated)
test/framework/pod.go (resolved, outdated)
@s-urbaniak
Contributor

LGTM modulo @lilic's review comments, thank you! 🎉

Contributor

@lilic lilic left a comment

lgtm

🎉

@brancz brancz merged commit 7daed4b into prometheus-operator:master Nov 2, 2020
@brancz brancz deleted the sharding branch November 2, 2020 11:32
sreis added a commit to opstrace/opstrace that referenced this pull request Dec 23, 2020
"To run Prometheus in a highly available manner, two (or more) instances need
to be running with the same configuration, that means they scrape the same
targets, which in turn means they will have the same data in memory and on
disk, which in turn means they are answering requests the same way. In reality
this is not entirely true, as the scrape cycles can be slightly different, and
therefore the recorded data can be slightly different. This means that single
requests can differ slightly. What all of the above means for Prometheus is
that there is a problem when a single Prometheus instance is not able to scrape
the entire infrastructure anymore. This is where Prometheus' sharding feature
comes into play. It divides the targets Prometheus scrapes into multiple
groups, small enough for a single Prometheus instance to scrape.  If possible
functional sharding is recommended. What is meant by functional sharding is
that all instances of Service A are being scraped by Prometheus A"

https://github.com/prometheus-operator/prometheus-operator/blob/02a5bac9610299372e9f77cbbe0c824ce636795b/Documentation/high-availability.md#prometheus

Not much docs on enabling sharding besides this issue
prometheus-operator/prometheus-operator#3130 (comment)
and the PR
prometheus-operator/prometheus-operator#3241

Signed-off-by: Simão Reis <[email protected]>

Successfully merging this pull request may close these issues.

Horizontal scaling via sharding