prometheus: Implement sharding mechanism #3241
Conversation
If everyone is happy with this approach and design I'll happily add design and user docs for this.
Wow! This would probably help us quite a bit! I'm wondering how a single statefulset will work with node drains and pod disruption budgets - is it possible that a whole shard will go down simultaneously?
Pod spread and PDB are great points, I need to think about those a bit more. It wouldn't be difficult to extract things into separate statefulsets. The logic is roughly the same (maybe even easier as it wouldn't be ordinal based).
Neat!
pkg/prometheus/promcfg.go (outdated)
{Key: "modulus", Value: shards}, | ||
{Key: "action", Value: "hashmod"}, | ||
}, yaml.MapSlice{ | ||
{Key: "source_labels", Value: []string{"__hash"}}, |
(nit) __tmp_hash instead of __hash, as per the Prometheus documentation on relabeling?
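For context, here is a minimal, self-contained sketch of the pair of relabeling rules the snippet above is building, using __tmp_hash as suggested; the function name and the way the shard number is matched are illustrative assumptions, not necessarily the PR's final code.

package main

import (
	"fmt"

	yaml "gopkg.in/yaml.v2"
)

// shardingRelabelRules sketches the two relabel rules under discussion:
// hash __address__ modulo the number of shards into __tmp_hash, then keep
// only the targets whose hash equals this instance's shard number.
// (Illustrative only; the real generator lives in pkg/prometheus/promcfg.go.)
func shardingRelabelRules(shards, shard int32) []yaml.MapSlice {
	return []yaml.MapSlice{
		{
			{Key: "source_labels", Value: []string{"__address__"}},
			{Key: "target_label", Value: "__tmp_hash"},
			{Key: "modulus", Value: shards},
			{Key: "action", Value: "hashmod"},
		},
		{
			{Key: "source_labels", Value: []string{"__tmp_hash"}},
			{Key: "regex", Value: fmt.Sprintf("%d", shard)},
			{Key: "action", Value: "keep"},
		},
	}
}

func main() {
	out, _ := yaml.Marshal(shardingRelabelRules(2, 0))
	fmt.Println(string(out))
}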
I'm going to try and explore ways to do this with multiple statefulsets, as with a single statefulset we can't in a meaningful way perform pod spreading and PodDisruptionBudget. Thanks to @vsliouniaev for pointing this out. Will put this in WIP until I have something that can be reviewed.
force-pushed from 7c23a3b to 051dbfe
@brancz We currently shard with stock Prometheus, and want to move to the operator. Following this ticket, please let me know if I can help with testing anything.
force-pushed from 99b4e65 to ef1afc2
Gave this another attempt. Now each statefulset represents a shard and replicas just sets the replicas in each of those shards. There are e2e tests and they all pass, so I think this is ready for first rounds of reviews! :)
Can you rebase, thanks!
force-pushed from a249f1d to eaeebe3
Seems strange, either Travis or GitHub failed, can you push again to retrigger, thanks!
I think this is a legit failure:
Yeah this is definitely a legit failure, I need to just find time to finish this up :)
force-pushed from 9f21065 to 5da804c
Documentation/api.md (outdated)
@@ -493,7 +493,8 @@ PrometheusSpec is a specification of the desired behavior of the Prometheus clus
| image | Image if specified has precedence over baseImage, tag and sha combinations. Specifying the version is still necessary to ensure the Prometheus Operator knows what version of Prometheus is being configured. | *string | false |
| baseImage | Base image to use for a Prometheus deployment. Deprecated: use 'image' instead | string | false |
| imagePullSecrets | An optional list of references to secrets in the same namespace to use for pulling prometheus and alertmanager images from registries see http://kubernetes.io/docs/user-guide/images#specifying-imagepullsecrets-on-a-pod | [][v1.LocalObjectReference](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.17/#localobjectreference-v1-core) | false |
- | replicas | Number of instances to deploy for a Prometheus deployment. | *int32 | false |
+ | replicas | Number of replicas of each shard to deploy for a Prometheus deployment. Number of replicas multiplied by shards is the total number of Pods created. | *int32 | false |
+ | shards | Number of shards to distribute targets onto. Number of replicas multiplied by shards is the total number of Pods created. Note that scaling down shards will not reshard data onto remaining instances, it must be manually moved. Increasing shards will not reshard data either but it will continue to be available from the same instances. To query globally use Thanos sidecar and Thanos querier or remote write data to a central location. | *int32 | false |
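For illustration, a tiny sketch of the arithmetic described in these two rows — the total number of Pods is shards multiplied by replicas; treating a nil field as 1 here is an assumption for the example, not necessarily the operator's exact defaulting:

package main

import "fmt"

// expectedPodCount illustrates the documented relationship between the two
// fields: the operator creates shards * replicas Pods in total.
func expectedPodCount(shards, replicas *int32) int32 {
	s, r := int32(1), int32(1) // assumed defaults when the fields are unset
	if shards != nil {
		s = *shards
	}
	if replicas != nil {
		r = *replicas
	}
	return s * r
}

func main() {
	shards, replicas := int32(2), int32(2)
	fmt.Println(expectedPodCount(&shards, &replicas)) // prints 4
}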
Is it worth mentioning that we use the __address__ label as the static shard key?
yes makes sense.
Do you mind rebasing, thanks!
@brancz @s-urbaniak Besides rebasing, is there any other work needed to move this PR forward? I'm very interested in trying out this feature 😄
force-pushed from 2f3aecb to 5510299
This should be ready for review again. The rebase was a bit messy, so I will also review this myself (but unit and e2e tests are passing so that gives me some degree of confidence :) ).
force-pushed from 7d30c78 to bb9755e
I did a round of fixes, I think this should be good for everyone to review :)
I also updated the PR description to reflect the latest implementation state.
Amazing work, mainly nits and questions from my side, nothing blocking! 👏
pkg/prometheus/promcfg.go (outdated)
@@ -1137,6 +1153,19 @@ func getLimit(user uint64, enforced *uint64) uint64 {
	return user
}

func addressSharding(relabelings []yaml.MapSlice, shards int32) []yaml.MapSlice {
Nit: Seems like a confusing function name to me, had to look at the function to understand what it did, maybe something more self-descriptive like injectSharding or generateSharding, wdyt?
Would a unit test be good here for this function to cover some possible edge cases, wdyt?
Agreed, something with "generate" in the function name would make more sense.
There are not really any edge cases here though that aren't covered elsewhere already, as this is really only text templating.
Renaming to generateAddressShardingRelabelingRules
)

func expectedStatefulSetShardNames(
	p *monitoringv1.Prometheus,
Nit: Do we need this format, it only has one var, we can have it all in one line?
I tend to prefer this, as parameters tend to be ever-growing. No strong opinion though.
pkg/prometheus/statefulset.go (outdated)
@@ -871,6 +902,14 @@ func prefixedName(name string) string {
	return fmt.Sprintf("prometheus-%s", name)
}

func prometheusName(name string, shard int32) string {
Nit: Could we add this function closer to the one where it's used, or vice versa, we only seem to use it in expectedStatefulSetShardNames?
Makes sense, will do. I'll also rename it to prometheusNameByShard, which makes it a bit clearer what it does.
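For illustration, a minimal sketch of how these two helpers could fit together, with the signatures simplified to plain strings; the only behaviour confirmed in this thread is that shard 0 keeps the pre-sharding name, so the "-shard-N" suffix used for higher shards is an assumption:

package main

import "fmt"

// prometheusNameByShard sketches a per-shard naming helper: shard 0 keeps the
// pre-sharding StatefulSet name for backward compatibility, while the suffix
// for higher shards is an illustrative assumption.
func prometheusNameByShard(name string, shard int32) string {
	base := fmt.Sprintf("prometheus-%s", name)
	if shard == 0 {
		return base
	}
	return fmt.Sprintf("%s-shard-%d", base, shard)
}

// expectedStatefulSetShardNames returns one StatefulSet name per shard.
func expectedStatefulSetShardNames(name string, shards int32) []string {
	names := make([]string, 0, shards)
	for i := int32(0); i < shards; i++ {
		names = append(names, prometheusNameByShard(name, i))
	}
	return names
}

func main() {
	fmt.Println(expectedStatefulSetShardNames("k8s", 2))
	// [prometheus-k8s prometheus-k8s-shard-1]
}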
	t.Fatal(err)
}

err = wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
We timeout after 1 second?
rather after 1 minute no?
and poll once per second
The function signature is:
func Poll(interval, timeout time.Duration, condition ConditionFunc) error {
So we're checking once every second for 1 minute.
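For reference, a short runnable example of the apimachinery Poll being discussed — interval first, then timeout — assuming k8s.io/apimachinery is on the module path; the condition is a stand-in for the real e2e check:

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	start := time.Now()

	// First argument is the poll interval, second is the timeout:
	// the condition runs once per second for up to one minute.
	err := wait.Poll(time.Second, 1*time.Minute, func() (bool, error) {
		// Stand-in condition: succeed once three seconds have elapsed.
		return time.Since(start) > 3*time.Second, nil
	})
	fmt.Println(err) // <nil> once the condition returned true
}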
For me GitHub said it's using:
func (f *Framework) Poll(timeout, pollInterval time.Duration, pollFunc func() (bool, error)) error
(prometheus-operator/test/framework/helpers.go, line 140 in a6f0c9a)
It seems like GitHub mistook it, we are indeed using Poll from "k8s.io/apimachinery/pkg/util/wait" and not the one from our helpers.
Side note: Any reason for this? If we don't find our own helper useful and we just want to go with the apimachinery one, it might be nice to open an issue to get rid of the custom one?
Looks like framework.Poll is actually unused. How about we remove it in a follow up?
LGTM modulo @lilic's review comments, thank you! 🎉
lgtm
🎉
"To run Prometheus in a highly available manner, two (or more) instances need to be running with the same configuration, that means they scrape the same targets, which in turn means they will have the same data in memory and on disk, which in turn means they are answering requests the same way. In reality this is not entirely true, as the scrape cycles can be slightly different, and therefore the recorded data can be slightly different. This means that single requests can differ slightly. What all of the above means for Prometheus is that there is a problem when a single Prometheus instance is not able to scrape the entire infrastructure anymore. This is where Prometheus' sharding feature comes into play. It divides the targets Prometheus scrapes into multiple groups, small enough for a single Prometheus instance to scrape. If possible functional sharding is recommended. What is meant by functional sharding is that all instances of Service A are being scraped by Prometheus A" https://github.com/prometheus-operator/prometheus-operator/blob/\ 02a5bac9610299372e9f77cbbe0c824ce636795b/Documentation/high-availability.md#prometheus Not much docs on enabling sharding besides this issue prometheus-operator/prometheus-operator#3130 (comment) and the PR prometheus-operator/prometheus-operator#3241 Signed-off-by: Simão Reis <[email protected]>
This implements the long overdue sharding mechanism that allows for easily sharding a Prometheus cluster.
This is fully backward compatible, but slightly changes the significance of the
replicas
field in the spec. Instead of being blindly copied from the Prometheus spec to the StatefulSet spec, now the replicas in the StatefulSet are calculated by:# of shards * # of replicas
, essentially meaning that.spec.replicas
is now "how many replicas of each shard to create".This is still fully contained in one StatefulSet. An instance knows it's shard by looking at its ordinal within the statefulset. An example of a 2 replicas and 2 shards configuration would result in the following pods:prometheus-0 -> shard 0prometheus-1 -> shard 0prometheus-2 -> shard 1prometheus-3 -> shard 1Edit Nov 2:
Edit Nov 2: This creates a statefulset per shard, the "0" shard defaulting to the naming used before sharding was introduced, to make everything backward compatible. An example of a 2-replica and 2-shard configuration would result in the following pods:
This and various other configurations are present in the unit tests.
As noted in the comments, this does not take care of any resharding work, it's only about sharding the scrape work.
Closes #3130 #2590
@lilic @pgier @simonpasquier @s-urbaniak @paulfantom @metalmatze