self-deployed rule-evaluator "One or more TimeSeries could not be written" #1083
I believe this statement points to your issue: if you don't have a project label in your recording rule, every project is writing to the same time series. Hence you are having collisions, because to Google Cloud Monitoring it looks like you're writing to the same time series more frequently than every 60 seconds. For example, if you have two projects each writing that metric, it's as if you're sending the time series every 30 seconds (once for each project over your 60-second interval). The public documentation points out that adding some of those labels is the correct solution here: https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#timeseries-collisions

I believe this is working as designed. Before you migrated to GMP, your metrics were likely local to your cluster, so there were no collisions. With GMP, your metrics are global, so this is something you would have to update.
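For illustration, here's a hypothetical pair of rules (the metric and rule names are made up, not from your config) showing the colliding shape next to the safe one:

```yaml
groups:
  - name: example
    interval: 60s
    rules:
      # Collision-prone: project_id is aggregated away, so every project
      # running this rule writes the identical output series.
      # - record: job:http_requests:rate5m
      #   expr: sum by (job) (rate(http_requests_total[5m]))

      # Collision-free: keeping project_id makes each project's output
      # series distinct in Cloud Monitoring.
      - record: job:http_requests:rate5m
        expr: sum by (project_id, job) (rate(http_requests_total[5m]))
```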
Can you confirm whether you used a different combination in each project, or applied the same config to each project?
Also see the sections "Multi-project and global rule evaluation" and "Preserve labels when writing rules": https://cloud.google.com/stackdriver/docs/managed-prometheus/rules-unmanaged#multi-project_and_global_rule_evaluation

tl;dr: you can have one rule evaluator that aggregates across all projects and writes the output to one project, or you can have one rule evaluator per project that preserves the project ID. But having multiple rule evaluators that don't preserve the project ID will almost certainly lead to collisions.
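A rough sketch of the two options, with made-up rule and metric names:

```yaml
# Option 1: one global rule evaluator, reading from a multi-project metrics
# scope and writing results into a single project. Dropping project_id is
# fine here because only one evaluator ever writes this series.
- record: global:http_requests:rate5m
  expr: sum(rate(http_requests_total[5m]))

# Option 2: one rule evaluator per project, each preserving project_id so
# outputs from different projects never collide.
- record: job:http_requests:rate5m
  expr: sum by (project_id, job) (rate(http_requests_total[5m]))
```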
Another thing perhaps worth pointing out: when using the [...], I was receiving this error for recording rules that were only missing [...]. On your point:
Just want to make sure I am on the same page: is this the case when we have a single rule evaluator deployed in each project (i.e. project-a has its own rule evaluator, project-b has its own, etc.)?
Yes, in that case you'll have conflicts, which is where @lyanco's suggestion helps.
Was it the same error message in this case, about the sampling period? If it was, then I wonder if the problem is just that the label has to be set. Maybe @pintohutch would know?
Yeah, the error is the same. Coincidentally, the error popped up again in the past 30 minutes for 5 recording rules. This is the first occurrence since I added the labels last week. The volume is much smaller (5 today, whereas there were 480k over the course of the last 7 days, with the errors stopping on Thursday). There hasn't been any change to the rules since. As per the docs:
Is it sufficient to preserve the labels when aggregating (i.e. sum by (project_id, namespace, location, ...)), or is there a preferred approach, such as setting the labels via the `external_labels` property?
Preserving while aggregating is not just sufficient, it's recommended! :-)
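For example (metric and rule names made up), this keeps the resource-identifying labels intact while still rolling everything else up:

```yaml
- record: job:http_requests:rate5m
  expr: sum by (project_id, location, cluster, namespace, job) (rate(http_requests_total[5m]))
```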
Automatic grouping to avoid resource collisions is one of the things the managed rule evaluator does for you, FWIW. This is just one of those things that has to be done when you move from server-scoped rules (OSS Prometheus) to global-scoped rules (GMP). Having time series in different clusters/namespaces that are identical except for the cluster/namespace/project they are in is very common.
Hi folks. I work on the same team as @jlaberge-league. Thanks for helping us sort through the differences between OSS Prometheus and GMP! :) The key thing we're trying to understand here is why setting the project_id + location labels via the `external_labels` property didn't stop the collisions.
Are the rule evaluators pointing to a metrics scope that fans out to multiple projects? It's possible that they're pulling in data from other projects. Another possibility is that you have identical time series in different namespaces in the same project, so really it's the namespace grouping that's the most important, and it can't be set statically by an external label, since most recording rules aggregate away the uniquely defining resource labels.
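Concretely, with made-up names: if namespaces ns-a and ns-b in one project both expose http_requests_total{job="api"}, the first rule below emits the same output series twice (one write per namespace), while the second keeps them apart:

```yaml
# ns-a and ns-b both produce {job="api"} here -> the two writes collide.
- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))

# Preserving namespace keeps the two output series distinct.
- record: namespace_job:http_requests:rate5m
  expr: sum by (namespace, job) (rate(http_requests_total[5m]))
```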
We're running in GKE, and from the code it looks like by default it restricts the query scope to the project it runs in. This is also what I'm observing in the metrics themselves (they're different by project and their values align with my expectations). Moreover, the issue here is with the export, not the query.
I can confirm that we don't. All the time series are exported from namespaces that are unique per project.
Right, so if you drop "namespace" then it's possible you lose the uniquely distinguishing value of namespace. If you're collecting the same metric in more than one namespace and then aggregate namespace away, the outputs will collide.
We don't collect the same metric in multiple namespaces; as mentioned, the namespaces are unique per project.
Ah, sorry, yes, you're right: you should only see the issue when writing data back, not when reading. And writing data back will only collide if multiple rule evaluators are exporting the same data, not querying the same data. Usually when this error happens it's due to issues on the collection side, so I got confused. As I said, I don't know where the conflicts in your metrics might be. But if you preserve the project and location labels, and ideally the cluster and namespace labels too (although there are situations where you intentionally don't want to do that), then all should work.
Hi, I currently have the self-managed rule-evaluator deployed. We have seen the error shown below for all of our recording rules since our migration to GMP.
This error, multiplied by our number of recording rules, produces several thousand of these errors per hour.
Here is a sample recording rule:
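Something along these lines (the names here are illustrative rather than our actual rule):

```yaml
groups:
  - name: node
    interval: 60s
    rules:
      - record: instance:node_cpu:rate5m
        expr: sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```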
Modifying how we aggregate the metric to include project_id, location, and namespace makes the error go away. However, this means every recording rule we have, and every one we create going forward, will need to be mindful of this and include those labels.
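For instance, the illustrative rule above stops producing the error once rewritten as:

```yaml
- record: instance:node_cpu:rate5m
  expr: sum by (project_id, location, namespace, instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
```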
Our rule-evaluator config is relatively simple:
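It boils down to a rule_files stanza plus globals, roughly like this (paths illustrative):

```yaml
global:
  external_labels: {}   # initially empty; see the experiment described below
rule_files:
  - "/etc/rules/*.yaml"
```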
As well as the args we pass at runtime:
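Roughly the following (values redacted and flag names from memory, so treat this as a sketch and check the rule-evaluator's --help for the authoritative set):

```yaml
args:
  - "--config.file=/etc/config/config.yaml"
  - "--web.listen-address=:9092"
  - "--export.label.project-id=<project>"
  - "--export.label.location=<region>"
  - "--query.project-id=<project>"
```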
I have tried leveraging the `external_labels` property to ensure project_id, location, and namespace are present; however, that made the error more frequent.

Additional context:
If there is any additional information that would be helpful, please let me know.