self-deployed rule-evaluator "One or more TimeSeries could not be written" #1083

Open
jlaberge-league opened this issue Jul 22, 2024 · 13 comments

@jlaberge-league

Hi, I have deployed the self-managed rule-evaluator. Since our migration to GMP, we have seen the error shown below for all of our recording rules.

{
  "jsonPayload": {
    "caller": "export.go:946",
    "size": 5,
    "msg": "send batch",
    "level": "error",
    "ts": "2024-07-18T13:56:45.907003101Z",
    "err": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: prometheus_target{location:xxx,cluster:xxx,instance:,job:,namespace:xxx} timeSeries[0-4]: prometheus.googleapis.com/slo:current_burn_rate:ratio/gauge{owner:xxx,tenant_id:xxx,sloth_window:5m,sloth_service:rest,sloth_id:rest-best-effort-availability,sloth_slo:best-effort-availability}\nerror details: name = Unknown  desc = total_point_count:5  success_point_count:2  errors:{status:{code:9}  point_count:3}"
  },
  "timestamp": "2024-07-18T13:56:45.907648378Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/gmp/scrape": "1m",
  },
}

Multiplied across all of our recording rules, this produces several thousand errors per hour.

Here is a sample recording rule:

groups:
  - name: app_overview
    rules:
      - record: route_rate:app_response_count:sum5m
        expr: sum(rate(app_response_count[5m])) by (route)

Modifying how we aggregate the metric to include project_id, location, and namespace makes the error go away. However, this means every recording rule we have or create going forward needs to remember to include these labels.
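For example, a sketch of the rule above modified so the aggregation preserves those resource labels:

groups:
  - name: app_overview
    rules:
      - record: route_rate:app_response_count:sum5m
        # preserving project_id, location and namespace keeps the output
        # series unique per project/location/namespace
        expr: sum(rate(app_response_count[5m])) by (project_id, location, namespace, route)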

Our rule-evaluator config is relatively simple:

global:
  external_labels: {} # we have a few labels here
  evaluation_interval: 60s
rule_files: []

As well as the args we pass at runtime:

args:
  - "--config.file=/prometheus/config_out/config.yaml"
  - "--web.listen-address=:9092"

I have tried leveraging the external_labels property here to ensure project_id, location, and namespace are present; however, that made the error more frequent.
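Roughly, that attempt looked like this (placeholder values):

global:
  external_labels:
    project_id: example-project # placeholder
    location: us-east4          # placeholder
    namespace: example-ns       # placeholder
  evaluation_interval: 60s
rule_files: []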

Additional context:

  • we deploy a single rule evaluator per project; we have multiple projects, and the error occurs in each
  • these metrics all originate from a single namespace; the rule-evaluator is deployed to a different namespace
  • we are using the managed collector; the error appears to come only from recording rules

If there is any additional information that would be helpful, please let me know.

@TheSpiritXIII
Member

we deploy a single rule evaluator per project, we have multiple projects, the error occurs in each

I believe this statement points to your issue: if you don't have a project label in your recording rule's output, each project is writing to the same time series. Hence you are getting collisions, because to Google Cloud Monitoring it looks like you're writing to the same time series more frequently than once every 60 seconds. For example, if you have 2 projects each writing that metric, it's as if you're sending the time series every 30 seconds (once for each project over your 60-second evaluation interval).
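Schematically, with hypothetical label values: once the distinguishing labels are aggregated away, both evaluators produce an identically-labelled output series, so Cloud Monitoring sees a single series being written twice per interval:

# evaluator in project "apple" exports:
#   route_rate:app_response_count:sum5m{route="/api/users"}
# evaluator in project "pear" exports:
#   route_rate:app_response_count:sum5m{route="/api/users"}
# identical label sets => treated as one time series, written twice within
# the 60s sampling period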

The public documentation points out that adding some of those labels is the correct solution here: https://cloud.google.com/stackdriver/docs/managed-prometheus/troubleshooting#timeseries-collisions

I believe this is working as designed. Before you migrated to GMP, your metrics were likely local to your cluster, so there were no collisions. With GMP, your metrics are global, so this is something you would have to update.

I have tried leveraging the external_labels property here to ensure project_id, location, and namespace are present however that made the error more frequent.

Can you confirm whether you used a different combination in each project, or did you apply the same config to each project?

@lyanco
Collaborator

lyanco commented Jul 22, 2024

Also see the sections "Multi-project and global rule evaluation" and "Preserve labels when writing rules": https://cloud.google.com/stackdriver/docs/managed-prometheus/rules-unmanaged#multi-project_and_global_rule_evaluation

tl;dr you can have one rule evaluator that aggregates across all projects and writes the output to one project, or you can have one rule eval per project that preserves the project ID. But having multiple rule evals that don't preserve project ID will almost certainly lead to collisions.

@jlaberge-league
Author

Can you confirm whether you used a different combination in each project, or did you apply the same config to each project?

When using the external_labels property to set project_id, location, and namespace, the combination was unique per project.

Another thing perhaps worth pointing out: I was also receiving this error for recording rules that were only missing namespace; once I added that to the aggregation, the error went away.

On your point:

each project is writing to the same time-series

Just want to make sure I am on the same page: is this the case when we have a single rule evaluator deployed in each project? (i.e., project-a has its own rule evaluator, project-b has its own, etc.)

@TheSpiritXIII
Member

Just want to make sure I am on the same page, Is this the case when we have a single rule evaluator deployed in each project?

Yes, in that case you'll have conflicts, which is where @lyanco's suggestion helps.

another thing perhaps worth pointing out, I was receiving this error for recording rules that was only missing namespace as well,

Was it the same error message about the sampling period in this case? If it was, then I wonder if the problem is just that the label has to be set. Maybe @pintohutch would know?

@jlaberge-league
Author

Was it the same error message in this case about the sampling period? If it was then I wonder if the problem is just that the label has to be set.

Yeah the error is the same.

Coincidentally, the error popped up again in the past 30 minutes for 5 recording rules. This is the first occurrence since I added the labels last week. The volume is much smaller (5 today, whereas there were 480k over the course of the last 7 days, with the errors stopping on Thursday). There haven't been any changes to the rules since.

As per the docs:

We strongly recommend that you write rules so that the project_id, location, cluster, and namespace labels are preserved appropriately for their aggregation level

Is it sufficient to preserve the labels when aggregating (i.e., sum by (project_id, namespace, location, ...)), or is there a preferred approach, such as setting the labels via the labels property or something else?
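For concreteness, the two shapes I'm comparing look roughly like this (sketches, not our actual rules; only one variant would be used):

groups:
  - name: app_overview
    rules:
      # option A: preserve the resource labels in the aggregation itself
      - record: route_rate:app_response_count:sum5m
        expr: sum by (project_id, location, namespace, route) (rate(app_response_count[5m]))
      # option B: aggregate by route only and attach static values via the
      # labels property (placeholder value shown)
      - record: route_rate:app_response_count:sum5m
        expr: sum by (route) (rate(app_response_count[5m]))
        labels:
          namespace: example-ns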

@lyanco
Collaborator

lyanco commented Jul 23, 2024

Preserving while aggregating is not just sufficient, it's recommended! :-)

@lyanco
Collaborator

lyanco commented Jul 23, 2024

Grouping to avoid resource collisions is one of the things the managed rule evaluator does automatically, FWIW. This is just one of those things that has to be done when you move from server-scoped rules (OSS Prom) to global-scoped rules (GMP). Having time series in different clusters/namespaces that are identical except for the cluster/namespace/project they are in is very common.

@faevourite

Hi folks. I work on the same team as @jlaberge-league. Thanks for helping us sort through the differences between OSS Prometheus and GMP! :)

The key thing we're trying to understand here is why setting the project_id + location labels via the external_labels config isn't working as we expect. Our thinking is that if we have a rule like sum(rate(app_response_count[5m])) by (route) and the rule-evaluator is configured with project_id: apple, location: us-east4, the resulting time series wouldn't collide with another rule-evaluator in a different project configured with project_id: pear, location: us-central1. It should be fine even if the query itself doesn't aggregate by project_id + location, because those are set as external_labels, right? We're still getting errors in this setup, though. Is this a wrong assumption? If so, is there a way to avoid these collisions via another configuration change rather than updating every rule's expression?

@lyanco
Collaborator

lyanco commented Jul 24, 2024

Are the rule evaluators pointing to a metrics scope that fans out to multiple projects? It's possible that they're pulling in data from other projects.

Another possibility is that you have identical time series in different namespaces in the same project, so really it's the namespace grouping that's the most important, and it can't be set statically by an external label. Since most recording rules aggregate away the uniquely defining resource label, instance, having duplicate time series in a project aside from namespace is super common.

@faevourite

Are the rule evaluators pointing to a metrics scope that fans out to multiple projects? It's possible that they're pulling in data from other projects.

We're running in GKE, and from the code it looks like by default it restricts the query scope to the project it runs in. This is also what I'm observing in the metrics themselves (they're different by project and their values align with my expectations). Moreover, the issue here is with the export, not the query.

you have identical time series in different namespaces in the same project

I can confirm that we don't. All the time series are exported from namespaces that are unique per project.

@lyanco
Collaborator

lyanco commented Jul 25, 2024

I can confirm that we don't. All the time series are exported from namespaces that are unique per project.

Right, so if you drop "namespace" then it's possible you lose the uniquely distinguishing value of namespace.

If you're collecting the same job in multiple namespaces, and you drop namespace and instance (instance being fine to drop as it's a rule), then your data is bound to conflict.

@faevourite

We don't collect the same job in multiple namespaces. But even if we were, would that be relevant to this issue? The error is coming from rule-evaluator's export. Even if it was reading duplicated metrics from multiple projects, it's the only rule-evaluator instance running in a project, and the time series it generates are implicitly exported to the same project, no?

@lyanco
Collaborator

lyanco commented Jul 26, 2024

Ah sorry, yes, you're right, you should only see the issue when writing data back, not when reading. And writing data back will only collide if multiple rule evaluators are exporting the same data, not querying the same data. Sorry, usually when this error happens it's due to issues on the collection side, so I got confused.

As I said, I don't know where the conflicts in your metrics might be. But if you preserve the project and location labels, and ideally the cluster and namespace labels too (although there are situations where you intentionally don't want to do that), then all should work.
