[Stack Monitoring] PoC for kibana instrumentation using opentelemetry metrics sdk #128755

Closed
5 tasks done
matschaffer opened this issue Mar 29, 2022 · 14 comments
Labels: Feature:Stack Monitoring, Platform Observability, Team:Infra Monitoring UI - DEPRECATED

Comments

@matschaffer
Contributor

matschaffer commented Mar 29, 2022

We discussed a number of possible implementations for ongoing kibana instrumentations in (internal) https://github.com/elastic/observability-dev/issues/2054

In this issue we'll build a proof of concept for how that might work.

Here are the two options we'd like to PoC. They should be very similar at the code level; the main difference is the collection mechanism (pull from metricbeat vs. push to apm-server).

option 2: OpenTelemetry Metrics API prometheus endpoint with Elastic Agent prometheus input

Here we use the official otel metrics sdk and expose that via prometheus protocol for elastic-agent to poll via the underlying metricbeat prometheus module.

graph LR

subgraph ElasticDeployment["Elastic Deployment"]
  subgraph kibana
    OtelMetricsSDK["Otel Metrics SDK"]
    OtelMetricsPrometheusExporter["/metrics (prometheus-protocol)"]
    OtelMetricsSDK-->OtelMetricsPrometheusExporter

    click OtelMetricsSDK "https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#metrics"
  end

  subgraph elastic-agent
    Metricbeat["metrics/prometheus"]
  end

  Metricbeat-->|"poll (prometheus protocol)"|OtelMetricsPrometheusExporter
  Metricbeat-->|_bulk|elasticsearch
end
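For a sense of what the metricbeat prometheus module would poll in option 2, here is a minimal sketch of rendering counters in the Prometheus text exposition format. It does not use the otel SDK or its Prometheus exporter; `CounterSample`, `renderPrometheus`, and the metric names passed to it are hypothetical and exist only for illustration.

```typescript
// Hypothetical stand-in for one counter data point; not an otel SDK type.
interface CounterSample {
  metric: string;
  help: string;
  labels: Record<string, string>;
  value: number;
}

// Escape label values per the Prometheus text format (backslash, quote, newline).
function escapeLabelValue(raw: string): string {
  return raw.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render samples as Prometheus text exposition.
function renderPrometheus(samples: CounterSample[]): string {
  // Group samples by metric name so each metric's lines stay contiguous.
  const order: string[] = [];
  const groups: Record<string, CounterSample[]> = {};
  for (const s of samples) {
    if (!Object.prototype.hasOwnProperty.call(groups, s.metric)) {
      groups[s.metric] = [];
      order.push(s.metric);
    }
    groups[s.metric].push(s);
  }

  const lines: string[] = [];
  for (const metric of order) {
    // HELP and TYPE are emitted once per metric, before its samples.
    lines.push(`# HELP ${metric} ${groups[metric][0].help}`);
    lines.push(`# TYPE ${metric} counter`);
    for (const s of groups[metric]) {
      const labelText = Object.keys(s.labels)
        .map((key) => `${key}="${escapeLabelValue(s.labels[key])}"`)
        .join(",");
      lines.push(labelText ? `${metric}{${labelText}} ${s.value}` : `${metric} ${s.value}`);
    }
  }
  return lines.join("\n") + "\n";
}
```

In the PoC itself this rendering would come from the SDK's exporter rather than hand-written code; the sketch only shows the shape of the payload that gets polled.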

option 3: OpenTelemetry Metrics API exported as OpenTelemetry Protocol

Here we use the official otel metrics sdk and push that via OpenTelemetry Protocol. OpenTelemetry Protocol is natively supported by Elastic APM so we use that to receive the data. There are some caveats for otel collection, but none of them should hinder the collection of platform observability metrics today.

Ideally this apm-server is managed by elastic-agent, but that work is still TBD. See 2022-01 - Elastic Agent Pipeline Runtime Environment for latest info.

graph LR

subgraph ElasticDeployment["Elastic Deployment"]
  subgraph kibana
    OtelMetricsSDK["Otel Metrics SDK"]
  end

  subgraph elastic-agent
    APMServer["apm-server"]
  end

  OtelMetricsSDK-->|"push (OTLP)"|APMServer["apm-server"]
  APMServer-->|_bulk|elasticsearch
end
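To illustrate why the two options should look nearly identical at the code level, here is a rough sketch in which the same in-process store serves both a pull handler and a push call. `MetricStore`, `handleScrape`, and `pushOnce` are hypothetical names, and `send` stands in for an OTLP export so the example needs no network or SDK.

```typescript
// Hypothetical in-process store shared by both collection styles.
class MetricStore {
  private totals: Record<string, number> = {};

  add(metric: string, amount: number): void {
    this.totals[metric] = (this.totals[metric] || 0) + amount;
  }

  // Copy of the current totals, so callers cannot mutate internal state.
  snapshot(): Record<string, number> {
    const copy: Record<string, number> = {};
    for (const key of Object.keys(this.totals)) {
      copy[key] = this.totals[key];
    }
    return copy;
  }
}

// Pull (option 2): the collector asks, the application answers.
function handleScrape(store: MetricStore): Record<string, number> {
  return store.snapshot();
}

// Push (option 3): the application decides when to send.
// `send` is injected and stands in for an OTLP export call.
function pushOnce(
  store: MetricStore,
  send: (payload: Record<string, number>) => void
): void {
  send(store.snapshot());
}
```

The instrumentation side (`add`) is untouched by the choice; only the code that moves the snapshot out of the process differs.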

Some consumers to keep in mind (see internal companion issue):

  • Stack Monitoring
  • High Level Health API
  • APM instrumentation of stack
  • Telemetry (Event based telemetry) - could maybe leave this as its own entity, the above are more critical to align

Steps

AC: Recording of PoC as walkthrough

@matschaffer added the Team:Infra Monitoring UI - DEPRECATED and Feature:Stack Monitoring labels Mar 29, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@matschaffer
Contributor Author

This should go in @elastic/infra-monitoring-ui cycle 9 once it gets created.

@matschaffer
Contributor Author

Reposting an early diagram from @chrisronline on how the kibana internal API might look.

Screen Shot 2022-04-12 at 9 24 43 AM

@chrisronline
Contributor

chrisronline commented Apr 14, 2022

Love this effort!

I'll just add some thoughts about the parts in the diagram above.

I don't have a strong opinion on the options listed in this issue, but I do want to stress the desire to make writes and reads as easy as possible for Kibana plugin owners. In an ideal world, they can directly use some OpenTelemetry SDK to write metrics (in my example above, I abstracted this detail away by adding write APIs to the monitoring_collection plugin while keeping the standardized terminology), and then have some easy way to read the metrics back and show them inside their UIs. Keep in mind that the data could live on a separate cluster, and plugin owners need to know this in order to read it back.

The other part of this that I think is important to mention is how the Stack Monitoring plugin evolves as a result of this - IMO, it should turn into a pure read plugin that subscribes to the same read APIs that other Kibana plugins do. It still has a significant purpose because it is the place where users will see metrics at a bird's-eye view, which is very helpful in understanding correlation of problems.

I know these things are probably in everyone's mind around this effort, but I don't see it mentioned explicitly, so I want to ensure we have a plan for this too.

@cyrille-leclerc
Contributor

have some easy way to read the metrics back and show them inside of their UIs - keep in mind that the location of the data could be on a separate cluster and plugin owners do need to know this in order to read the data back.

Did you consider an abstraction so that plugin authors would just have a read API on observability data and the location of these data (local versus remote Elasticsearch) would be injected by the "Platform Observability" configuration?

@chrisronline
Contributor

Did you consider an abstraction so that plugin authors would just have a read API on observability data and the location of these data (local versus remote Elasticsearch) would be injected by the "Platform Observability" configuration?

Exactly what I think we should do - in my model above, that's the other purpose of monitoring_collection. It serves as a write abstraction (we could remove this if folks aren't a fan) and a read abstraction, allowing for a single point of configuration.

Now, how that configuration gets there is another story. Following the stack monitoring path, we'd just need to document the need to configure it appropriately, but maybe there is something fancy that Elastic Agent can do here - I'm not well versed in that area.
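One way the read abstraction discussed here could look, as a rough sketch only: plugin authors call a single `read` method, and whether the data is local or on a remote cluster is decided by injected configuration. Every name below (`ObservabilityReadConfig`, `remoteClusterUrl`, `createMetricsReader`, the `metrics-*` index pattern) is hypothetical and not taken from monitoring_collection. The search function is synchronous purely to keep the example small; a real client would be asynchronous.

```typescript
// Hypothetical configuration; the real monitoring_collection settings may differ.
interface ObservabilityReadConfig {
  // When set, metrics are assumed to live on a separate monitoring cluster.
  remoteClusterUrl?: string;
}

// Minimal search signature so the sketch does not depend on an Elasticsearch client.
type SearchFn = (index: string, metric: string) => number[];

// The only surface plugin authors would see.
interface MetricsReader {
  read(metric: string): number[];
}

// Assumed index pattern, for illustration only.
const HYPOTHETICAL_INDEX = "metrics-*";

function createMetricsReader(
  config: ObservabilityReadConfig,
  clients: { local: SearchFn; remote: (url: string) => SearchFn }
): MetricsReader {
  // The data location is resolved once, from configuration, not by each plugin.
  const search: SearchFn = config.remoteClusterUrl
    ? clients.remote(config.remoteClusterUrl)
    : clients.local;
  return {
    read: (metric: string) => search(HYPOTHETICAL_INDEX, metric),
  };
}
```

With this shape, a plugin that reads metrics never branches on cluster location; changing where the data lives is a configuration change in one place.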

@matschaffer self-assigned this May 30, 2022
@matschaffer
Contributor Author

For approach, I'm planning to try to replicate #123726 for a good metric comparison. If there's anything in that new response ops work that we can't do with the otel metricspace, we should highlight it as early as possible.

@matschaffer
Contributor Author

Noting that open-telemetry/opentelemetry-js#2929 is merged, so we may be able to use a >0.27 version here. That was the PR blocking grpc support in the 0.28 release. Current as of writing is 0.29.

@matschaffer
Contributor Author

matschaffer commented May 31, 2022

So I got some data coming from something alongside:

this.inMemoryMetrics.increment(IN_MEMORY_METRICS.RULE_EXECUTIONS);

Issues so far:

  • counter isn't incrementing, I'm just getting a ton of "1"s reported. Thinking I need to move counter initialization higher into the plugin initialization.
  • I have this.metrics.ruleExecutions.add(1, { rule: this.ruleType.id }); set, but it's not coming through as a label. I'll try my demo app to make sure this isn't a bug in apm-server 8.2.2

Screen Shot 2022-05-31 at 14 18 59

@matschaffer
Contributor Author

Yeah, definitely need to move metric creation up. I put it in the TaskRunner constructor but looks like that probably gets created once for each rule evaluation.
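A simplified model of the symptom, under the assumption that each newly created counter starts from zero: if the counter lives in an object that is rebuilt for every rule evaluation, every report is "1". `SketchCounter` and both runner classes are hypothetical stand-ins, not the otel SDK or the actual Kibana TaskRunner.

```typescript
// Simplified stand-in for a counter instrument; NOT the otel SDK class.
// Assumption: each instance starts from zero.
class SketchCounter {
  private total = 0;

  add(amount: number): number {
    this.total += amount;
    return this.total;
  }
}

// Problem case: the counter is created per runner, so every run reports 1.
class TaskRunnerPerInstance {
  private readonly ruleExecutions = new SketchCounter();

  run(): number {
    return this.ruleExecutions.add(1);
  }
}

// Fix: create the counter once (for example during plugin setup) and pass it in.
class TaskRunnerShared {
  private readonly ruleExecutions: SketchCounter;

  constructor(ruleExecutions: SketchCounter) {
    this.ruleExecutions = ruleExecutions;
  }

  run(): number {
    return this.ruleExecutions.add(1);
  }
}

// Three rule evaluations, each constructing a fresh runner.
const perInstance = [1, 2, 3].map(() => new TaskRunnerPerInstance().run());

const sharedCounter = new SketchCounter();
const hoisted = [1, 2, 3].map(() => new TaskRunnerShared(sharedCounter).run());

console.log(perInstance); // [1, 1, 1]
console.log(hoisted); // [1, 2, 3]
```

Passing the counter in through the constructor keeps the runner cheap to create while the instrument itself lives as long as the plugin.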

@matschaffer
Contributor Author

Winning! I'll open up an initial PR so folks can play a little.

Screen Shot 2022-05-31 at 15 28 30

@matschaffer
Contributor Author

matschaffer commented May 31, 2022

Doc counts still seem really high. Not sure what's up with that.

Update: apm-server delivers once a minute with event.ingested reflecting the otel interval. The above screenshots are by @timestamp, so at least 6 docs per counter per minute.
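The figure of 6 docs per counter per minute is consistent with one document per export at a 10-second export interval. The interval is not stated in this thread, so the 10 seconds is an assumption; `docsPerMinute` is a hypothetical helper that just spells out the arithmetic.

```typescript
// Documents per counter per minute, assuming one document per SDK export.
function docsPerMinute(exportIntervalSeconds: number): number {
  if (exportIntervalSeconds <= 0) {
    throw new Error("exportIntervalSeconds must be positive");
  }
  return Math.floor(60 / exportIntervalSeconds);
}

// Assumed 10-second export interval (not confirmed in the thread).
console.log(docsPerMinute(10)); // 6
```

If each distinct label combination is stored as its own document, the real count would be a multiple of this, which would fit the "at least" in the comment above.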

@matschaffer
Contributor Author

matschaffer commented Jun 14, 2022

Success!

This is option 3 running in ESS by adding this to the kibana configuration:

monitoring_collection.opentelemetry.metrics:
  otlp:
    url: "https://MY-MONITORING-CLUSTER.apm.us-west2.gcp.elastic-cloud.com"
    headers:
      Authorization: "Bearer REDACTED"
  prometheus.enabled: true
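
Note that this configuration turns on both collection paths at once. As a rough sketch of how a plugin might decide which exporters to start from such settings (the PoC PR may well do this differently), assuming a hypothetical `OtelMetricsConfig` shape that mirrors the YAML above:

```typescript
// Hypothetical shape mirroring the YAML above; the real Kibana config schema may differ.
interface OtelMetricsConfig {
  otlp?: { url?: string; headers?: Record<string, string> };
  prometheus?: { enabled?: boolean };
}

type ExporterKind = "otlp" | "prometheus";

// Decide which exporters to start; both can be active at the same time.
function enabledExporters(config: OtelMetricsConfig): ExporterKind[] {
  const kinds: ExporterKind[] = [];
  // Assumption: OTLP push is enabled whenever a URL is configured.
  if (config.otlp && config.otlp.url) {
    kinds.push("otlp");
  }
  // Assumption: the prometheus endpoint is opt-in.
  if (config.prometheus && config.prometheus.enabled === true) {
    kinds.push("prometheus");
  }
  return kinds;
}
```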

Screen Shot 2022-06-14 at 16 18 15

The prometheus endpoint is active too:

Screen Shot 2022-06-14 at 16 20 27

I'm trying to see if I can get the ESS-included agent polling it, but not sure if that's possible. Might have to attach a self-managed agent.

@matschaffer
Contributor Author

We have a demo & notes posted internally (https://drive.google.com/file/d/1uAOvX9IXi5Y3D2QhrMu2pMm8yplxXxbn/view?usp=sharing) which I think meets the acceptance criteria for this issue.

The PoC PR is still open and I'll open new issues to work toward merging it as the conversation evolves.

@matschaffer added the Platform Observability label Jun 16, 2022
4 participants