Observability: Envoy Proxy #699

Closed
arkodg opened this issue Nov 4, 2022 · 17 comments
Labels
area/observability Observability related issues documentation Improvements or additions to documentation kind/enhancement New feature or request

Comments

@arkodg
Contributor

arkodg commented Nov 4, 2022

Description:
This epic tracks the design and implementation for surfacing the logs, metrics, and traces generated by Envoy Proxy that are relevant to the user personas.

@arkodg arkodg added the kind/enhancement New feature or request label Nov 4, 2022
@arkodg arkodg added this to the Backlog milestone Nov 4, 2022
@arkodg arkodg added the help wanted Extra attention is needed label Nov 4, 2022
@arkodg
Contributor Author

arkodg commented Nov 4, 2022

related PR for metrics #581

@arkodg
Contributor Author

arkodg commented Nov 4, 2022

related PR for traces #697

@arkodg arkodg added the priority/low Label used to express the "low" priority level label Nov 4, 2022
@zirain
Member

zirain commented Nov 5, 2022

best practices from Prometheus: https://prometheus.io/docs/practices/naming/#metric-and-label-naming

@zirain
Member

zirain commented Nov 5, 2022

accesslog: I think we should focus on File (sending to /dev/stdout) and OpenTelemetry (I believe it's the future)
traces: supporting only OpenTelemetry is an option

@Xunzhuo Xunzhuo added area/observability Observability related issues and removed help wanted Extra attention is needed labels Nov 5, 2022
@Xunzhuo Xunzhuo self-assigned this Nov 5, 2022
@github-actions

github-actions bot commented Dec 5, 2022

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 5, 2022
@github-actions

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

@Xunzhuo Xunzhuo reopened this Dec 12, 2022
@Xunzhuo Xunzhuo removed the stale label Dec 12, 2022
@github-actions

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jan 11, 2023
@danehans
Contributor

@Xunzhuo unless you're actively working on this issue, can you unassign yourself and add the help-wanted label?

@danehans danehans removed the stale label Jan 11, 2023
@Xunzhuo Xunzhuo removed their assignment Jan 11, 2023
@Xunzhuo Xunzhuo added the help wanted Extra attention is needed label Jan 11, 2023
@arkodg
Contributor Author

arkodg commented Apr 12, 2023

Should we configure Envoy to send OTLP to an otel-collector, with the user configuring the EG API to specify the otel-collector endpoint (the user, or the docs, brings their own otel-collector)?
There seems to be support for access logs and traces, and I see a PR for metrics support?

@arkodg
Contributor Author

arkodg commented Apr 12, 2023

Are there any existing Grafana dashboards we can reuse to showcase metrics?

@zirain
Member

zirain commented Apr 29, 2023

I made two proposals before:

  1. via EnvoyProxy: "proposal: support listener access log" #697
  2. via PolicyAttachment: "AccessLoggingPolicy Design" #1121

So I think this time we just add the configuration in EnvoyGateway?

cc @arkodg @AliceProxy @kflynn @Xunzhuo

@arkodg
Contributor Author

arkodg commented Apr 30, 2023

@zirain I vote for defining Data Plane Observability/Telemetry within the EnvoyProxy resource. Your PRs should gain more traction and reviews in the coming weeks, now that Observability is the theme for the next release.
I'd also request a design doc highlighting the why, what, and how.
Tackling one signal at a time (access logging first, then the rest) should also speed up the review process.

@zirain
Member

zirain commented May 4, 2023

ok, I will move forward with #697

@arkodg
Contributor Author

arkodg commented Jul 26, 2023

Let's keep this issue open to track completion of the docs.
Moving this to v0.5.

@LanceEa
Contributor

LanceEa commented Jul 31, 2023

@arkodg @zirain - I was testing some of this out. I just wanted to record a couple of findings that might help with documentation.

First, the OTEL Stats Sink didn't land until Envoy 1.27, so testing EG against an older version of Envoy will fail. This is kind of obvious, and in line with the EG policy, but I'm mentioning it in case anyone else has pinned an older version of Envoy and expected it to work. Might be a good callout in the docs, thoughts? Happy to throw up a docs PR.
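
For anyone pinning an Envoy image, the version constraint above could be checked up front with something like the sketch below. It is only an illustration: `supports_otel_stats_sink` is a hypothetical helper, not part of Envoy Gateway, and it assumes the version string looks like `1.27.0` or `v1.27.0` (other tag formats are treated as unknown).

```python
import re

# Minimum Envoy version that ships the OTEL Stats Sink, per the comment above.
MIN_OTEL_STATS_SINK = (1, 27)


def supports_otel_stats_sink(version):
    """Return True/False for a parseable version, or None if it cannot be parsed.

    Hypothetical helper: assumes "major.minor[.patch]" with an optional "v" prefix.
    """
    match = re.match(r"^v?(\d+)\.(\d+)(?:\.\d+)?", version.strip())
    if match is None:
        return None  # unknown format; the caller decides how to handle it
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= MIN_OTEL_STATS_SINK


if __name__ == "__main__":
    for tag in ("v1.26.4", "1.27.0", "latest"):
        print(tag, supports_otel_stats_sink(tag))
```

Returning `None` for an unparseable tag keeps the check from guessing about images such as `latest`.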

Second, I'm testing metrics.prometheus using the Prometheus Operator and ran into a few road bumps. By default, no Service exposes port 19001, so a ServiceMonitor will not work. I tried a PodMonitor to scrape the EnvoyProxy pods directly, but that failed because the Envoy Proxy Deployment doesn't expose port 19001 either (only the gateway.listener ports).

I ended up adding my own Service that exposes port 19001 to the cluster, pointed a ServiceMonitor at it, and it started to work.
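
A rough sketch of that workaround, written as a small Python script that emits the two manifests as JSON (which `kubectl apply -f -` accepts). Everything except port 19001 is an assumption made for illustration: the resource name, the namespace, the pod selector labels, and the scrape path are hypothetical placeholders that would need to match the actual Envoy Proxy pods in a given cluster.

```python
import json

METRICS_PORT = 19001  # the port mentioned above; everything else is a placeholder


def metrics_service(name, namespace, pod_labels):
    """ClusterIP Service exposing the metrics port of the selected pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace, "labels": {"app": name}},
        "spec": {
            "type": "ClusterIP",
            "selector": pod_labels,
            "ports": [
                {"name": "metrics", "port": METRICS_PORT, "targetPort": METRICS_PORT}
            ],
        },
    }


def service_monitor(name, namespace, path):
    """Prometheus Operator ServiceMonitor that scrapes the Service above."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "selector": {"matchLabels": {"app": name}},
            "endpoints": [{"port": "metrics", "path": path}],
        },
    }


# Hypothetical values: adjust the labels and path to the real deployment.
svc = metrics_service(
    "envoy-metrics", "envoy-gateway-system", {"app.kubernetes.io/name": "envoy"}
)
mon = service_monitor("envoy-metrics", "envoy-gateway-system", "/stats/prometheus")

manifest_list = {"apiVersion": "v1", "kind": "List", "items": [svc, mon]}
print(json.dumps(manifest_list, indent=2))
```

The ServiceMonitor selects the Service by the `app` label that the script puts on it, and refers to the port by its name (`metrics`), which is how ServiceMonitor endpoints address Service ports.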

So, I guess my questions are:

  1. Am I completely missing something, or is there a document already outlining this 😄 (totally possible)?
  2. Do we want to consider this a bug and address it before 0.5.0 lands? I would think that the InfraManager should inject the port into the Deployment and probably expose a second Service of type ClusterIP.
  3. If we do not have time to address it for 0.5.0 and there are no docs yet, can we add a little snippet about this temporary workaround?

cc. @AliceProxy

@arkodg
Contributor Author

arkodg commented Jul 31, 2023

Thanks for raising these issues @LanceEa. My first reaction is that I'd consider resolving the above questions to be release blockers for v0.5.0, deferring to @zirain, who might have some WIP stashed.

@arkodg arkodg removed priority/low Label used to express the "low" priority level help wanted Extra attention is needed labels Jul 31, 2023
@kflynn
Contributor

kflynn commented Jul 31, 2023

FTR, I concur that the port issue should be a 0.5.0 blocker. I think the Envoy-prior-to-1.27 thing is worth documenting, but not a blocker.

6 participants