
feat: add telemetry #298

Merged

polvalente merged 11 commits from pv-feat/add-telemetry into master on Jun 8, 2023
Conversation


@polvalente polvalente commented Jan 29, 2023

closes #229
related to #299

  • Add Server telemetry
    • publish minimal event set
    • tests
  • Add Client telemetry
    • publish minimal event set
    • tests

"minimal event set" here means the set of events needed to publish the same Prometheus data that
is published by :grpc_prometheus, which is being deprecated due to incompatibilities with Elixir 1.14
and in favor of adding the more portable :telemetry events.
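For reference, consuming these events is the standard :telemetry flow. A minimal sketch of a handler follows; the event name [:grpc, :server, :rpc, :stop] and the measurement/metadata shapes are assumptions illustrating the idea, not necessarily the exact names this PR defines:

```elixir
# Hedged sketch: attach a logging handler to an assumed server event.
:telemetry.attach(
  "log-grpc-server-rpcs",
  [:grpc, :server, :rpc, :stop],  # assumed event name
  fn _event, %{duration: duration}, metadata, _config ->
    # durations are emitted in native time units by convention
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("handled RPC in #{ms}ms (metadata: #{inspect(metadata)})")
  end,
  nil
)
```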

It's worth noting that we aren't measuring exactly the same timings, because the previous approach depended on where the interceptors were placed in the pipeline.

Users can still mimic that by placing their own interceptor with a telemetry event at the desired point in the pipeline, following the same ideas this PR introduces (see the sketch below). However, for better documentation and clearer expectations, we're not providing optional interceptors; instead, these events are always published.
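For illustration, such a user-defined interceptor could look like the following. The GRPC.ServerInterceptor behaviour and call/4 shape follow the library's existing interceptor convention; the module and event name are hypothetical:

```elixir
defmodule MyApp.TimingInterceptor do
  # Hypothetical interceptor emitting its own telemetry event at a chosen
  # point in the pipeline, mimicking the old placement-dependent
  # :grpc_prometheus measurements.
  @behaviour GRPC.ServerInterceptor

  @impl true
  def init(opts), do: opts

  @impl true
  def call(req, stream, next, _opts) do
    start = System.monotonic_time()
    result = next.(req, stream)

    :telemetry.execute(
      [:my_app, :grpc, :interceptor, :stop],  # hypothetical event name
      %{duration: System.monotonic_time() - start},
      %{stream: stream}
    )

    result
  end
end
```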

@polvalente polvalente self-assigned this Jan 29, 2023
@polvalente

Current results from the telemetry+prometheus companion livebook:

```
# HELP grpc_server_handled_latency_seconds Histogram of response latency of rpcs handled by the server, in seconds.
# TYPE grpc_server_handled_latency_seconds histogram
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.7203967910000002
grpc_server_handled_latency_seconds_count{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.71165725
grpc_server_handled_latency_seconds_count{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.005"} 1
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.01"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.025"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.05"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.1"} 2
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.25"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="0.5"} 4
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="1"} 5
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="2.5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="5"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="10"} 6
grpc_server_handled_latency_seconds_bucket{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary",le="+Inf"} 6
grpc_server_handled_latency_seconds_sum{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 1.713500874
grpc_server_handled_latency_seconds_count{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Unknown",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_total{grpc_code="OK",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
grpc_server_handled_total{grpc_code="NotFound",grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 6
# HELP grpc_server_started_total Total number of RPCs started on the server
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="SayHello",grpc_service="helloworld.Greeter",grpc_type="unary"} 18
```

Some validation from users would be nice, but it looks like we have most, if not full, backward compatibility, aside from the unsupported summary metric.
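For anyone wanting to reproduce the output above, here is a hedged sketch using the Telemetry.Metrics and TelemetryMetricsPrometheus libraries (not part of this PR); the event name and the metadata keys in tag_values are assumptions, so check GRPC.Telemetry for the real ones:

```elixir
metrics = [
  Telemetry.Metrics.distribution(
    "grpc_server_handled_latency_seconds",
    event_name: [:grpc, :server, :rpc, :stop],  # assumed event name
    measurement: :duration,
    unit: {:native, :second},
    tags: [:grpc_service, :grpc_method, :grpc_type, :grpc_code],
    # map (assumed) event metadata keys to the Prometheus labels above
    tag_values: fn metadata ->
      %{
        grpc_service: metadata[:service],
        grpc_method: metadata[:method],
        grpc_type: metadata[:type],
        grpc_code: metadata[:code]
      }
    end,
    # same buckets as the histogram output above
    reporter_options: [buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]]
  )
]

# serve /metrics from the application's supervision tree
Supervisor.start_link(
  [{TelemetryMetricsPrometheus, metrics: metrics, port: 9568}],
  strategy: :one_for_one
)
```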

@polvalente

To minimize the size of this PR, I'll add client events to the livebook in a follow-up PR :)

@polvalente polvalente marked this pull request as ready for review February 4, 2023 08:58
@polvalente polvalente changed the title from "[WIP] feat: add telemetry" to "feat: add telemetry" Feb 4, 2023
@polvalente

polvalente commented Feb 4, 2023

@beligante @wingyplus reviews are welcome (for anyone else reading this as well!) :)

@beligante

@polvalente Great job! Code LGTM, though I have some considerations.

You may already know this behavior, but server and client code behave differently for streams.

Server

For server_stream and bidirectional_stream on the server side, the apply/3 call that runs the handler needs to stay blocked until the end of the server stream. So, for long-running streams, the measured duration will be a significant number, which is not the case for the client.

Note: the function needs to stay blocked until the stream has ended because, with the current behavior, if the apply function returns, the connection with the client is abruptly closed, i.e. no proper messages will be exchanged.

Client

For clients, the behavior for streams is different from the server in every case. I mentioned above that the server needs blocking code; clients, on the other hand, have non-blocking/async behavior. Dealing with streams on the client side is transparent because the interactions happen asynchronously inside the adapters or are wrapped inside an Elixir Stream.


I want to ensure this behavior is okay with you; otherwise, we could add a to-do to improve it.
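To make the server-side point concrete, here is a rough sketch of a streaming handler: the function itself drives the stream and only returns when it's done, so any span around it covers the whole stream lifetime. The Helloworld modules are illustrative:

```elixir
defmodule Helloworld.Greeter.Server do
  use GRPC.Server, service: Helloworld.Greeter.Service

  # server_stream: this function stays blocked until the stream is
  # finished; returning earlier would close the connection abruptly.
  def say_hello_stream(request, stream) do
    for i <- 1..5 do
      Process.sleep(1_000)  # long-running stream => large measured duration
      GRPC.Server.send_reply(stream, %Helloworld.HelloReply{message: "#{request.name} #{i}"})
    end
  end
end
```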

lib/grpc/stub.ex Outdated
```
)

# re-raise accordingly
case kind do
```
Contributor:

Nice!

@polvalente

@beligante regarding the duration for clients, I think one could aggregate events based on the stream metadata to provide a meaningful number. I need to check what the current behavior with :grpc_prometheus is. If it's the same as the one in this PR, we can move forward as-is; otherwise, we can rework things accordingly.
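A hypothetical sketch of that aggregation idea, correlating start/stop events by the stream carried in the metadata; the event names and the :stream metadata key are assumptions:

```elixir
defmodule MyApp.ClientStreamDuration do
  def attach do
    :ets.new(:grpc_client_spans, [:named_table, :public])

    :telemetry.attach_many(
      "client-stream-duration",
      [[:grpc, :client, :rpc, :start], [:grpc, :client, :rpc, :stop]],  # assumed names
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:grpc, :client, :rpc, :start], _measurements, %{stream: stream}, _config) do
    :ets.insert(:grpc_client_spans, {stream, System.monotonic_time()})
  end

  def handle_event([:grpc, :client, :rpc, :stop], _measurements, %{stream: stream}, _config) do
    case :ets.take(:grpc_client_spans, stream) do
      [{_, t0}] ->
        ms = System.convert_time_unit(System.monotonic_time() - t0, :native, :millisecond)
        IO.puts("client stream took #{ms}ms")

      [] ->
        :ok
    end
  end
end
```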

```
duration = System.monotonic_time() - t0

:ok =
  GRPC.Telemetry.server_rpc_exception(
```
Contributor:

Isn't :telemetry.span suitable for this case?

Contributor Author:

:telemetry.span doesn't treat rescue separately

Contributor Author:

Actually, I tested locally that a raise can be caught as an :error, so it seems that it's indeed equivalent, though I have to check whether the only metadata that changes is what span handles :) I'll keep this unresolved until I confirm it.

Contributor Author:

There were some changes needed anyway, but I went with span because it decouples the code a bit and leaves basically just the formatting and error-case handling to us.
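A small standalone check of the point above, showing that :telemetry.span reports a raise as kind :error in its :exception event and then re-raises; the event names here are illustrative only:

```elixir
:telemetry.attach(
  "span-exception-demo",
  [:demo, :rpc, :exception],
  fn _event, %{duration: duration}, %{kind: kind, reason: reason}, _config ->
    IO.inspect({kind, reason, duration}, label: "caught by span")
  end,
  nil
)

try do
  :telemetry.span([:demo, :rpc], %{}, fn ->
    raise ArgumentError, "boom"
  end)
rescue
  # span re-raises after emitting the :exception event
  ArgumentError -> :ok
end
# prints: caught by span: {:error, %ArgumentError{message: "boom"}, <native units>}
```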

@@ -0,0 +1,282 @@
# Telemetry
Contributor:

Wow. We can now run a gRPC server inside Livebook?

Contributor:

Yup! You need to have the rights to bind the port you'll run the server on.

Contributor Author:

:)

I believe you can, in theory, run anything inside Livebook if you try hard enough haha

In this case it was luckily not hard at all, so the livebook both shows how to use telemetry and serves as a manual validation file for developing this PR.
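For the curious, the gist of running it in Livebook is just Mix.install plus starting the server supervisor in a cell. A rough sketch; the exact deps and supervisor options differ in the livebook shipped with this PR, and the endpoint module is assumed to be defined in earlier cells:

```elixir
Mix.install([
  {:grpc, github: "elixir-grpc/grpc"},
  {:protobuf, "~> 0.11"}
])

# Start the server on a port the Livebook process is allowed to bind
# (the supervisor option shape depends on the grpc version in use):
Supervisor.start_link(
  [{GRPC.Server.Supervisor, endpoint: Helloworld.Endpoint, port: 50051, start_server: true}],
  strategy: :one_for_one
)
```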


@beligante

> @beligante regarding the duration for clients, I think one could aggregate events based on the stream metadata to provide a meaningful number. I need to check what the current behavior with :grpc_prometheus is. If it's the same as the one in this PR, we can move forward as-is; otherwise, we can rework things accordingly.

Got it! Makes sense!

@polvalente

Merging this, with #315 left as a pending follow-up.

@polvalente polvalente merged commit 80a0858 into master Jun 8, 2023
@polvalente polvalente deleted the pv-feat/add-telemetry branch June 8, 2023 07:17