Implement pool telemetry #248
Conversation
Thank you so much for this contribution!
Running the test suite locally on main also requires your patch for me, so that is a welcome addition.
I have added some comments; please let me know what you think.
I agree with all suggestions! I'm currently on a business trip, so I probably won't be able to work on this until Friday. Regarding the averages and maxes, I think the potential margin of error is very low when we consider a large number of samples (which should be most cases). The problem is taking the measurements without enough samples, where this margin of error may be more relevant. One possible workaround would be to only serve these measurements once there are enough samples, say 100. In that scenario, even if the base metrics are slightly out of sync, the real impact on the average/max measurements should not be relevant. Anyway, I think it's better to move on, as you suggested, with only available and in-use connections, and we can think more about this problem later. Thanks for the feedback! (:
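The minimum-sample idea could be sketched roughly like this (the module name, threshold, and function are hypothetical illustrations, not code from the PR):

```elixir
defmodule PoolMetricsSketch do
  # Hypothetical sketch: only derive average metrics once enough samples
  # have accumulated, so slightly out-of-sync base counters read over a
  # small sample count cannot produce misleading averages.
  @min_samples 100

  # total_time: accumulated wait time; sample_count: number of samples taken.
  def maybe_average(total_time, sample_count) when sample_count >= @min_samples do
    {:ok, div(total_time, sample_count)}
  end

  def maybe_average(_total_time, _sample_count), do: {:error, :not_enough_samples}
end
```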
pool_timeout = Keyword.get(opts, :pool_timeout, 5_000)
receive_timeout = Keyword.get(opts, :receive_timeout, 15_000)

metadata = %{request: req, pool: pool}
Since we're already passing the Finch name in, could we add it to the metadata so it's available in telemetry? It would be really useful for monitoring wait time per Finch instance.
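A minimal sketch of that suggestion, building on the metadata map from the diff above (the `:name` key and the module are illustrative, not existing Finch code):

```elixir
# Hypothetical sketch: include the Finch instance name in the telemetry
# metadata so handlers can break wait-time measurements down per instance.
defmodule TelemetryMetadataSketch do
  def build(req, pool, finch_name) do
    %{request: req, pool: pool, name: finch_name}
  end
end
```

A telemetry handler could then group wait-time measurements by `metadata.name`.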
I think this is a good idea indeed. But just to be clear, I've only added the finch_name on the Finch.Pool request callback because of the averages and maxes metrics.
If we move on with only available and in-use connections, this is not necessary.
Should I keep it here unused anyway, to enable this change later? What do you think?
Considering the modifications on #252, I think it would be better to remove these changes from this PR to avoid conflicts.
I'm working on the pool metrics implementation for HTTP/2 and would like to know which metrics you think would be good to keep track of. I'm thinking about in-flight requests instead of available connections, because that makes more sense in a multiplexed scenario like HTTP/2. Is there another metric you think would be good, @sneako?
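An in-flight request counter along these lines could be kept with Erlang's `:atomics`, which works the same way for HTTP/1 and HTTP/2 pools (a sketch under assumed names, not the PR's code):

```elixir
defmodule InFlightSketch do
  # Sketch: one unsigned atomic slot holds the number of requests currently
  # in flight on a pool; incremented when a request/stream starts and
  # decremented when it finishes, regardless of connection multiplexing.
  def new, do: :atomics.new(1, signed: false)

  def request_started(ref), do: :atomics.add(ref, 1, 1)
  def request_finished(ref), do: :atomics.sub(ref, 1, 1)
  def in_flight(ref), do: :atomics.get(ref, 1)
end
```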
Regarding Finch configurations with a pool count greater than one: since the multiple pools are not exposed to the user in any way, I think it makes sense to aggregate metrics from the multiple pools under a single counter and generate metrics as if there were only one pool. An alternative would be to keep separate metrics for each started pool using some index strategy and just show them as a unified metric when the user queries them. What do you think? I'm good with both. Just let me know the approach that better suits your vision for the library, @sneako! (:
Thanks @oliveigah! I agree in-flight requests make more sense for h2 👍 I think individual pool metrics will be more useful than only exposing the aggregates. I definitely like the idea of enabling new load balancing strategies. Individual metrics will also be more useful when users are configuring their pool sizes & counts. Any aggregation, I assume, will take place somewhere else in the observability stack (i.e. Datadog or similar).
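Exposing per-pool metrics keeps aggregation trivial for callers anyway; for instance, summing in-use connections across pools (field names here are hypothetical):

```elixir
# Sketch: aggregate a per-pool metric on the caller's side when a single
# number is wanted; the library itself only exposes the individual pools.
pools = [
  %{pool_index: 1, in_use_connections: 3},
  %{pool_index: 2, in_use_connections: 5}
]

total_in_use = Enum.reduce(pools, 0, fn m, acc -> acc + m.in_use_connections end)
IO.puts(total_in_use)
```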
I think the implementation is done. I'm going to start working on the docs now. In summary:
Let me know what you think @sneako. (:
I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour. I'm going to work on a fix for this; I'll probably need to store the references in a persistent term and read them N times when the user queries the pool status. To add and subtract from the atomics I still think no persistent term call will be needed, but the metric-query calls will need one. Maybe an ETS table would be a better fit for this.
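The persistent_term direction described above might look like this sketch (module and function names are hypothetical). Note that `:persistent_term.put/2` triggers a global GC, which is why the hot-path updates stay in the atomics and only the stable refs live in the persistent term, written only when pools (re)start:

```elixir
defmodule PoolRefsSketch do
  # Sketch: keep each pool's atomics ref under one persistent_term key per
  # Finch instance. Counter updates go straight to the atomics (cheap);
  # the persistent term is only rewritten when pools are (re)started.
  def put_refs(finch_name, refs) when is_list(refs) do
    :persistent_term.put({__MODULE__, finch_name}, refs)
  end

  # Read every pool's counter, e.g. when the user queries pool status.
  def read_all(finch_name) do
    {__MODULE__, finch_name}
    |> :persistent_term.get()
    |> Enum.map(&:atomics.get(&1, 1))
  end
end
```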
Solved! I think the PR is ready for review now! I've made some changes to some test functions to address problems with flaky tests. Mainly, I now check for available ports dynamically and reuse the same server for all tests. Let me know your thoughts! (:
Done @sneako! (: Let me know if you see any other improvements to the proposed implementation.
Thank you @oliveigah!
Second proposal to solve #183
After gathering some feedback, I settled on these two structs for the metrics:
The `Finch.get_pool_status/2` interface also changed a bit. Now it returns a list of metrics structs, one for each started pool, as defined by the `count` option. I've also added some fields to the pool state (such as finch_name and pool_index) to enable future changes more easily.