Implement pool telemetry #248
Conversation
Thank you so much for this contribution!
Running the test suite locally on main also requires your patch for me, so that is a welcome addition.
I have added some comments; please let me know what you think.
I agree with all suggestions! I'm currently on a business trip, so I probably won't be able to work on this until Friday. Regarding the averages and maxes, I think the potential margin of error is very low when we consider a large number of samples (which should be most cases). The problem is taking the measurements without enough samples, where this margin of error may be more relevant. One possible workaround would be to only serve these measurements once there are enough samples, say 100. In that scenario, even if the base metrics are slightly out of sync, the real impact on the average/max measurements should not be relevant. Anyway, I think it's better to move on, as you suggested, with only available and in-use connections, and we can think more about this problem later. Thanks for the feedback! (:
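The minimum-sample idea could be sketched roughly like this (the module name, threshold, and function are hypothetical illustrations, not code from the PR):

```elixir
defmodule PoolMetricsSketch do
  # Hypothetical sketch: only derive average metrics once enough samples
  # have accumulated, so slightly out-of-sync base counters read over a
  # small sample count cannot produce misleading averages.
  @min_samples 100

  # total_time: accumulated wait time; sample_count: number of samples taken.
  def maybe_average(total_time, sample_count) when sample_count >= @min_samples do
    {:ok, div(total_time, sample_count)}
  end

  def maybe_average(_total_time, _sample_count), do: {:error, :not_enough_samples}
end
```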
pool_timeout = Keyword.get(opts, :pool_timeout, 5_000)
receive_timeout = Keyword.get(opts, :receive_timeout, 15_000)

metadata = %{request: req, pool: pool}
Since we're already passing the Finch name in, could we add it to the metadata so it's available in telemetry? It would be really useful for monitoring wait time per Finch instance.
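A minimal sketch of that suggestion, building on the metadata map from the diff above (the `:name` key and the module are illustrative, not existing Finch code):

```elixir
# Hypothetical sketch: include the Finch instance name in the telemetry
# metadata so handlers can break wait-time measurements down per instance.
defmodule TelemetryMetadataSketch do
  def build(req, pool, finch_name) do
    %{request: req, pool: pool, name: finch_name}
  end
end
```

A telemetry handler could then group wait-time measurements by `metadata.name`.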
I think this is a good idea indeed. But just to be clear, I've only added the finch_name on the Finch.Pool request callback because of the averages and maxes metrics.
If we move on with only available and in-use connections, this is not necessary.
Should I keep it here unused anyway, to enable this change later? What do you think?
Considering the modifications on #252, I think it would be better to remove these changes from this PR to avoid conflicts.
I'm working on the pool metrics implementation for HTTP/2 and would like to know which metrics you think would be good to keep track of. I'm thinking about in-flight requests instead of available connections, because that makes more sense in a multiplexed scenario like HTTP/2. Is there another metric you think would be good, @sneako?
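An in-flight request counter along these lines could be kept with Erlang's `:atomics`, which works the same way for HTTP/1 and HTTP/2 pools (a sketch under assumed names, not the PR's code):

```elixir
defmodule InFlightSketch do
  # Sketch: one unsigned atomic slot holds the number of requests currently
  # in flight on a pool; incremented when a request/stream starts and
  # decremented when it finishes, regardless of connection multiplexing.
  def new, do: :atomics.new(1, signed: false)

  def request_started(ref), do: :atomics.add(ref, 1, 1)
  def request_finished(ref), do: :atomics.sub(ref, 1, 1)
  def in_flight(ref), do: :atomics.get(ref, 1)
end
```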
Regarding Finch configurations with a pool count greater than one: since the multiple pools are not exposed to the user in any way, I think it makes sense to aggregate metrics from the multiple pools under a single counter and generate metrics as if there were only one pool. An alternative would be to keep separate metrics for each started pool using some index strategy and just show them as a unified metric when the user queries them. What do you think? I'm good with both. Just let me know the approach that better suits your vision for the library, @sneako! (:
Thanks @oliveigah! I agree in-flight requests make more sense for h2 👍 I think individual pool metrics will be more useful than only exposing the aggregates. I definitely like the idea of enabling new load balancing strategies. Individual metrics will also be more useful when users are configuring their pool sizes & counts. Any aggregation, I assume, will take place somewhere else in the observability stack (i.e. Datadog or similar).
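Exposing per-pool metrics keeps aggregation trivial for callers anyway; for instance, summing in-use connections across pools (field names here are hypothetical):

```elixir
# Sketch: aggregate a per-pool metric on the caller's side when a single
# number is wanted; the library itself only exposes the individual pools.
pools = [
  %{pool_index: 1, in_use_connections: 3},
  %{pool_index: 2, in_use_connections: 5}
]

total_in_use = Enum.reduce(pools, 0, fn m, acc -> acc + m.in_use_connections end)
IO.puts(total_in_use)
```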
I think the implementation is done. I'm going to start working on the docs now. In summary:
Let me know what you think @sneako. (:
I've found a bug with this implementation. When the dynamic supervisor restarts the pool, the PoolManager references are not updated, leading to unexpected behaviour. I'm going to work on a fix for this; I'll probably need to store the references in a persistent term and read them N times when the user queries the pool status. To add and subtract from the atomics I still think no persistent term call will be needed, but the metric-query calls will need one. Maybe an ETS table would be a better fit for this.
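The persistent_term direction described above might look like this sketch (module and function names are hypothetical). Note that `:persistent_term.put/2` triggers a global GC, which is why the hot-path updates stay in the atomics and only the stable refs live in the persistent term, written only when pools (re)start:

```elixir
defmodule PoolRefsSketch do
  # Sketch: keep each pool's atomics ref under one persistent_term key per
  # Finch instance. Counter updates go straight to the atomics (cheap);
  # the persistent term is only rewritten when pools are (re)started.
  def put_refs(finch_name, refs) when is_list(refs) do
    :persistent_term.put({__MODULE__, finch_name}, refs)
  end

  # Read every pool's counter, e.g. when the user queries pool status.
  def read_all(finch_name) do
    {__MODULE__, finch_name}
    |> :persistent_term.get()
    |> Enum.map(&:atomics.get(&1, 1))
  end
end
```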
Solved! I think the PR is ready for review now! I've made some changes to some test functions to address problems with flaky tests. Mainly, I now check for available ports dynamically and reuse the same server for all tests. Let me know your thoughts! (:
Done @sneako! (: Let me know if you see any other improvements to the proposed implementation.
Thank you @oliveigah!
Second proposal to solve #183
After gathering some feedback, I settled on these two structs for the metrics:
The `Finch.get_pool_status/2` interface also changed a bit. Now it returns a list of metrics structs, one for each started pool, as defined by the `count` option. I've also added some fields to the pool state (such as finch_name and pool_index) to enable future changes more easily.