-
Notifications
You must be signed in to change notification settings - Fork 272
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
umbrella: futures channels aren't as bounded as we would like #4103
Comments
6 tasks
garypen
added a commit
that referenced
this issue
Oct 30, 2023
There were a number of issues with the apollo metrics exporter which meant that under load the router would look as though it was leaking memory. It isn't a leak, strictly speaking, but is in fact "lingering" memory. TLDR The sum effect of these changes is that submit_report() is faster and more likely to execute and this means the impact of not being able to receive data is significantly reduced. If, however, we still can't receive data, then the change to discard metrics when the channel is full (which is what the behaviour should always have been) will prevent us from uncontrolled memory growth. If you are interested in details, then the factors which conspire to cause the problems are: 1. Bound channels should really be bound. (see #4103 for more details) It transpires that the futures channel isn't "bound" as tight as we would like in the face of sender clones. Replacing the futures channel with a tokio channel removes the possiblity of memory expansion due to an unresponsive receiver. Now, if the receiver isn't responsive enough, the channel will start rejecting metrics. 2. Altering the priority of the ordering of select! processing with the `biased` statement means the router will favour submitting metrics reports to studio over reading more data and updating reports. This is fine because we do want to send data promptly when a `tick` occurs. 3. Speeding up the `submit_report()` function reduces the possibility of memory bloat when the studio ingress stops accepting metrics.The function is used for submitting traces and metrics, but now checks if traces are present and does the existing retry behaviour if they are. If only metrics, then no retries are performed and the metrics are buffered until the next attempt to submit a report. fixes: #4108 <!-- start metadata --> --- **Checklist** Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review. - [x] Changes are compatible[^1] - [x] Documentation[^2] completed - [x] Performance impact assessed and acceptable - Tests added and passing[^3] - [ ] Unit Tests - [ ] Integration Tests - [ ] Manual Tests **Exceptions** *Note any exceptions here* **Notes** [^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
garypen
added a commit
that referenced
this issue
Nov 3, 2023
This group of channels had to be replaced in one go because they are all inter-related. fixes: #4103
6 tasks
garypen
added a commit
that referenced
this issue
Nov 14, 2023
This group of channels had to be replaced in one go because they are all inter-related. fixes: #4103 <!-- start metadata --> --- **Checklist** Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review. - [x] Changes are compatible[^1] - [x] Documentation[^2] completed - [x] Performance impact assessed and acceptable - Tests added and passing[^3] - [ ] Unit Tests - [ ] Integration Tests - [ ] Manual Tests **Exceptions** *Note any exceptions here* **Notes** [^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
Merged
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Based on this comment: rust-lang/futures-rs#2287 (comment)
it's apparent that a "bounded" channel isn't really "bounded" (at least not as I understand a bound), when you consider that cloning a
Sender
means that you will get a parkedSenderTask
.To confirm my suspicion, I ran a test with a modified router which never read any of the messages from a futures channel and sure enough, the senders never failed and the router's memory grew at a rapid rate as memory was consumed on the sender side of the channel.
The main problem are where this first manifested is in the Apollo Exporter code and I've made the changes to convert the
futures
channel to a tokio channel. I then looked at doing this for all thefutures
channels we are using, but quickly realised that it wasn't a straightforward port because of the complexities of interactions with streams in various places in the router.This umbrella issue lists all the locations where we would need to do more conversion work:
Tasks
Note: I generated that list from a grep through the source code and so some of them might not require conversion when properly evaluated and some may be quite simple to do (like the apollo exporter was). However, I already know the first one that I looked at was tricky (files.rs), so YMMV. I'm going to update the urgency of these as and when I get time.
The text was updated successfully, but these errors were encountered: