umbrella: futures channels aren't as bounded as we would like #4103

garypen · 2023-10-26T13:28:03Z

Based on this comment: rust-lang/futures-rs#2287 (comment)

it's apparent that a "bounded" channel isn't really "bounded" (at least not as I understand a bound), when you consider that cloning a Sender means that you will get a parked SenderTask.

To confirm my suspicion, I ran a test with a modified router which never read any of the messages from a futures channel and sure enough, the senders never failed and the router's memory grew at a rapid rate as memory was consumed on the sender side of the channel.

The main problem are where this first manifested is in the Apollo Exporter code and I've made the changes to convert the futures channel to a tokio channel. I then looked at doing this for all thefutures channels we are using, but quickly realised that it wasn't a straightforward port because of the complexities of interactions with streams in various places in the router.

This umbrella issue lists all the locations where we would need to do more conversion work:

Tasks

Give feedback

apollo-router/src/files.rs #4109
apollo-router/src/notification.rs #4117
apollo-router/src/plugins/telemetry/metrics/apollo.rs #4108
apollo-router/src/query_planner/execution.rs
apollo-router/src/query_planner/subscription.rs
apollo-router/src/query_planner/tests.rs (NOT URGENT: test only)
apollo-router/src/router/mod.rs (NOT URGENT: test only)
apollo-router/src/services/execution_service.rs
apollo-router/src/services/subgraph.rs
apollo-router/src/services/subgraph_service.rs (NOT URGENT: test only)
apollo-router/src/services/supergraph_service.rs
apollo-router/src/uplink/license_stream.rs #4110
Options

Note: I generated that list from a grep through the source code and so some of them might not require conversion when properly evaluated and some may be quite simple to do (like the apollo exporter was). However, I already know the first one that I looked at was tricky (files.rs), so YMMV. I'm going to update the urgency of these as and when I get time.

The text was updated successfully, but these errors were encountered:

There were a number of issues with the apollo metrics exporter which meant that under load the router would look as though it was leaking memory. It isn't a leak, strictly speaking, but is in fact "lingering" memory. TLDR The sum effect of these changes is that submit_report() is faster and more likely to execute and this means the impact of not being able to receive data is significantly reduced. If, however, we still can't receive data, then the change to discard metrics when the channel is full (which is what the behaviour should always have been) will prevent us from uncontrolled memory growth. If you are interested in details, then the factors which conspire to cause the problems are: 1. Bound channels should really be bound. (see #4103 for more details) It transpires that the futures channel isn't "bound" as tight as we would like in the face of sender clones. Replacing the futures channel with a tokio channel removes the possiblity of memory expansion due to an unresponsive receiver. Now, if the receiver isn't responsive enough, the channel will start rejecting metrics. 2. Altering the priority of the ordering of select! processing with the `biased` statement means the router will favour submitting metrics reports to studio over reading more data and updating reports. This is fine because we do want to send data promptly when a `tick` occurs. 3. Speeding up the `submit_report()` function reduces the possibility of memory bloat when the studio ingress stops accepting metrics.The function is used for submitting traces and metrics, but now checks if traces are present and does the existing retry behaviour if they are. If only metrics, then no retries are performed and the metrics are buffered until the next attempt to submit a report. fixes: #4108  --- **Checklist** Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review. - [x] Changes are compatible[^1] - [x] Documentation[^2] completed - [x] Performance impact assessed and acceptable - Tests added and passing[^3] - [ ] Unit Tests - [ ] Integration Tests - [ ] Manual Tests **Exceptions** *Note any exceptions here* **Notes** [^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

This group of channels had to be replaced in one go because they are all inter-related. fixes: #4103

This group of channels had to be replaced in one go because they are all inter-related. fixes: #4103  --- **Checklist** Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review. - [x] Changes are compatible[^1] - [x] Documentation[^2] completed - [x] Performance impact assessed and acceptable - Tests added and passing[^3] - [ ] Unit Tests - [ ] Integration Tests - [ ] Manual Tests **Exceptions** *Note any exceptions here* **Notes** [^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this. [^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples. [^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

garypen mentioned this issue Oct 26, 2023

Fix memory issues in the apollo metrics exporter #4107

Merged

6 tasks

garypen self-assigned this Oct 30, 2023

garypen added a commit that referenced this issue Nov 3, 2023

Replace final group of futures channels with tokio channels

4984a97

This group of channels had to be replaced in one go because they are all inter-related. fixes: #4103

garypen mentioned this issue Nov 3, 2023

Replace final group of futures channels with tokio channels #4138

Merged

6 tasks

garypen closed this as completed in #4138 Nov 14, 2023

abernix mentioned this issue Nov 21, 2023

prep release: v1.34.1 #4245

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

umbrella: futures channels aren't as bounded as we would like #4103

umbrella: futures channels aren't as bounded as we would like #4103

garypen commented Oct 26, 2023 •

edited

Loading

Tasks

umbrella: futures channels aren't as bounded as we would like #4103

umbrella: futures channels aren't as bounded as we would like #4103

Comments

garypen commented Oct 26, 2023 • edited Loading

Tasks

garypen commented Oct 26, 2023 •

edited

Loading