Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

umbrella: futures channels aren't as bounded as we would like #4103

Closed
11 of 12 tasks
garypen opened this issue Oct 26, 2023 · 0 comments · Fixed by #4138
Closed
11 of 12 tasks

umbrella: futures channels aren't as bounded as we would like #4103

garypen opened this issue Oct 26, 2023 · 0 comments · Fixed by #4138
Assignees

Comments

@garypen
Copy link
Contributor

garypen commented Oct 26, 2023

Based on this comment: rust-lang/futures-rs#2287 (comment)

it's apparent that a "bounded" channel isn't really "bounded" (at least not as I understand a bound), when you consider that cloning a Sender means that you will get a parked SenderTask.

To confirm my suspicion, I ran a test with a modified router which never read any of the messages from a futures channel and sure enough, the senders never failed and the router's memory grew at a rapid rate as memory was consumed on the sender side of the channel.

The main problem are where this first manifested is in the Apollo Exporter code and I've made the changes to convert the futures channel to a tokio channel. I then looked at doing this for all thefutures channels we are using, but quickly realised that it wasn't a straightforward port because of the complexities of interactions with streams in various places in the router.

This umbrella issue lists all the locations where we would need to do more conversion work:

Tasks

Preview Give feedback
  1. garypen
  2. garypen
  3. garypen
  4. garypen

Note: I generated that list from a grep through the source code and so some of them might not require conversion when properly evaluated and some may be quite simple to do (like the apollo exporter was). However, I already know the first one that I looked at was tricky (files.rs), so YMMV. I'm going to update the urgency of these as and when I get time.

garypen added a commit that referenced this issue Oct 30, 2023
There were a number of issues with the apollo metrics exporter which
meant that under load the router would look as though it was leaking
memory. It isn't a leak, strictly speaking, but is in fact "lingering"
memory.

TLDR
The sum effect of these changes is that submit_report() is faster and
more likely to execute and this means the impact of not being able to
receive data is significantly reduced. If, however, we still can't
receive data, then the change to discard metrics when the channel is
full (which is what the behaviour should always have been) will prevent
us from uncontrolled memory growth.

If you are interested in details, then the factors which conspire to
cause the problems are:

1. Bound channels should really be bound. (see #4103 for more details)
It transpires that the futures channel isn't "bound" as tight as we
would like in the face of sender clones. Replacing the futures channel
with a tokio channel removes the possiblity of memory expansion due to
an unresponsive receiver. Now, if the receiver isn't responsive enough,
the channel will start rejecting metrics.
    
2. Altering the priority of the ordering of select! processing with the
`biased` statement means the router will favour submitting metrics
reports to studio over reading more data and updating reports. This is
fine because we do want to send data promptly when a `tick` occurs.
    
3. Speeding up the `submit_report()` function reduces the possibility of
memory bloat when the studio ingress stops accepting metrics.The
function is used for submitting traces and metrics, but now checks if
traces are present and does the existing retry behaviour if they are. If
only metrics, then no retries are performed and the metrics are buffered
until the next attempt to submit a report.

fixes: #4108 

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.
@garypen garypen self-assigned this Oct 30, 2023
garypen added a commit that referenced this issue Nov 3, 2023
This group of channels had to be replaced in one go because they are all
inter-related.

fixes: #4103
garypen added a commit that referenced this issue Nov 14, 2023
This group of channels had to be replaced in one go because they are all
inter-related.

fixes: #4103

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant