Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test graceful shutdown with idle connections #3969

Merged
merged 9 commits into from
Oct 10, 2023
Merged

Conversation

Geal
Copy link
Contributor

@Geal Geal commented Oct 4, 2023

Fix #3124
Fix #3941


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@github-actions

This comment has been minimized.

@router-perf
Copy link

router-perf bot commented Oct 4, 2023

CI performance tests

  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • large-request - Stress test with a 1 MB request payload
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • reload - Reload test over a long period of time at a constant rate of users
  • no-graphos - Basic stress test, no GraphOS.
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • const - Basic stress test that runs with a constant number of users

Geal added 3 commits October 4, 2023 16:00
if a connection was opened to the router but never received the intial
request, ti would not be considered by hyper's graceful shutdown
feature, and so the router would wayt indefinitely for the connection to
close.

This tends to happen when a brower opens the explorer or prometheus page
provided by the router, because browsers eagerly open new connections,
in anticipation of future traffic, while those pages only need a single
request to the router.
@Geal Geal requested review from a team, SimonSapin, lrlna and o0Ignition0o October 4, 2023 14:11
Copy link
Contributor

@BrynCooke BrynCooke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved with changelog suggestion

.changesets/fix_geal_idle_connections.md Outdated Show resolved Hide resolved
@garypen garypen self-requested a review October 9, 2023 08:07
Copy link
Contributor

@garypen garypen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have the same problem with UDS. I just manually tested and confirmed that if I have a connection open to the router which never sends any data, then shutdown will hang.

We need to add the same checker for UDS.

Implement the same logic as for TLS and TCP.

I manually confirmed (with: `socat - UNIX-CONNECT:/tmp/router.sock`) that the same problem existed without this fix for unix domain sockets.
The problem goes away with the fix.
@garypen garypen self-requested a review October 9, 2023 12:29
@garypen
Copy link
Contributor

garypen commented Oct 9, 2023

I merged in additional code to address Unix Domain Sockets.

@garypen garypen enabled auto-merge (squash) October 9, 2023 14:08
For instance in CI on ARM or MacOS.

Up the timeout to 1 second.
@garypen garypen merged commit f967a4a into dev Oct 10, 2023
2 checks passed
@garypen garypen deleted the geal/idle-connections branch October 10, 2023 12:47
This was referenced Oct 17, 2023
EverlastingBugstopper referenced this pull request in apollographql/rover Oct 20, 2023
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [apollographql/router](https://togithub.com/apollographql/router) |
minor | `v1.32.0` -> `v1.33.0` |

---

### Release Notes

<details>
<summary>apollographql/router (apollographql/router)</summary>

###
[`v1.33.0`](https://togithub.com/apollographql/router/releases/tag/v1.33.0)

[Compare
Source](https://togithub.com/apollographql/router/compare/v1.32.0-alpha.0...v1.33.0)

#### 🚀 Features

##### Add `hasNext` to SupergraphRequest ([Issue
#&#8203;4016](https://togithub.com/apollographql/router/issues/4016))

Coprocessors multi-part response support has been enhanced to include
`hasNext`, allowing you to determine when a request has completed.

When `stage` is `SupergraphResponse`, `hasNext` if present and `true`
indicates that there will be subsequent `SupergraphResponse` calls to
the co-processor for each multi-part (`@defer`/subscriptions) response.

See the [coprocessor
documentation](https://www.apollographql.com/docs/router/customizations/coprocessor/)
for more details.

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/4017](https://togithub.com/apollographql/router/pull/4017)

##### Expose the ability to set topology spread constraints on the helm
chart ([3891](https://togithub.com/apollographql/router/issues/3891))

Give developers the ability to set topology spread constraints that can
be used to guarantee that federation pods are spread out evenly across
AZs.

By [@&#8203;bjoernw](https://togithub.com/bjoernw) in
[https://github.com/apollographql/router/pull/3892](https://togithub.com/apollographql/router/pull/3892)

#### 🐛 Fixes

##### Ignore JWKS keys which aren't supported by the router ([Issue
#&#8203;3853](https://togithub.com/apollographql/router/issues/3853))

If you have a JWKS which contains a key which has an algorithm (alg)
which the router doesn't recognise, then the entire JWKS is disregarded
even if there were other keys in the JWKS which the router could use.

We have changed the JWKS processing logic so that we remove entries with
an unrecognised algorithm from the list of available keys. We print a
warning with the name of the algorithm for each removed entry.

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/3922](https://togithub.com/apollographql/router/pull/3922)

##### Fix panic when streaming responses to co-processor ([Issue
#&#8203;4013](https://togithub.com/apollographql/router/issues/4013))

Streamed responses will no longer cause a panic in the co-processor
plugin. This affected defer and stream queries.

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/4014](https://togithub.com/apollographql/router/pull/4014)

##### Only reject defer/subscriptions if actually part of a batch
([Issue
#&#8203;3956](https://togithub.com/apollographql/router/issues/3956))

Fix the checking logic so that deferred queries or subscriptions will
only be rejected when experimental batching is enabled and the
operations are part of a batch.

Without this fix, all subscriptions or deferred queries would be
rejected when experimental batching support was enabled.

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/3959](https://togithub.com/apollographql/router/pull/3959)

##### Fix requires selection in arrays ([Issue
#&#8203;3972](https://togithub.com/apollographql/router/issues/3972))

When a field has a `@requires` annotation that selects an array, and
some fields are missing in that array or some of the elements are null,
the router would short circuit the selection and remove the entire
array. This relaxes the condition to allow nulls in the selected array

By [@&#8203;Geal](https://togithub.com/Geal) in
[https://github.com/apollographql/router/pull/3975](https://togithub.com/apollographql/router/pull/3975)

##### Fix router hang when opening the explorer, prometheus or health
check page ([Issue
#&#8203;3941](https://togithub.com/apollographql/router/issues/3941))

The Router did not gracefully shutdown when an idle connections are made
by a client, and would instead hang. In particular, web browsers make
such connection in anticipation of future traffic.

This is now fixed, and the Router will now gracefully shut down in a
timely fashion.

By [@&#8203;Geal](https://togithub.com/Geal) in
[https://github.com/apollographql/router/pull/3969](https://togithub.com/apollographql/router/pull/3969)

##### Fix hang and high CPU usage when compressing small responses ([PR
#&#8203;3961](https://togithub.com/apollographql/router/pull/3961))

When returning small responses (less than 10 bytes) and compressing them
using gzip, the router could go into an infinite loop

By [@&#8203;Geal](https://togithub.com/Geal) in
[https://github.com/apollographql/router/pull/3961](https://togithub.com/apollographql/router/pull/3961)

#### 📃 Configuration

##### Add `enabled` field for telemetry exporters ([PR
#&#8203;3952](https://togithub.com/apollographql/router/pull/3952))

Telemetry configuration now supports `enabled` on all exporters. This
allows exporters to be disabled without removing them from the
configuration and in addition allows for a more streamlined default
configuration.

```diff
telemetry:
  tracing: 
    datadog:
+      enabled: true
    jaeger:
+      enabled: true
    otlp:
+      enabled: true
    zipkin:
+      enabled: true
```

Existing configurations will be migrated to the new format automatically
on startup. However, you should update your configuration to use the new
format as soon as possible.

By [@&#8203;BrynCooke](https://togithub.com/BrynCooke) in
[https://github.com/apollographql/router/pull/3952](https://togithub.com/apollographql/router/pull/3952)

#### 🛠 Maintenance

##### Create a replacement self-signed server certificate: 10 years
lifespan ([Issue
#&#8203;3998](https://togithub.com/apollographql/router/issues/3998))

This certificate is only used for testing, so 10 years lifespan is
acceptable.

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/4009](https://togithub.com/apollographql/router/pull/4009)

#### 📚 Documentation

##### Updated documentation for deploying router ([PR
#&#8203;3943](https://togithub.com/apollographql/router/pull/3943))

Updated documentation for containerized router deployments, with guides
and examples for [deploying on
Kubernetes](https://www.apollographql.com/docs/router/containerization/kubernetes)
and [running on
Docker](https://www.apollographql.com/docs/router/containerization/docker).

By [@&#8203;shorgi](https://togithub.com/shorgi) in
[https://github.com/apollographql/router/pull/3943](https://togithub.com/apollographql/router/pull/3943)

##### Document guidance for request and response buffering ([Issue
#&#8203;3838](https://togithub.com/apollographql/router/issues/3838))

Provides specific guidance on request and response buffering within the
router.

By [@&#8203;garypen](https://togithub.com/garypen) in
[https://github.com/apollographql/router/pull/3970](https://togithub.com/apollographql/router/pull/3970)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/apollographql/rover).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4xOS4yIiwidXBkYXRlZEluVmVyIjoiMzcuMTkuMiIsInRhcmdldEJyYW5jaCI6Im1haW4ifQ==-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of idle initial connections Router does not shut down until all connections are closed
3 participants