Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[batching] first pass at fit and finish for production #4915

Merged
merged 19 commits into from
Apr 5, 2024

Conversation

garypen
Copy link
Contributor

@garypen garypen commented Apr 4, 2024

A lot of changes mainly geared around improving production stability. In no particular order:

 - Keep a reference to the shared Batch in each BatchQuery
 - Introduce SubgraphBatchingError and use in many places
 - set_query_hashes now returns a Result
 - so does signal_cancelled
 - and signal_progress
 - allow our spawned task to return an error
 - remove many redundant TODO messages
 - Don't break loops on error, but continue looping and report all
   errors
 - Eliminate all expect/unwrap in non test code
 - Store the batch size in the Batch (may still remove that...)
 - assemble_batch() returns a Result
 - fix tests so they don't require TEST_APOLLO_KEY
 - Introduce BatchInfo to make map_filter easier to read in
   process_batches()
 - Process assemble_batch() errors in process_batches()
 - Enforce info.len() == txs.len()
 - Add CacheResolverError::BatchingError (and associated code)

Ref: Ref: #4661

Geal and others added 14 commits April 3, 2024 21:02
…#4898)

Since we made our images more secure, we run our router process as user
'router'. If we are running under 'heaptrack', e.g.: in a debug image,
then we cannot write to /dist/data because it is owned by 'root'.

This changes the ownership of /dist/data from 'root' to 'router' to
allow writes to succeed.
## 🐛 Fixes

### Security fix: update h2 dependency

References:
- https://rustsec.org/advisories/RUSTSEC-2024-0332
- https://seanmonstar.com/blog/hyper-http2-continuation-flood/
- https://www.kb.cert.org/vuls/id/421644

The router's performance could be degraded when receiving a flood of
HTTP/2 CONTINUATION frames, when the Router is set up to terminate TLS
for client connections.

By [@Geal](https://github.com/geal)
Follow-up to the v1.43.2 being officially released, bringing version
bumps and changelog updates into the `dev` branch.
It can be difficult to understand 'router service call failed' messages.
Adding the error detail should make them more comprehensible.

fixes: #4899
Fix #4834

This extends the schema aware hashing already employed for subgraph queries
in entity caching, to be calculated for client queries, and look at that hash in
the query planner cache to be able to reuse cached entries across schema
reloads.
This contains:
- an update of the traverse visitor to use an `ExecutableDocument`
(necessary to parse field sets in `key` and `requires` argument
- parses field sets in `@join__type`'s `key` argument and
`@join__field`'s `requires` argument, because they can be affected by
schema updates
- remove the hack around `_entities` operation
- parse subgraph queries using the subgraph schemas extracted from the
supergraph schema
- update query planner cache warm up to check if the query hash changed
[![Mend
Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [which](https://togithub.com/harryfei/which-rs) | dependencies | major
| `5.0.0` -> `6.0.1` |

---

### Release Notes

<details>
<summary>harryfei/which-rs (which)</summary>

###
[`v6.0.1`](https://togithub.com/harryfei/which-rs/blob/HEAD/CHANGELOG.md#601)

[Compare
Source](https://togithub.com/harryfei/which-rs/compare/6.0.0...6.0.1)

- Remove dependency on `once_cell` for Windows users, replace with
`std::sync::OnceLock`.

###
[`v6.0.0`](https://togithub.com/harryfei/which-rs/blob/HEAD/CHANGELOG.md#600)

[Compare
Source](https://togithub.com/harryfei/which-rs/compare/5.0.0...6.0.0)

-   MSRV is now 1.70
-   Upgraded all dependencies to latest version

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined),
Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR has been generated by [Mend
Renovate](https://www.mend.io/free-developer-tools/renovate/). View
repository job log
[here](https://developer.mend.io/github/apollographql/router).

<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4xNzMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjI2MS4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2In0=-->

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This attribute is not yet implemented, we are tracking the work to do
that in #4830

However until we actually implement it we should remove from the docs
Fix #4880 

The entity cache plugin intended to require a `Cache-Control` header
from the subgraph to decide whether or not a response should be cached.
Unfortunately in the way tit was set up, all responses were stored.
The plugin now makes sure that the `Cache-Control` is there, and if a
subgraph does not provide it, then the aggregated `Cache-Control` header
sent to the client will contain `no-store`.

Additionally, the Router will now check that a TTL is configured for all
subgraphs, either in per subgraph configuration, or globally.
A lot of changes mainly geared around improving production stability. In
no particular order:

 - Keep a reference to the shared Batch in each BatchQuery
 - Introduce SubgraphBatchingError and use in many places
 - set_query_hashes now returns a Result
 - so does signal_cancelled
 - and signal_progress
 - allow our spawned task to return an error
 - remove many redundant TODO messages
 - Don't break loops on error, but continue looping and report all
   errors
 - Eliminate all expect/unwrap in non test code
 - Store the batch size in the Batch (may still remove that...)
 - assemble_batch() returns a Result
 - fix tests so they don't require TEST_APOLLO_KEY
 - Introduce BatchInfo to make map_filter easier to read in
   process_batches()
 - Process assemble_batch() errors in process_batches()
 - Enforce info.len() == txs.len()
 - Add CacheResolverError::BatchingError (and associated code)

Quite a lot of changes...
Fixes #3388

In GraphQL requests, `extensions` is an optional map. Passing an
explicit `null` was incorrectly considered a parse error. Now it is
equivalent to omiting that field entirely, or to passing an empty map.

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.

---------

Co-authored-by: Bryn Cooke <[email protected]>
Co-authored-by: bryn <[email protected]>
@garypen garypen requested a review from nicholascioli April 4, 2024 13:49
@garypen garypen self-assigned this Apr 4, 2024
@router-perf
Copy link

router-perf bot commented Apr 4, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

@garypen garypen changed the title First pass at fit and finish for production [batching] first pass at fit and finish for production Apr 4, 2024
tninesling and others added 2 commits April 4, 2024 10:00
Estimates the cost of a query plan by aggregating the costs of
operations in the individual fetch nodes. For multiple nodes, the total
cost is the sum of the individual costs. For a conditional branch, the
cost is the max of the two branches. All deferred nodes count toward the
static cost as if they were not deferred.

[ROUTER-174](https://apollographql.atlassian.net/browse/ROUTER-174)

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [X] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [X] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.


[ROUTER-174]:
https://apollographql.atlassian.net/browse/ROUTER-174?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
@garypen garypen requested a review from a team as a code owner April 4, 2024 15:05
@garypen garypen removed the request for review from a team April 4, 2024 15:06
tninesling and others added 2 commits April 4, 2024 10:40
Introduces a third method to `CostCalculator` which scores the cost of a
GraphQL response.

The first stage of scoring a response is to zip it together with the
request. The initial implementation does not use the request fields, but
this will be required to support custom cost. In that case, we will need
to check the corresponding request field definition to see if a custom
cost needs to be applied to a particular response field. This
information is stored in a counterpart to `serde_json::Value`, which is
called `TypedValue`. This `TypedValue` enum pairs each JSON value with
the corresponding `apollo_compiler::executable::Field`.

Once the response values are paired with their corresponding fields, it
is simple to traverse the JSON structure, taking query or schema
directives into account.

Implements
[ROUTER-175](https://apollographql.atlassian.net/browse/ROUTER-175)

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [X] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [X] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.


[ROUTER-175]:
https://apollographql.atlassian.net/browse/ROUTER-175?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Copy link
Contributor

@nicholascioli nicholascioli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried that if we have the test keys enforced only when checking the result, then it might fail when running it locally without those keys since the feature requires a key.

Besides that, I left a few nits and suggestions!

apollo-router/src/batching.rs Outdated Show resolved Hide resolved
spawn_handle: JoinHandle<Result<(), BoxError>>,

/// What is the size (number of input operations) of the batch?
#[allow(dead_code)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be allowed if nothing is using it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now. I'm going to do another pass before starting code review and I'm trying to make my mind up if this is useful (for debug logs) or not. I feel like it might be useful to have it appear in the debug representation of a Batch, but rustc can't work that out.

apollo-router/src/batching.rs Outdated Show resolved Hide resolved
apollo-router/src/batching.rs Show resolved Hide resolved
let (op_name, _context, request, txs) = assemble_batch(requests).await;
let (op_name, _context, request, txs) = assemble_batch(requests)
.await
.expect("it can assemble a batch");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we expect this to never fail, should assemble_batch not return a result?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand. This is a test, so if it does fail, the test will fail and we have an issue to resolve.

apollo-router/src/services/router/service.rs Outdated Show resolved Hide resolved
apollo-router/src/services/subgraph_service.rs Outdated Show resolved Hide resolved
}))
.await
.into_iter()
.filter_map(|x: Result<BatchInfo, BoxError>| x.map_err(|e| errors.push(e)).ok())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a super nit, but do we want this lambda to have side-effects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the only way I can think of to handle both errors and oks. Alternative approaches that achieve that are fine by me.

@garypen
Copy link
Contributor Author

garypen commented Apr 5, 2024

I'm a little worried that if we have the test keys enforced only when checking the result, then it might fail when running it locally without those keys since the feature requires a key.

I couldn't think of a good way of enforcing this, so I looked at what file upload does and it, effectively, does the same thing. It's a bit more fiddly here, because we return a Vec<Response> so I needed to come up with a slightly more intrusive solution.

If you run cargo test without a key you get a whole load of failures and I find that more intrusive. Also: I don't understand this comment:

Note: The [IntegrationTest] ensures that these test credentials get
            // set before running the router.

I don't get that behaviour, I just get a bunch of test fails.

Besides that, I left a few nits and suggestions!

@garypen garypen merged commit 7a93557 into garypen/2002-subgraph-batching Apr 5, 2024
2 of 5 checks passed
@garypen garypen deleted the garypen/fit-and-finish-part-1 branch April 5, 2024 07:25
@apollo-bot2
Copy link

Detected SAST Vulnerabilities

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants