[batching] first pass at fit and finish for production #4915

garypen · 2024-04-04T13:49:24Z

A lot of changes mainly geared around improving production stability. In no particular order:

 - Keep a reference to the shared Batch in each BatchQuery
 - Introduce SubgraphBatchingError and use in many places
 - set_query_hashes now returns a Result
 - so does signal_cancelled
 - and signal_progress
 - allow our spawned task to return an error
 - remove many redundant TODO messages
 - Don't break loops on error, but continue looping and report all
   errors
 - Eliminate all expect/unwrap in non test code
 - Store the batch size in the Batch (may still remove that...)
 - assemble_batch() returns a Result
 - fix tests so they don't require TEST_APOLLO_KEY
 - Introduce BatchInfo to make map_filter easier to read in
   process_batches()
 - Process assemble_batch() errors in process_batches()
 - Enforce info.len() == txs.len()
 - Add CacheResolverError::BatchingError (and associated code)

Ref: #4661
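Two of the items above (returning a `Result` instead of panicking, and continuing a loop after an error so that every failure is reported) follow a common Rust pattern. The following is a minimal sketch of that pattern only; `BatchError`, `notify`, and `notify_all` are hypothetical names for illustration and are not taken from the router codebase.

```rust
use std::fmt;

/// Hypothetical error type, standing in for something like `SubgraphBatchingError`.
#[derive(Debug, PartialEq)]
enum BatchError {
    /// The receiving side of a (simulated) channel has gone away.
    ReceiverGone(usize),
}

impl fmt::Display for BatchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BatchError::ReceiverGone(index) => write!(f, "receiver {index} is gone"),
        }
    }
}

impl std::error::Error for BatchError {}

/// Simulated send: returns an error when the receiver is closed, rather than panicking.
fn notify(index: usize, receiver_open: bool) -> Result<(), BatchError> {
    if receiver_open {
        Ok(())
    } else {
        Err(BatchError::ReceiverGone(index))
    }
}

/// Notify every receiver. An error does not end the loop; all errors are
/// collected and returned together once the loop has finished.
fn notify_all(receivers: &[bool]) -> Result<(), Vec<BatchError>> {
    let mut errors = Vec::new();
    for (index, open) in receivers.iter().enumerate() {
        if let Err(e) = notify(index, *open) {
            errors.push(e);
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

With this shape, a caller that passes `&[false, true, false]` gets back both failures (indices 0 and 2) rather than only the first one.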

…#4898) Since we made our images more secure, we run our router process as user 'router'. If we are running under 'heaptrack', e.g.: in a debug image, then we cannot write to /dist/data because it is owned by 'root'. This changes the ownership of /dist/data from 'root' to 'router' to allow writes to succeed.

@Geal

## 🐛 Fixes

### Security fix: update h2 dependency

References:
- https://rustsec.org/advisories/RUSTSEC-2024-0332
- https://seanmonstar.com/blog/hyper-http2-continuation-flood/
- https://www.kb.cert.org/vuls/id/421644

The router's performance could be degraded when receiving a flood of HTTP/2 CONTINUATION frames, when the Router is set up to terminate TLS for client connections.

By [@Geal](https://github.com/geal)

Follow-up to the v1.43.2 being officially released, bringing version bumps and changelog updates into the `dev` branch.

It can be difficult to understand 'router service call failed' messages. Adding the error detail should make them more comprehensible. fixes: #4899

Fix #4834

This extends the schema-aware hashing already employed for subgraph queries in entity caching so that it is also calculated for client queries, and uses that hash in the query planner cache so that cached entries can be reused across schema reloads.

This contains:
- an update of the traverse visitor to use an `ExecutableDocument` (necessary to parse field sets in the `key` and `requires` arguments)
- parsing of field sets in `@join__type`'s `key` argument and `@join__field`'s `requires` argument, because they can be affected by schema updates
- removal of the hack around the `_entities` operation
- parsing of subgraph queries using the subgraph schemas extracted from the supergraph schema
- an update of the query planner cache warm-up to check whether the query hash changed

[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com)

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [which](https://togithub.com/harryfei/which-rs) | dependencies | major | `5.0.0` -> `6.0.1` |

---

### Release Notes

<details>
<summary>harryfei/which-rs (which)</summary>

### [`v6.0.1`](https://togithub.com/harryfei/which-rs/blob/HEAD/CHANGELOG.md#601)

[Compare Source](https://togithub.com/harryfei/which-rs/compare/6.0.0...6.0.1)

- Remove dependency on `once_cell` for Windows users, replace with `std::sync::OnceLock`.

### [`v6.0.0`](https://togithub.com/harryfei/which-rs/blob/HEAD/CHANGELOG.md#600)

[Compare Source](https://togithub.com/harryfei/which-rs/compare/5.0.0...6.0.0)

- MSRV is now 1.70
- Upgraded all dependencies to latest version

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/apollographql/router).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

This attribute is not yet implemented; we are tracking the work to do that in #4830. However, until we actually implement it, we should remove it from the docs.

Fix #4880

The entity cache plugin was intended to require a `Cache-Control` header from the subgraph to decide whether or not a response should be cached. Unfortunately, because of the way it was set up, all responses were stored. The plugin now makes sure that the `Cache-Control` header is there, and if a subgraph does not provide it, then the aggregated `Cache-Control` header sent to the client will contain `no-store`. Additionally, the Router will now check that a TTL is configured for all subgraphs, either in per-subgraph configuration or globally.

Fixes #3388

In GraphQL requests, `extensions` is an optional map. Passing an explicit `null` was incorrectly considered a parse error. Now it is equivalent to omitting that field entirely, or to passing an empty map.

---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

---------

Co-authored-by: Bryn Cooke <[email protected]>
Co-authored-by: bryn <[email protected]>

router-perf · 2024-04-04T13:49:54Z

Estimates the cost of a query plan by aggregating the costs of operations in the individual fetch nodes. For multiple nodes, the total cost is the sum of the individual costs. For a conditional branch, the cost is the max of the two branches. All deferred nodes count toward the static cost as if they were not deferred.

[ROUTER-174](https://apollographql.atlassian.net/browse/ROUTER-174)

---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [X] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [X] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

[ROUTER-174]: https://apollographql.atlassian.net/browse/ROUTER-174?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
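The cost rules described above (sum over multiple nodes, max over the two branches of a condition, deferred nodes counted as if not deferred) can be sketched as a small recursive function. This is only an illustration under stated assumptions: `PlanNode`, its variants, and `estimated_cost` are hypothetical names, and each fetch carries a given cost number rather than one computed from its operations.

```rust
/// Hypothetical, simplified query plan shape; not the router's actual types.
enum PlanNode {
    /// A single fetch with an already-known cost.
    Fetch(u64),
    /// Several nodes that all execute.
    Many(Vec<PlanNode>),
    /// Only one of the two branches executes at runtime.
    Condition {
        if_branch: Box<PlanNode>,
        else_branch: Box<PlanNode>,
    },
    /// A deferred node, counted as though it were not deferred.
    Defer(Box<PlanNode>),
}

/// Static cost estimate following the rules described in the text.
fn estimated_cost(node: &PlanNode) -> u64 {
    match node {
        PlanNode::Fetch(cost) => *cost,
        // Multiple nodes: sum of the individual costs.
        PlanNode::Many(nodes) => nodes.iter().map(estimated_cost).sum(),
        // Conditional: the more expensive of the two branches.
        PlanNode::Condition { if_branch, else_branch } => {
            estimated_cost(if_branch).max(estimated_cost(else_branch))
        }
        // Deferred: same cost as the inner node.
        PlanNode::Defer(inner) => estimated_cost(inner),
    }
}
```

For example, a plan made of a fetch costing 2, a condition whose branches cost 5 and 3, and a deferred fetch costing 4 is estimated at 2 + 5 + 4 = 11.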

apollo-router/src/services/subgraph_service.rs

Introduces a third method to `CostCalculator` which scores the cost of a GraphQL response.

The first stage of scoring a response is to zip it together with the request. The initial implementation does not use the request fields, but this will be required to support custom cost. In that case, we will need to check the corresponding request field definition to see if a custom cost needs to be applied to a particular response field. This information is stored in a counterpart to `serde_json::Value`, which is called `TypedValue`. This `TypedValue` enum pairs each JSON value with the corresponding `apollo_compiler::executable::Field`.

Once the response values are paired with their corresponding fields, it is simple to traverse the JSON structure, taking query or schema directives into account.

Implements [ROUTER-175](https://apollographql.atlassian.net/browse/ROUTER-175)

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [X] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [X] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

[ROUTER-175]: https://apollographql.atlassian.net/browse/ROUTER-175?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

nicholascioli

I'm a little worried that if we have the test keys enforced only when checking the result, then it might fail when running it locally without those keys since the feature requires a key.

Besides that, I left a few nits and suggestions!

nicholascioli · 2024-04-04T19:08:23Z

apollo-router/src/batching.rs

+    spawn_handle: JoinHandle<Result<(), BoxError>>,
+
+    /// What is the size (number of input operations) of the batch?
+    #[allow(dead_code)]


Do we want this to be allowed if nothing is using it?

For now. I'm going to do another pass before starting code review and I'm trying to make my mind up if this is useful (for debug logs) or not. I feel like it might be useful to have it appear in the debug representation of a Batch, but rustc can't work that out.
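A small illustration of the situation discussed here, assuming a hypothetical, simplified `Batch` struct (not the router's real type): a field that is only ever read through the derived `Debug` implementation is still reported by the `dead_code` lint, since rustc does not treat derived `Debug` impls as a use of the field. The `allow` attribute silences the warning while the field still shows up in debug output.

```rust
/// Hypothetical, simplified stand-in for the router's `Batch`.
#[derive(Debug)]
struct Batch {
    /// Number of input operations in the batch. It is only read through
    /// `Debug`, so without the `allow` the compiler would warn about it.
    #[allow(dead_code)]
    size: usize,
}

impl Batch {
    fn new(size: usize) -> Self {
        Batch { size }
    }
}

/// Render the batch as it would appear in a debug log line.
fn describe(batch: &Batch) -> String {
    format!("{batch:?}")
}
```

Calling `describe(&Batch::new(3))` yields `Batch { size: 3 }`, which is the kind of debug representation the reply above has in mind.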

nicholascioli · 2024-04-04T19:17:18Z

apollo-router/src/batching.rs

-        let (op_name, _context, request, txs) = assemble_batch(requests).await;
+        let (op_name, _context, request, txs) = assemble_batch(requests)
+            .await
+            .expect("it can assemble a batch");


If we expect this to never fail, should assemble_batch not return a result?

Not sure I understand. This is a test, so if it does fail, the test will fail and we have an issue to resolve.

apollo-router/src/services/router/service.rs

nicholascioli · 2024-04-04T19:19:42Z

apollo-router/src/services/subgraph_service.rs

        }))
        .await
        .into_iter()
+        .filter_map(|x: Result<BatchInfo, BoxError>| x.map_err(|e| errors.push(e)).ok())


This is a super nit, but do we want this lambda to have side-effects?

It's the only way I can think of to handle both errors and oks. Alternative approaches that achieve that are fine by me.
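One possible alternative that keeps both the successes and the errors without a side-effecting closure is to split the results in an explicit loop. This is a sketch using only the standard library; `split_results` is a hypothetical helper name, not something taken from the PR.

```rust
/// Split a list of results into successes and failures in a single pass,
/// preserving the original order within each group.
fn split_results<T, E>(results: Vec<Result<T, E>>) -> (Vec<T>, Vec<E>) {
    let mut oks = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        match result {
            Ok(value) => oks.push(value),
            Err(error) => errors.push(error),
        }
    }
    (oks, errors)
}
```

The trade-off is an extra helper and two intermediate vectors in exchange for a closure-free, easier-to-read call site; whether that is worth it here is a matter of taste.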

apollo-router/src/services/subgraph_service.rs

garypen · 2024-04-05T07:12:37Z

> I'm a little worried that if we have the test keys enforced only when checking the result, then it might fail when running it locally without those keys since the feature requires a key.

I couldn't think of a good way of enforcing this, so I looked at what file upload does and it, effectively, does the same thing. It's a bit more fiddly here, because we return a Vec<Response> so I needed to come up with a slightly more intrusive solution.

If you run cargo test without a key you get a whole load of failures and I find that more intrusive. Also: I don't understand this comment:

> // Note: The [IntegrationTest] ensures that these test credentials get
> // set before running the router.

I don't get that behaviour, I just get a bunch of test fails.

> Besides that, I left a few nits and suggestions!

apollo-bot2 · 2024-04-05T07:25:37Z

Detected SAST Vulnerabilities

Geal and others added 14 commits April 3, 2024 21:02

Security fix: update h2 dependency

e2a9c05

prep release: v1.43.2-rc.0

190ba90

release: v1.43.2

229fcc4

Reconcile dev after merge to main for v1.43.2 (#4912)

d7dcd31

add error details to 'router service call failed' (#4900)

114825d

[docs] Remove supergraph selector response body (#4905)

e9e2435

fix a couple of things I saw in the last review

e000620

garypen requested a review from nicholascioli April 4, 2024 13:49

garypen self-assigned this Apr 4, 2024

garypen changed the title ~~First pass at fit and finish for production~~ [batching] first pass at fit and finish for production Apr 4, 2024

tninesling and others added 2 commits April 4, 2024 10:00

Merge branch 'dev' into garypen/fit-and-finish-part-1

f3a31e9

garypen requested a review from a team as a code owner April 4, 2024 15:05

garypen removed the request for review from a team April 4, 2024 15:06

garypen commented Apr 4, 2024

apollo-router/src/services/subgraph_service.rs (outdated)

garypen commented Apr 4, 2024

apollo-router/src/services/subgraph_service.rs (outdated)

tninesling and others added 2 commits April 4, 2024 10:40

Merge branch 'dev' into garypen/fit-and-finish-part-1

5ce2593

nicholascioli requested changes Apr 4, 2024

code review changes

4949a97

garypen merged commit 7a93557 into garypen/2002-subgraph-batching Apr 5, 2024
2 of 5 checks passed

garypen deleted the garypen/fit-and-finish-part-1 branch April 5, 2024 07:25
