Cleanup batching to handle aborted requests #4825

Merged
merged 27 commits into garypen/2002-subgraph-batching from nc/subgraph-batching/messaging on Mar 25, 2024

Conversation

nicholascioli
Contributor

This PR cleans up the batching interface to allow for handling cancelled requests or errors during batch construction.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Tests added and passing³
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

abernix and others added 13 commits March 15, 2024 13:31
Clippy is complaining

Co-authored-by: bryn <[email protected]>
Created scaffolding for new demand control plugin with a simple config.
The max cost setting is planned to come from the limits block of the
configuration, so it is not included in this change. The plugin is
currently not registered so it does not appear in the config schema
during development

Fixes
[ROUTER-181](https://apollographql.atlassian.net/browse/ROUTER-181)

[ROUTER-181]:
https://apollographql.atlassian.net/browse/ROUTER-181?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
@nicholascioli nicholascioli requested a review from garypen March 19, 2024 21:45
@nicholascioli nicholascioli self-assigned this Mar 19, 2024
@router-perf

router-perf bot commented Mar 19, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

@nicholascioli nicholascioli changed the base branch from dev to garypen/2002-subgraph-batching March 19, 2024 21:46
@nicholascioli
Contributor Author

The current state of this PR is just to test whether we agree on how the batching itself should work, in isolation from the rest of the router. More commits will be added later to integrate these changes into the rest of the router.

I've left TODOs in spots that I think warrant discussion and lengthy comments everywhere else.

@garypen garypen left a comment


I've tried to address as many of the TODO comments as I could.

It seems like the main complication is "how do we identify exactly which parts of the batch are dead" when we kill part of the batch. I believe the way to do this is to replace storing subgraph fetch counts with the collection of `FetchNode` operations (or some memoization of them) and then simply use the index to remove that collection of operations from the fetches we would perform in our batches.

I might spend some time investigating that today.

More comments:

I think that the `BatchQuery` structs need to drop their sender once they either Abort or signal Ready. That will make the processing logic a bit simpler in the spawned handler, because it can check for batch completion when it terminates its message-receive loop. If the batch isn't ready at that point, we have decisions to make. Could we just go ahead with what we have and log warnings for the missing components? Or should we panic with the missing components? Either way, I do think this is a good way to check for batch completion in the face of badly behaved client-side participation.
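The drop-the-sender idea can be sketched with a plain std channel (the router itself uses an async tokio channel; all names here are illustrative, not the router's actual API): once every participant has dropped its sender, the receive loop ends on its own, and the handler can then check whether the batch completed.

```rust
use std::sync::mpsc;

/// Messages a batch participant can send before dropping its sender.
enum BatchMsg {
    Ready(usize),
    Abort(usize),
}

/// Drain messages until every sender is dropped, then check completion.
/// Returns the indices that signalled Ready and whether the batch is complete.
fn drain_batch(rx: mpsc::Receiver<BatchMsg>, size: usize) -> (Vec<usize>, bool) {
    let mut ready = Vec::new();
    // recv() returns Err once all senders are dropped, which ends the loop
    // without needing an explicit shutdown message.
    while let Ok(msg) = rx.recv() {
        match msg {
            BatchMsg::Ready(index) => ready.push(index),
            // An aborted participant never becomes ready; just ignore it here.
            BatchMsg::Abort(_) => {}
        }
    }
    // If the batch isn't complete at this point, the caller can decide
    // whether to proceed with warnings or to fail outright.
    let complete = ready.len() == size;
    (ready, complete)
}
```

The design choice is that channel closure itself becomes the completion signal, so no sentinel message is needed.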

```rust
self.shared.lock().finished()
/// Inform the batch of the amount of fetches needed for this element of the batch query
pub(crate) async fn set_subgraph_fetches(&self, fetches: usize) {
// TODO: How should we handle the sender dying?
```

If send fails, either:

  • We are out of channel capacity (very unlikely if we configure that to be large enough)
  • The receiver has gone away

Assuming we can ignore the first cause (for all practical purposes), the second cause indicates that batch processing is just not going to work, so let's return an error and let the invoker figure that out.

So: For either case we need to return a Result from this function and deal with the consequences in the invoker.
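A minimal sketch of the suggested Result-returning shape, using a std channel in place of the router's async one (the error type and message here are stand-ins for illustration):

```rust
use std::sync::mpsc;

/// Sketch: propagate send failure to the invoker instead of ignoring it.
/// With a std channel, send() errs exactly when the receiver has gone away,
/// which mirrors the "batch handler died" case discussed above.
fn set_subgraph_fetches(sender: &mpsc::Sender<usize>, fetches: usize) -> Result<(), String> {
    sender
        .send(fetches)
        // The receiver going away means batch processing cannot proceed,
        // so surface that to the invoker rather than swallowing it.
        .map_err(|_| "batch handler has gone away; cannot set fetch count".to_string())
}
```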

```rust
guard.finished = true;
std::mem::take(&mut guard.waiters)
}
// TODO: How should we handle the sender dying?
```

See above.

```rust
}
enum BatchHandlerMessage {
/// Abort one of the sub requests
// TODO: How do we know which of the subfetches of the entire query to abort? Is it all of them?
```

This is the part of the problem I was poking around with yesterday, trying to figure out if there's a unique way to identify subgraph fetches with input operations.
I do think this is solvable with a combination of `index` and `Vec<QueryHash>`, since we could store both of those details when we call `UpdateFetchCountForIndex`. Then, when aborting, we pass the `index` and use the stored `Vec` to remove fetches.
(This needs to be investigated, though.)

It's ok to kill all of the sub-fetches until we figure this out.
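The index-plus-`Vec<QueryHash>` idea could look roughly like this sketch (hypothetical names, not the router's actual types; `QueryHash` is simplified to a `String`):

```rust
use std::collections::HashMap;

// QueryHash is simplified to a String for this sketch.
type QueryHash = String;

/// Hypothetical registry mapping each batch index to the hashes of its
/// subgraph fetches, so an abort can remove exactly those fetches.
#[derive(Default)]
struct FetchRegistry {
    fetches_by_index: HashMap<usize, Vec<QueryHash>>,
}

impl FetchRegistry {
    /// Record the fetches discovered for one batch element during planning.
    fn update_fetches_for_index(&mut self, index: usize, hashes: Vec<QueryHash>) {
        self.fetches_by_index.insert(index, hashes);
    }

    /// Abort one batch element: returns the hashes whose fetches should be
    /// removed from the batch, leaving the other elements untouched.
    fn abort(&mut self, index: usize) -> Vec<QueryHash> {
        self.fetches_by_index.remove(&index).unwrap_or_default()
    }
}
```

This keeps aborts surgical: only the fetches recorded for the aborted index are dropped, rather than killing all sub-fetches.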

```rust
/// batch lifecycle.
pub(crate) fn spawn_handler(size: usize) -> Self {
// Create the message channel pair for sending update events to the spawned task
// TODO: Should the upper limit here be configurable?
```

I think that would be overkill. There are many places in the router where we set arbitrarily high limits on resources which would be hard for an end-user to figure out. Realistically, message processing should be very quick and we shouldn't need capacity for more than, say, 20 in-flight messages, so 100 should be plenty for all realistic scenarios.
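A hard-coded bound along those lines might look like this sketch (std `sync_channel` standing in for the router's bounded async channel; the constant and function names are illustrative):

```rust
use std::sync::mpsc;

/// Illustrative constant: generously above the ~20 messages expected in
/// practice, so it never needs to be user-configurable.
const BATCH_CHANNEL_CAPACITY: usize = 100;

/// Create the bounded message channel for a spawned batch handler.
fn make_batch_channel() -> (mpsc::SyncSender<u32>, mpsc::Receiver<u32>) {
    // sync_channel blocks senders only once the fixed capacity is exhausted.
    mpsc::sync_channel(BATCH_CHANNEL_CAPACITY)
}
```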

```rust
// We need to keep track of how many sub fetches are needed per query,
// but that information won't be known until each portion of the entire
// batch passes query planning.
// TODO: Is there a more efficient data structure for keeping track of all of these counts?
```

We may need to enhance this at some point with QueryHash details as well.

```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```

Seems like a reasonable cause of a panic. That really would be a programming error.

```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```

Also seems like a good reason to panic.

```rust
response_sender,
} => {
// If we got a message that a subfetch from an aborted request has arrived, just short it out now
// TODO: How do we handle the fetch not being present?
```

This feels like it should be a panic, because that would be a serious programming logic error.

```rust
// TODO: How should we handle send failure here?
response_sender
.send(Err(Box::new(FetchError::SubrequestBatchingError {
// TODO: How should we get this? The field subgraph_name seems wrong
```

It's not wrong.

```rust
value.push(Waiter::new(request, body, context, tx));
rx
/// Create a batch query for a specific index in this batch
// TODO: Do we want to panic / error if the index is out of range?
```

I think we should panic.

```rust
shared: Arc<Mutex<Batch>>,

/// A channel sender for sending updates to the entire batch
sender: mpsc::Sender<BatchHandlerMessage>,
```

If my idea about dropping this once the `BatchQuery` is done talking to the entire batch is a good one, this will need to be an `Option`.

Geal and others added 3 commits March 20, 2024 11:31
In a Redis cluster, entities can be stored on different nodes, which makes MGET fail. This splits the MGET query into multiple MGET calls grouped by key hash, to make sure each one goes to the corresponding node, then merges the responses back in the correct order.
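The split-and-merge described in this commit can be sketched as follows (illustrative names, not the router's code: `node_for` stands in for Redis cluster slot hashing, and `fetch` for issuing one MGET against a single node):

```rust
use std::collections::HashMap;

/// Split one logical MGET into per-node calls and merge results in order.
fn clustered_mget<F>(keys: &[&str], node_for: fn(&str) -> usize, fetch: F) -> Vec<String>
where
    F: Fn(usize, &[&str]) -> Vec<String>,
{
    // Group keys per node, remembering each key's original position.
    let mut groups: HashMap<usize, (Vec<usize>, Vec<&str>)> = HashMap::new();
    for (pos, key) in keys.iter().copied().enumerate() {
        let entry = groups.entry(node_for(key)).or_default();
        entry.0.push(pos);
        entry.1.push(key);
    }
    // Issue one MGET per node, then scatter values back into key order.
    let mut out = vec![String::new(); keys.len()];
    for (node, (positions, node_keys)) in groups {
        let values = fetch(node, &node_keys);
        for (pos, value) in positions.into_iter().zip(values) {
            out[pos] = value;
        }
    }
    out
}
```

Recording each key's original index before grouping is what lets the merged result keep MGET's positional contract even though per-node calls complete in arbitrary group order.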
…#4802)

Add the ability to enable heartbeat for cases where the subgraph drops
idle connections.
For example, https://netflix.github.io/dgs/

Example of configuration:

```yaml
subscription:
  mode:
    passthrough:
      all:
        path: /graphql
        heartbeat_interval: enable # Optional
```

Fixes #4621


**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [ ] Integration Tests
    - [x] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

---------

Co-authored-by: Jesse Rosenberger <[email protected]>
@nicholascioli
Contributor Author

I've made it compile again. It also still passes the `assemble_batch` unit test, which is nice.

I still need to address your comments and fix some hacks, but you should be able to run the router now using the new batching messaging.

abernix and others added 6 commits March 20, 2024 23:44
Each integration tests takes time to link, so this will help with build
times.
It also makes things a bit less chaotic as we can have modules for
integration tests.



Co-authored-by: bryn <[email protected]>
Basically working, but lots of hackery...
abernix and others added 2 commits March 22, 2024 12:54
Follow-up to the v1.43.0 being officially released, bringing version
bumps and changelog updates into the `dev` branch.
This commit makes each batch request process concurrently and independently of the others, allowing portions of a batch to still succeed even if other portions do not.
@nicholascioli nicholascioli marked this pull request as ready for review March 22, 2024 22:25
BrynCooke and others added 3 commits March 25, 2024 10:14
To enable correlation between traces and logs, trace_id and span_id should be included in the log messages.

Json:
```
{"timestamp":"2024-03-19T15:37:41.516453239Z","level":"INFO","trace_id":"54ac7e5f0e8ab90ae67b822e95ffcbb8","span_id":"9b3f88c602de0ceb","message":"Supergraph GraphQL response".....
```

Text:
```
2024-03-19T15:14:46.040435Z INFO trace_id: bbafc3f048b6137375dd78c10df18f50 span_id: 40ede28c5df1b5cc router{
```

Configuration options have been added in logging for the text and Json formats, defaulting to true for Json and false for text.

Json:
```yaml
exporters:
  logging:
    stdout:
      format:
        json:
          display_span_id: true
          display_trace_id: true
```

Text:
```yaml
exporters:
  logging:
    stdout:
      format:
        text:
          display_span_id: true
          display_trace_id: true
```



---------

Co-authored-by: bryn <[email protected]>
Co-authored-by: Edward Huang <[email protected]>
So that all unit tests pass, including `it_will_process_a_non_batched_defered_query`.

There's probably more follow-up work required here, but this is "good
enough for now".
@nicholascioli nicholascioli requested a review from a team as a code owner March 25, 2024 12:26
@garypen garypen merged commit 7612e34 into garypen/2002-subgraph-batching Mar 25, 2024
10 of 12 checks passed
@garypen garypen deleted the nc/subgraph-batching/messaging branch March 25, 2024 12:57