Cleanup batching to handle aborted requests #4825
Conversation
Clippy is complaining Co-authored-by: bryn <[email protected]>
…cense enforcement test
Co-authored-by: Bryn Cooke <[email protected]>
…ol config is explicitly provided
Created scaffolding for new demand control plugin with a simple config. The max cost setting is planned to come from the limits block of the configuration, so it is not included in this change. The plugin is currently not registered, so it does not appear in the config schema during development.

Fixes [ROUTER-181](https://apollographql.atlassian.net/browse/ROUTER-181)

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
  - [ ] Unit Tests
  - [ ] Integration Tests
  - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.
CI performance tests
The current state of this PR is just to test whether we agree on how the batching itself should work outside of the rest of the router. More commits will be added later to integrate these changes into the rest of the router. I've left TODOs in spots that I think warrant discussion, and lengthy comments everywhere else.
I've tried to address as many of the TODO comments as I could.
It seems like the main complication is "how do we identify exactly which parts of the batch are dead" when we kill part of the batch. I believe the way to do this is to replace storing subgraph fetch counts with the collection of `FetchNode` operations (or some memoisation of them), and then simply use the index to remove that collection of operations from the fetches we would perform in our batches.
I might spend some time investigating that today.
More comments:
I think that the `BatchQuery` structs need to drop their sender once they are either `Abort`ed or `Ready`. That will make the processing logic a bit simpler in the spawned handler, because it can check for batch completion when it terminates the loop of message receives. If the batch isn't ready at that point, we have decisions to make. Could we just decide to go ahead with what we have and log warnings for the missing components? Or should we panic with the missing components? Either way, I do think this is a good way to check for batch completion in the face of badly behaved client-side participation.
```rust
self.shared.lock().finished()

/// Inform the batch of the amount of fetches needed for this element of the batch query
pub(crate) async fn set_subgraph_fetches(&self, fetches: usize) {
    // TODO: How should we handle the sender dying?
```
If `send` fails, either:

- We are out of channel capacity (very unlikely if we configure that to be large enough)
- The receiver has gone away

Assuming we can ignore the first cause (for all practical purposes), the second cause indicates that batch processing is just not going to work, so let's return an error and let the invoker figure that out.

So: in either case we need to return a `Result` from this function and deal with the consequences in the invoker.
apollo-router/src/batching.rs
```rust
guard.finished = true;
std::mem::take(&mut guard.waiters)
}
// TODO: How should we handle the sender dying?
```
See above.
apollo-router/src/batching.rs
```rust
}

enum BatchHandlerMessage {
    /// Abort one of the sub requests
    // TODO: How do we know which of the subfetches of the entire query to abort? Is it all of them?
```
This is the part of the problem I was poking around with yesterday, trying to figure out if there's a unique way to identify subgraph fetches with input operations.

I do think this is solvable with a combination of `index` and `Vec<QueryHash>`, since we could store both of those details when we call `UpdateFetchCountForIndex`. Then, when aborting, we pass the `index` and use the stored `Vec` to remove fetches.

(This needs to be investigated, though.)

It's OK to kill all of the sub-fetches until we figure this out.
apollo-router/src/batching.rs
```rust
/// batch lifecycle.
pub(crate) fn spawn_handler(size: usize) -> Self {
    // Create the message channel pair for sending update events to the spawned task
    // TODO: Should the upper limit here be configurable?
```
I think that would be overkill. There are many places in the router where we set arbitrarily high figures on resources which would be hard for an end user to figure out. Realistically, message processing should be very quick, and we shouldn't need more than, say, 20 slots in the channel, so 100 should be plenty for all realistic scenarios.
apollo-router/src/batching.rs
```rust
// We need to keep track of how many sub fetches are needed per query,
// but that information won't be known until each portion of the entire
// batch passes query planning.
// TODO: Is there a more efficient data structure for keeping track of all of these counts?
```
We may need to enhance this at some point with `QueryHash` details as well.
```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```
Seems like a reasonable cause for a panic. That really would be a programming error.
```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```
Also seems like a good reason to panic.
apollo-router/src/batching.rs
```rust
response_sender,
} => {
    // If we got a message that a subfetch from an aborted request has arrived, just short it out now
    // TODO: How do we handle the fetch not being present?
```
This feels like it should be a panic, because that would be a serious programming logic error.
apollo-router/src/batching.rs
```rust
// TODO: How should we handle send failure here?
response_sender
    .send(Err(Box::new(FetchError::SubrequestBatchingError {
        // TODO: How should we get this? The field subgraph_name seems wrong
```
It's not wrong.
```rust
value.push(Waiter::new(request, body, context, tx));
rx

/// Create a batch query for a specific index in this batch
// TODO: Do we want to panic / error if the index is out of range?
```
I think we should panic.
apollo-router/src/batching.rs
```rust
shared: Arc<Mutex<Batch>>,

/// A channel sender for sending updates to the entire batch
sender: mpsc::Sender<BatchHandlerMessage>,
```
If my idea about dropping this once the `BatchQuery` is done talking to the entire batch is a good one, this will need to be an `Option`.
In a Redis cluster, entities can be stored on different nodes, which makes MGET fail. This splits the MGET query into multiple MGET calls grouped by key hash, to make sure each one goes to the corresponding node, then merges the responses in the correct order.
…#4802) Add the ability to enable heartbeat for cases where the subgraph drops idle connections, for example https://netflix.github.io/dgs/.

Example of configuration:

```yaml
subscription:
  mode:
    passthrough:
      all:
        path: /graphql
        heartbeat_interval: enable # Optional
```

Fixes #4621

Co-authored-by: Jesse Rosenberger <[email protected]>
I've made it compile again. It also still passes the license enforcement test. I still need to address your comments and fix some hacks, but you should be able to run the router now using the new batching messaging.
Each integration test takes time to link, so this will help with build times. It also makes things a bit less chaotic, as we can have modules for integration tests.

Co-authored-by: bryn <[email protected]>
Basically working, but lots of hackery...
Follow-up to the v1.43.0 being officially released, bringing version bumps and changelog updates into the `dev` branch.
This commit makes each batch request process concurrently and independently of each other, allowing portions of a batch to still succeed even if other portions of the batch do not.
To enable correlation between traces and logs, trace_id and span_id should be included on the log messages.

JSON:

```
{"timestamp":"2024-03-19T15:37:41.516453239Z","level":"INFO","trace_id":"54ac7e5f0e8ab90ae67b822e95ffcbb8","span_id":"9b3f88c602de0ceb","message":"Supergraph GraphQL response".....
```

Text:

```
2024-03-19T15:14:46.040435Z INFO trace_id: bbafc3f048b6137375dd78c10df18f50 span_id: 40ede28c5df1b5cc router{
```

Configuration options have been added in logging for text and JSON formats, defaulting to true for JSON and false for text.

JSON:

```yaml
exporters:
  logging:
    stdout:
      format:
        json:
          display_span_id: true
          display_trace_id: true
```

Text:

```yaml
exporters:
  logging:
    stdout:
      format:
        text:
          display_span_id: true
          display_trace_id: true
```

Co-authored-by: bryn <[email protected]>
Co-authored-by: Edward Huang <[email protected]>
So that all unit tests pass, including `it_will_process_a_non_batched_defered_query`. There's probably more follow-up work required here, but this is "good enough for now".
7612e34 into garypen/2002-subgraph-batching
This PR cleans up the batching interface to allow for handling cancelled requests or errors during batch construction.