Cleanup batching to handle aborted requests #4825

Merged
merged 27 commits into garypen/2002-subgraph-batching from nc/subgraph-batching/messaging on Mar 25, 2024

Conversation

nicholascioli
Contributor

This PR cleans up the batching interface to allow for handling cancelled requests or errors during batch construction.


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible¹
  • Documentation² completed
  • Performance impact assessed and acceptable
  • Tests added and passing³
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

abernix and others added 13 commits March 15, 2024 13:31
Clippy is complaining

Co-authored-by: bryn <[email protected]>
Created scaffolding for new demand control plugin with a simple config.
The max cost setting is planned to come from the limits block of the
configuration, so it is not included in this change. The plugin is
currently not registered so it does not appear in the config schema
during development

Fixes
[ROUTER-181](https://apollographql.atlassian.net/browse/ROUTER-181)

[ROUTER-181]:
https://apollographql.atlassian.net/browse/ROUTER-181?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
@nicholascioli nicholascioli requested a review from garypen March 19, 2024 21:45
@nicholascioli nicholascioli self-assigned this Mar 19, 2024
@router-perf

router-perf bot commented Mar 19, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

@nicholascioli nicholascioli changed the base branch from dev to garypen/2002-subgraph-batching March 19, 2024 21:46
@nicholascioli
Contributor Author

The current state of this PR is just to test whether we agree on how the batching itself should work, in isolation from the rest of the router. More commits will be added later to integrate these changes into the rest of the router.

I've left TODOs in spots that I think warrant discussion and lengthy comments everywhere else.

@garypen garypen left a comment


I've tried to address as many of the TODO comments as I could.

It seems like the main complication is "how do we identify exactly which parts of the batch are dead" when we kill part of the batch. I believe the way to do this is to replace storing subgraph fetch counts with the collection of `FetchNode` operations (or some memoization of them) and then simply use the index to remove that collection of operations from the fetches we would perform in our batches.

I might spend some time investigating that today.

More comments:

I think that the `BatchQuery` structs need to drop their sender once they either Abort or signal Ready. That will make the processing logic a bit simpler in the spawned handler, because it can check for batch completion when it terminates its message-receive loop. If the batch isn't ready at that point, we have decisions to make. Could we just go ahead with what we have and log warnings for the missing components? Or should we panic with the missing components? Either way, I do think this is a good way to check for batch completion in the face of badly behaved client-side participation.
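The drop-the-sender idea can be sketched with a plain std channel (the router itself uses an async tokio channel; all names here are illustrative, not the router's actual API): once every participant has dropped its sender, the receive loop ends on its own, and the handler can then check whether the batch completed.

```rust
use std::sync::mpsc;

/// Messages a batch participant can send before dropping its sender.
enum BatchMsg {
    Ready(usize),
    Abort(usize),
}

/// Drain messages until every sender is dropped, then check completion.
/// Returns the indices that signalled Ready and whether the batch is complete.
fn drain_batch(rx: mpsc::Receiver<BatchMsg>, size: usize) -> (Vec<usize>, bool) {
    let mut ready = Vec::new();
    // recv() returns Err once all senders are dropped, which ends the loop
    // without needing an explicit shutdown message.
    while let Ok(msg) = rx.recv() {
        match msg {
            BatchMsg::Ready(index) => ready.push(index),
            // An aborted participant never becomes ready; just ignore it here.
            BatchMsg::Abort(_) => {}
        }
    }
    // If the batch isn't complete at this point, the caller can decide
    // whether to proceed with warnings or to fail outright.
    let complete = ready.len() == size;
    (ready, complete)
}
```

The design choice is that channel closure itself becomes the completion signal, so no sentinel message is needed.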

```rust
self.shared.lock().finished()
/// Inform the batch of the amount of fetches needed for this element of the batch query
pub(crate) async fn set_subgraph_fetches(&self, fetches: usize) {
// TODO: How should we handle the sender dying?
```

If send fails, either:

  • We are out of channel capacity (very unlikely if we configure that to be large enough)
  • The receiver has gone away

Assuming we can ignore the first cause (for all practical purposes), the second cause indicates that batch processing is just not going to work, so let's return an error and let the invoker figure that out.

So: For either case we need to return a Result from this function and deal with the consequences in the invoker.
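A minimal sketch of the suggested Result-returning shape, using a std channel in place of the router's async one (the error type and message here are stand-ins for illustration):

```rust
use std::sync::mpsc;

/// Sketch: propagate send failure to the invoker instead of ignoring it.
/// With a std channel, send() errs exactly when the receiver has gone away,
/// which mirrors the "batch handler died" case discussed above.
fn set_subgraph_fetches(sender: &mpsc::Sender<usize>, fetches: usize) -> Result<(), String> {
    sender
        .send(fetches)
        // The receiver going away means batch processing cannot proceed,
        // so surface that to the invoker rather than swallowing it.
        .map_err(|_| "batch handler has gone away; cannot set fetch count".to_string())
}
```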

```rust
guard.finished = true;
std::mem::take(&mut guard.waiters)
}
// TODO: How should we handle the sender dying?
```

See above.

```rust
}
enum BatchHandlerMessage {
/// Abort one of the sub requests
// TODO: How do we know which of the subfetches of the entire query to abort? Is it all of them?
```

This is the part of the problem I was poking around with yesterday, trying to figure out if there's a unique way to identify subgraph fetches with input operations.
I do think this is solvable with a combination of `index` and `Vec<QueryHash>`, since we could store both of those details when we call `UpdateFetchCountForIndex`. Then, when aborting, we pass the `index` and use the stored `Vec` to remove fetches.
(This needs to be investigated, though.)

It's ok to kill all of the sub-fetches until we figure this out.
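The index-plus-`Vec<QueryHash>` idea could look roughly like this sketch (hypothetical names, not the router's actual types; `QueryHash` is simplified to a `String`):

```rust
use std::collections::HashMap;

// QueryHash is simplified to a String for this sketch.
type QueryHash = String;

/// Hypothetical registry mapping each batch index to the hashes of its
/// subgraph fetches, so an abort can remove exactly those fetches.
#[derive(Default)]
struct FetchRegistry {
    fetches_by_index: HashMap<usize, Vec<QueryHash>>,
}

impl FetchRegistry {
    /// Record the fetches discovered for one batch element during planning.
    fn update_fetches_for_index(&mut self, index: usize, hashes: Vec<QueryHash>) {
        self.fetches_by_index.insert(index, hashes);
    }

    /// Abort one batch element: returns the hashes whose fetches should be
    /// removed from the batch, leaving the other elements untouched.
    fn abort(&mut self, index: usize) -> Vec<QueryHash> {
        self.fetches_by_index.remove(&index).unwrap_or_default()
    }
}
```

This keeps aborts surgical: only the fetches recorded for the aborted index are dropped, rather than killing all sub-fetches.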

```rust
/// batch lifecycle.
pub(crate) fn spawn_handler(size: usize) -> Self {
// Create the message channel pair for sending update events to the spawned task
// TODO: Should the upper limit here be configurable?
```

I think that would be overkill. There are many places in the router where we set arbitrarily high limits on resources which would be hard for an end-user to figure out. Realistically, message processing should be very quick and we shouldn't need capacity for more than, say, 20 in-flight messages, so 100 should be plenty for all realistic scenarios.
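A hard-coded bound along those lines might look like this sketch (std `sync_channel` standing in for the router's bounded async channel; the constant and function names are illustrative):

```rust
use std::sync::mpsc;

/// Illustrative constant: generously above the ~20 messages expected in
/// practice, so it never needs to be user-configurable.
const BATCH_CHANNEL_CAPACITY: usize = 100;

/// Create the bounded message channel for a spawned batch handler.
fn make_batch_channel() -> (mpsc::SyncSender<u32>, mpsc::Receiver<u32>) {
    // sync_channel blocks senders only once the fixed capacity is exhausted.
    mpsc::sync_channel(BATCH_CHANNEL_CAPACITY)
}
```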

```rust
// We need to keep track of how many sub fetches are needed per query,
// but that information won't be known until each portion of the entire
// batch passes query planning.
// TODO: Is there a more efficient data structure for keeping track of all of these counts?
```

We may need to enhance this at some point with QueryHash details as well.

```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```

Seems like a reasonable cause of a panic. That really would be a programming error.

```rust
});
}

// TODO: Do we want to handle if a query is outside of the range of the batch size?
```

Also seems like a good reason to panic.

```rust
response_sender,
} => {
// If we got a message that a subfetch from an aborted request has arrived, just short it out now
// TODO: How do we handle the fetch not being present?
```

This feels like it should be a panic, because that would be a serious programming logic error.

```rust
// TODO: How should we handle send failure here?
response_sender
.send(Err(Box::new(FetchError::SubrequestBatchingError {
// TODO: How should we get this? The field subgraph_name seems wrong
```

It's not wrong.

```rust
value.push(Waiter::new(request, body, context, tx));
rx
/// Create a batch query for a specific index in this batch
// TODO: Do we want to panic / error if the index is out of range?
```

I think we should panic.

```rust
shared: Arc<Mutex<Batch>>,

/// A channel sender for sending updates to the entire batch
sender: mpsc::Sender<BatchHandlerMessage>,
```

If my idea about dropping this once the `BatchQuery` is done talking to the entire batch is a good one, this will need to be an `Option`.

Geal and others added 3 commits March 20, 2024 11:31
In a Redis cluster, entities can be stored on different nodes, which makes MGET fail. This splits the MGET query into multiple MGET calls grouped by key hash, to make sure each one goes to the corresponding node, then merges the responses back in the correct order.
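The split-and-merge described in this commit can be sketched as follows (illustrative names, not the router's code: `node_for` stands in for Redis cluster slot hashing, and `fetch` for issuing one MGET against a single node):

```rust
use std::collections::HashMap;

/// Split one logical MGET into per-node calls and merge results in order.
fn clustered_mget<F>(keys: &[&str], node_for: fn(&str) -> usize, fetch: F) -> Vec<String>
where
    F: Fn(usize, &[&str]) -> Vec<String>,
{
    // Group keys per node, remembering each key's original position.
    let mut groups: HashMap<usize, (Vec<usize>, Vec<&str>)> = HashMap::new();
    for (pos, key) in keys.iter().copied().enumerate() {
        let entry = groups.entry(node_for(key)).or_default();
        entry.0.push(pos);
        entry.1.push(key);
    }
    // Issue one MGET per node, then scatter values back into key order.
    let mut out = vec![String::new(); keys.len()];
    for (node, (positions, node_keys)) in groups {
        let values = fetch(node, &node_keys);
        for (pos, value) in positions.into_iter().zip(values) {
            out[pos] = value;
        }
    }
    out
}
```

Recording each key's original index before grouping is what lets the merged result keep MGET's positional contract even though per-node calls complete in arbitrary group order.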
…#4802)

Add the ability to enable heartbeat for cases where the subgraph drops
idle connections.
For example, https://netflix.github.io/dgs/

Example of configuration:

```yaml
subscription:
  mode:
    passthrough:
      all:
        path: /graphql
        heartbeat_interval: enable # Optional
```

Fixes #4621


**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

- [x] Changes are compatible[^1]
- [x] Documentation[^2] completed
- [x] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [x] Unit Tests
    - [ ] Integration Tests
    - [x] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

---------

Co-authored-by: Jesse Rosenberger <[email protected]>
@nicholascioli
Contributor Author

I've made it compile again. It also still passes the `assemble_batch` unit test, which is nice.

I still need to address your comments and fix some hacks, but you should be able to run the router now using the new batching messaging.

abernix and others added 6 commits March 20, 2024 23:44
Each integration tests takes time to link, so this will help with build
times.
It also makes things a bit less chaotic as we can have modules for
integration tests.



Co-authored-by: bryn <[email protected]>
Basically working, but lots of hackery...
abernix and others added 2 commits March 22, 2024 12:54
Follow-up to the v1.43.0 being officially released, bringing version
bumps and changelog updates into the `dev` branch.
This commit makes each batch request process concurrently and independently of the others, allowing portions of a batch to still succeed even if other portions do not.
@nicholascioli nicholascioli marked this pull request as ready for review March 22, 2024 22:25
BrynCooke and others added 3 commits March 25, 2024 10:14
To enable correlation between traces and logs, trace_id and span_id should be included in the log messages.

Json:
```
{"timestamp":"2024-03-19T15:37:41.516453239Z","level":"INFO","trace_id":"54ac7e5f0e8ab90ae67b822e95ffcbb8","span_id":"9b3f88c602de0ceb","message":"Supergraph GraphQL response".....
```

Text:
```
2024-03-19T15:14:46.040435Z INFO trace_id: bbafc3f048b6137375dd78c10df18f50 span_id: 40ede28c5df1b5cc router{
```

Configuration options have been added in logging for the text and Json formats, defaulting to true for Json and false for text.

Json:
```yaml
exporters:
  logging:
    stdout:
      format:
        json:
          display_span_id: true
          display_trace_id: true
```

Text:
```yaml
exporters:
  logging:
    stdout:
      format:
        text:
          display_span_id: true
          display_trace_id: true
```



---------

Co-authored-by: bryn <[email protected]>
Co-authored-by: Edward Huang <[email protected]>
So that all unit tests pass, including `it_will_process_a_non_batched_defered_query`.

There's probably more follow-up work required here, but this is "good
enough for now".
@nicholascioli nicholascioli requested a review from a team as a code owner March 25, 2024 12:26
@garypen garypen merged commit 7612e34 into garypen/2002-subgraph-batching Mar 25, 2024
10 of 12 checks passed
@garypen garypen deleted the nc/subgraph-batching/messaging branch March 25, 2024 12:57