High memory consumption #1065

Closed · yanns opened this issue on May 16, 2022 · 14 comments

@yanns commented May 16, 2022

Describe the bug
The federation router's memory consumption suddenly spikes.

To Reproduce
We can reproduce this issue in two different environments.

  • On a Kubernetes cluster, we set a memory limit.
    After startup, memory consumption is stable for a few minutes.
    Then it suddenly climbs, and Kubernetes kills the pod and restarts it.
    Example of memory usage of one pod:
    (screenshot from 2022-05-16 at 17:23)
    We cannot really watch the memory climbing; it happens very fast.

  • Locally, on a Mac, I can reproduce this behavior with some tests. When the router starts, it consumes about 250 MiB of memory. After some tests, memory usage climbs sharply. Here is one sample:
    Sample of federation-router.txt
    I don't know if that helps. I can try to use http://www.cmyr.net/blog/rust-profiling.html to provide more data.

Expected behavior
Memory consumption may increase, but it should remain stable.

Additional context
I've tried different versions, and it makes no difference. We are currently using v0.9.0.

yanns added the triage label on May 16, 2022
@yanns commented May 16, 2022

For info, I can check out any git revision of the router if you want me to try out some changes.

@garypen commented May 16, 2022

Hi @yanns. Thanks for filing the issue. Could I request some additional information:

  • what are you setting the memory limit to? Ideally, I'd like to know: requests.memory and limits.memory
  • approximately how many clients are driving the failing instance? 10s, 100s, 1000s?
  • are there any ERROR or WARN messages in the router logs?

@yanns commented May 16, 2022

> what are you setting the memory limit to? Ideally, I'd like to know: `requests.memory` and `limits.memory`

I'm using requests.memory = limits.memory = "400Mi".
I've already tried giving it several GiB of memory, and it does not change anything.
Once the router starts consuming more memory, its consumption grows continuously, with no upper bound that I could find.
Locally I have to kill the process, which keeps using more and more memory.
It's as if a recursive data structure had started consuming itself without any bound.

> approx how many clients are driving the failing instance? 10s, 100s, 1000s?

10s

> are there any ERROR or WARN messages in the router logs?

Only WARN messages from apollo_router_core::plugins::csrf: request is preflighted (and I've noticed that this was recently changed to trace).

@yanns commented May 16, 2022

I can reproduce this issue locally with certain mutations. A single GraphQL request containing such a mutation is enough to reproduce the behavior.
The federation router never answers; it consumes more and more memory, and the client eventually times out.

The mutation contains a field value of type String that itself contains an escaped string, like value: "\"aValue\"".
If I use value: "abc", there is no issue. I could reproduce this with different String fields.

garypen self-assigned this on May 17, 2022
garypen added the bug label and removed the triage label on May 17, 2022
@yanns commented May 17, 2022

I've added some println! calls in fn subgraph_service to trace the SubgraphResponse.
In normal cases, I can see those SubgraphResponses being printed.
With the "\"abc\"" string, I never see any response from the subgraph.
The issue might be in the parsing of the GraphQL mutation.
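
For readers who want to see the shape of that tracing, here is a minimal, self-contained sketch of the wrap-and-print idea. It is not the router's actual subgraph_service hook: it uses plain tower/tokio with String stand-ins instead of the real SubgraphRequest/SubgraphResponse types, and the dependencies and versions are assumptions.

// Cargo.toml (assumed): tower = { version = "0.4", features = ["util"] }, tokio = { version = "1", features = ["full"] }
use tower::{service_fn, ServiceExt};

#[tokio::main]
async fn main() {
    // Stand-in for the real subgraph service: takes a request string, returns a response string.
    let subgraph = service_fn(|req: String| async move {
        Ok::<_, std::convert::Infallible>(format!("data for: {req}"))
    });

    // The equivalent of adding a println! inside subgraph_service:
    // wrap the service so every response is printed before it is returned.
    let traced = subgraph.map_response(|resp: String| {
        println!("SubgraphResponse: {resp:?}");
        resp
    });

    // Drive a single request through the traced service.
    let reply = traced
        .oneshot("mutation { createStore }".to_string())
        .await
        .expect("the stand-in service never fails");
    println!("router got: {reply}");
}

If no "SubgraphResponse: …" line ever appears for a given request, the request never made it through to the subgraph call, which is exactly the symptom described above.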

@yanns commented May 17, 2022

I've tried to reproduce this case by changing an existing integration test, but no luck so far.

@Geal commented May 17, 2022

Is a request actually sent to the subgraph (as seen from the network or from the subgraph's point of view), or does it stop at the router?

@garypen commented May 17, 2022

It's almost certainly an issue in the parser. I'm just trying to figure out a good way to reproduce it in our test setup so that I can confirm this.

@yanns commented May 17, 2022

@Geal The subgraph does not receive any request. It seems to stop at the router.

@garypen commented May 17, 2022

That fits with parser issues. Can you give us an example of the mutation you found which is failing (as long as it's not confidential/private)?

@yanns commented May 17, 2022

The mutation is very simple:

          mutation {
            createStore(draft: {
              name: [{ locale: "en", value: "\"my store\"" }]
            }) {
              name(locale: "en")
            }
          }

I find it strange that I cannot reproduce it with the integration tests.

@garypen commented May 17, 2022

I can't reproduce it in the router, but I did the following:

  1. Created a new Rust project, my-parser, which uses apollo-rs to parse the provided mutation.
  2. Ran my-parser and watched my memory disappear (I let it get to 13 GB before terminating it).

Sample program:

// Assumes a Cargo.toml dependency on the apollo-parser crate (from the apollo-rs repository).
use apollo_parser::Parser;

fn main() {
    let input = r#"mutation {
            createStore(draft: {
              name: [{ locale: "en", value: "\"my store\"" }]
            }) {
              name(locale: "en")
            }
          }"#;
    let parser = Parser::new(input);
    let _ast = parser.parse(); // <- This never returns; it runs until memory is exhausted.
}

I'll debug my test program and figure out what is going on.

@garypen commented May 17, 2022

@yanns I've filed an issue against apollo-parser. Once we have a fix available, we'll release an update to the router.

BrynCooke pushed commits referencing this issue on May 17, 2022.
BrynCooke added a commit that referenced this issue on May 17, 2022: "…en `"` is used in a parameter value." (#1078)
Co-authored-by: bryn <[email protected]>
@yanns commented May 18, 2022

Thanks for the quick release.
I've tested with https://github.com/apollographql/router/releases/tag/v0.9.1, and the issue is now fixed. ❤️

yanns closed this as completed on May 18, 2022