
@grpc/grpc-js has memory leak issue with latest version #2448

Closed
alanzhao opened this issue May 19, 2023 · 24 comments

Comments

@alanzhao

alanzhao commented May 19, 2023

Problem description

We use the Spanner Node.js library for our API, and the latest grpc-js, 1.8.14, has a memory leak; 1.8.13 has the same issue.

Reproduction steps

We use the Spanner Node.js library for our API. We recently upgraded to Node 18, so in package-lock.json grpc-js was upgraded to 1.8.14. Before the upgrade we did not have a memory leak; after the upgrade we see a slow leak that manifests over 4-7 days, and our pods restart due to low memory.

Our stopgap solution is to downgrade @grpc/grpc-js to 1.6.2, which seems to fix the issue.
Thanks for looking into this; much appreciated.
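
For reference, the pin of the transitive @grpc/grpc-js can be expressed in package.json with npm's overrides field (a sketch only; this assumes npm 8.3 or newer, and yarn users would use resolutions instead):

{
  "overrides": {
    "@grpc/grpc-js": "1.6.2"
  }
}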

Environment

Additional context

We have about 100 pods in prod, and all of them exhibit the symptom after the library gets updated.

@murgatroid99
Member

This isn't really enough information to find the problem. Can you provide code that demonstrates this memory leak, preferably without needing to access an external service? Can you provide heap dumps from at least two different points in a process's lifetime that show the memory growth?
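
For example, a sketch like this (filenames are just illustrative) produces comparable dumps using Node's built-in v8 module, with no external service involved:

// dump-heap.js - sketch for capturing comparable heap snapshots over time
const v8 = require('v8');

function dumpHeap(label) {
  // Force a GC first so the snapshot reflects retained memory, not garbage.
  // Requires running node with --expose-gc.
  if (global.gc) global.gc();
  const file = v8.writeHeapSnapshot(`heap-${label}-${Date.now()}.heapsnapshot`);
  console.log(`wrote ${file}, heapUsed=${process.memoryUsage().heapUsed} bytes`);
}

module.exports = { dumpHeap };

// Usage: call dumpHeap('early') shortly after startup and dumpHeap('late')
// hours later, then diff the two files in Chrome DevTools.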

Please also check whether you are creating multiple separate client objects. If you are, either change your code to not do that or make sure you are closing them when you are done with them. See #2436, #2402, and #2003 for previous examples of this.

@alanzhao
Author

alanzhao commented May 19, 2023

async function getValue({ userId }) {
  const sql = `
    SELECT *
    FROM ${table}
    WHERE userId = @userId
  `;

  const query = {
    sql,
    params: { userId },
    json: true,
  };

  const [[total]] = await clients.spanner.clientA.run(query);
  return total;
}

module.exports = getValue;

Here is a sample query that we use. We have a wrapper around Spanner with 3 clients: clients.spanner.clientA, clients.spanner.clientB, and clients.spanner.clientC.

function allClose() {
  const closePromises = [];

  if (clients.spanner) {
    if (clients.spanner.clientA) {
      closePromises.push(clients.spanner.clientA.close());
    }

    if (clients.spanner.clientB) {
      closePromises.push(clients.spanner.clientB.close());
    }

    if (clients.spanner.clientC) {
      closePromises.push(clients.spanner.clientC.close());
    }
  }

  return Promise.all(closePromises);
}

which is used in the GraphQL server's graceful shutdown here:

return startApp(app, ready, () => allClose());

The reason we have 3 Spanner clients is that they talk to 3 separate Spanner databases. Right now we do not use 2 of them; I can try removing them and upgrading grpc-js back to the latest version to see if that fixes the issue. Thanks for the tips.

@murgatroid99
Member

Just three clients shouldn't be an issue. The problem I was talking about comes from creating a large number of clients.

@weyert

weyert commented May 30, 2023

Yes, I am also experiencing memory leaks with a library that uses this one. Newer versions of google-gax seem to cause out-of-memory crashes in Node.js v18 on GKE. Downgrading google-gax to 3.5.2 improves things; that version has this library pinned to ~1.7.0, so maybe it's related to a change since v1.8?

@murgatroid99
Member

murgatroid99 commented May 31, 2023

Do you know how quickly you were leaking memory, in terms of amount of memory leaked per request made or per period of time? An order of magnitude estimate would be enough to help narrow this down.

@murgatroid99
Member

Update: we have been unable to reproduce a memory leak with our own tests. Please provide a complete reproduction that leaks memory so that we can investigate this further.

@kruegernet

kruegernet commented Aug 2, 2023

I also have this and will try to provide a reproduction when possible/time allows.

I've spent the last few years in Go, so IoC/DI frameworks in Node are new to me, and I originally believed my problem originated there, with multiple clients being created unintentionally. I'm still open to the possibility that it's an Inversify misuse on my side, but I'm more inclined to believe the issue is with grpc-js, as @alanzhao described.

@SophiaH67

Hi,

I have also found memory leaks originating from this package (I also saw some mentions of google-gax in a few places in the memory debugger).

From what I can tell, this seems to be an issue with certain timeouts in Detached Node / nghttp2_memory entries:
[screenshot: heap snapshot showing detached nghttp2_memory entries]

With the following retainers list:
[screenshot: retainers list for the detached entries]

This snapshot uses comparison mode in DevTools, with one HTTP request made between the two snapshots; in our app that request makes multiple gRPC calls.

I personally could not create a minimal reproduction app, even with knowledge of this existing codebase, but I still felt this would be useful to share, for a few reasons:

  • Someone else who ends up at this issue will now know it might be caused by a different package that is often used together with this one.
  • My memory-leak dump might be useful, since the retainers list contains only files from the @grpc/grpc-js package.
  • To let the OP know they're not the only one having issues with it!

Thank you for your time,

Sophia~

@murgatroid99
Member

@SophiaH67 In order to clearly demonstrate a significant memory leak, your test would need to do the following:

  1. Run your test in a loop, to show that memory usage keeps growing.
  2. Ensure that garbage collection runs, to make sure this is a real leak, and not a temporary spike in memory usage. You can do this by running Node with the --expose-gc flag and calling global.gc() before taking a snapshot.

If you do this and you observe memory retention growing in proportion to the number of iterations, then please share your test code and I can investigate.
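
A harness along these lines would be convincing (a sketch only; makeOneRequest is a hypothetical placeholder for whatever client call you are measuring):

// leak-test.js - run with: node --expose-gc leak-test.js
// makeOneRequest() is a hypothetical placeholder for the gRPC call under test.
async function main() {
  // Warm-up: the first request creates reusable resources, so exclude it
  // from the measurement.
  await makeOneRequest();
  for (let i = 1; i <= 10000; i++) {
    await makeOneRequest();
    if (i % 1000 === 0) {
      global.gc(); // only available with --expose-gc
      console.log(`${i} requests, heapUsed=${process.memoryUsage().heapUsed} bytes`);
    }
  }
}

main();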

@SophiaH67

@murgatroid99

Thanks for your time! I should have prefaced this by saying it's my first time debugging memory leaks, so my apologies if I miss anything. From what I could tell, the take-snapshot button automatically runs garbage collection, but I cannot find a source for that, so I don't know where I learned it.

After rerunning my tests in our production app with manual gc() calls through DevTools before snapshotting, the leak still happens.

In a small reproduction setup, however, I could not find any problem with this library, which leads me to think it's probably something in either a different library or our own code!

@murgatroid99
Member

Have you also confirmed that the memory keeps growing with multiple successive calls? gRPC creates some resources on the first request that it reuses in later requests, so seeing more memory after the first request than before only confirms that, it doesn't show that anything is wrong.

@SophiaH67

I have finally found the issue our team was running into!

We were using nestjs's official terminus package for health checks, which created a new channel for every health check attempt.

After doing some digging, it seems that these channels cannot be garbage collected by Node.js, due to several self-references (in timeouts and other things). That by itself could also be considered a bug, but I'm not familiar enough with the garbage collector to solve it.

I did make a PR in terminus to cache the channels it opens, which I built and deployed in our app.

PR: nestjs/terminus#2329
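
The core of the patch is just to reuse a channel per target instead of constructing a new one for every health check. A rough sketch of the idea (hypothetical helper names, not the exact terminus code):

// channel-cache.js - sketch of reusing channels per target
const { Channel, credentials } = require('@grpc/grpc-js');

const channelCache = new Map();

function getChannel(target, options = {}) {
  // Reuse an existing channel for this target instead of creating a new
  // one per health check, so nothing piles up for the GC to chase.
  if (!channelCache.has(target)) {
    channelCache.set(target, new Channel(target, credentials.createInsecure(), options));
  }
  return channelCache.get(target);
}

function closeAllChannels() {
  for (const channel of channelCache.values()) {
    channel.close();
  }
  channelCache.clear();
}

module.exports = { getChannel, closeAllChannels };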

Results:
[screenshot: memory usage after deploying the patched build]

As you can see here, this has totally solved our issue, and explains why I couldn't reproduce it with a minimal setup!

Still thanks to @murgatroid99 for your time and explaining garbage collector stuff to me ❤️

@murgatroid99
Member

Calling channel.close() should make it eligible for garbage collection. It looks like terminus did not do that when they stopped using the channel, which would explain why it was retained.
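
So a per-check channel can also be fine, as long as it is closed when the check is done; a rough sketch (doCheck is a hypothetical callback that issues the actual RPC):

// Sketch: a per-check channel is fine as long as it is closed afterwards.
const { Channel, credentials } = require('@grpc/grpc-js');

async function checkOnce(target, doCheck) {
  const channel = new Channel(target, credentials.createInsecure(), {});
  try {
    // doCheck is a hypothetical callback that uses the channel for the RPC.
    return await doCheck(channel);
  } finally {
    channel.close(); // without this, the channel stays reachable and is retained
  }
}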

@hvishwas

We have seen memory leaks on our services too. The ECS tasks get abruptly killed with "8 RESOURCE_EXHAUSTED: client: xxx. Exceeded rate limit.". grpc-js version 1.9.0, Node version 18.17.1.

@murgatroid99
Member

"rate limit" generally refers to a limit on how frequently you can call an API. So, an error that says "Exceeded rate limit" generally refers to excess API call volume, not memory growth.

@hvishwas

> We have seen memory leaks on our services too. The ECS tasks get abruptly killed with "8 RESOURCE_EXHAUSTED: client: xxx. Exceeded rate limit.". grpc-js version 1.9.0, Node version 18.17.1.

It was a false alarm. We had upgraded our application from Node 14 to Node 18. The restarts were caused by internal code that was throwing an unhandled promise rejection, which terminated the application. Process termination on unhandled rejections was introduced in Node 15: https://maximorlov.com/node-js-15-is-out-what-does-it-mean-for-you/
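
For anyone hitting the same thing after a Node 14 to 18 upgrade, a process-level handler makes the culprit visible (a minimal sketch; by default Node 15+ terminates on unhandled rejections, but an attached handler lets you log the reason first):

// Sketch: surface unhandled promise rejections instead of crashing silently.
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled rejection at:', promise, 'reason:', reason);
  // Decide explicitly: fix the offending code path, or exit here on purpose.
});

// Example of the kind of code that triggers it:
Promise.reject(new Error('boom')); // no .catch() attached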

@Thibault2ss

Thibault2ss commented Oct 27, 2023

I've been seeing the same behavior, @murgatroid99. Basically my test setup is, in a while (true) loop:

  • instantiate a new gRPC client (purposefully, instead of re-using the same client)
  • make an RPC call to a health check method (and await it)
  • call client.close()

While doing that, I log process.memoryUsage().heapUsed / 1000000 (in MB) every second, and it slowly grows. Maybe some reference somewhere is preventing the garbage collector from collecting something.

If I run this test script with the Node.js option --max-old-space-size=500, when the heap used reaches ~500 MB I get an OOM error:

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

My setup:

  • node v18.18.2
  • grpc-js 1.9.7
  • running on a simple Ubuntu instance launched on AWS (arch x86_64)
  • basic reproduction script:
import { Client, Metadata, credentials } from '@grpc/grpc-js'

// some metrics I want to keep track of
const metrics = {
  frequency: 1000, // 1s
  countSinceLastTick: 0,
  countTotal: 0,
  speed: 0,
}

// this reporter will log the metrics every second
const reporter = setInterval(() => {
  process.stdout.clearLine(0)
  console.log(`processed ${metrics.countTotal}`)
  process.stdout.clearLine(0)
  console.log(`speed ${Math.round(metrics.countSinceLastTick)} calls per sec`)
  process.stdout.clearLine(0)
  console.log(`mem usage ${process.memoryUsage().heapUsed / 1000000}MB`)
  metrics.countSinceLastTick = 0
  process.stdout.moveCursor(0, -3)
}, metrics.frequency)

// this call makes a single rpc request, instanciating a new client + closing it at the end
const makeRequest = async () => {
  const sslCreds = credentials.createSsl(
    Buffer.from(process.env.ROOT_CERT, 'utf-8'),
    Buffer.from(process.env.PRIVATE_KEY, 'utf-8'),
    Buffer.from(process.env.CERT, 'utf-8')
  ) // this is just because in my case I use mTLS

  const options = {
    'grpc.keepalive_time_ms': 120000,
    'grpc.http2.min_time_between_pings_ms': 120000,
    'grpc.keepalive_timeout_ms': 20000,
    'grpc.http2.max_pings_without_data': 0,
    'grpc.keepalive_permit_without_calls': 1,
    'grpc.service_config_disable_resolution': 1,
  }

  const grpcClient = new Client(`your.host.com:8700`, sslCreds, options)

  const deserialize = <T>(argument: T) => argument
  const serialize = (argument: Uint8Array) => Buffer.from(argument)

  await new Promise((resolve, reject) => {
    grpcClient.makeUnaryRequest(
      '/your.Service/Healthcheck',
      serialize,
      deserialize,
      Uint8Array.from([]),
      new Metadata(),
      (err, res) => {
        if (err) {
          reject(err)
        } else {
          resolve(res || Uint8Array.from([]))
        }
      }
    )
  })

  grpcClient.close()
}

// main function
const main = async () => {
  while (true) {
    await makeRequest()
    metrics.countSinceLastTick++
    metrics.countTotal++
  }
}

main()

@murgatroid99
Member

@Thibault2ss I think I fixed that with #2606, according to local testing.

@murgatroid99
Member

That fix is out in version 1.9.8.

@Valeronlol

Same issue here:
grpc-js version: 1.9.9
nestjs/microservices: 9.4.3
Node version: node:20.9-alpine

@murgatroid99
Member

murgatroid99 commented Oct 31, 2023

@Valeronlol The code in #2448 (comment) does not show a memory leak in 1.9.9 in my local testing. If you are experiencing a memory leak, please share code that demonstrates it.

@murgatroid99
Member

I am closing this issue, because it has covered too many different memory leaks. If you find this issue because you are experiencing a memory leak in this library, please file a new issue.

@adiun

adiun commented Feb 29, 2024

Thanks for the information about keeping one client, @murgatroid99.

Is that documented somewhere? I tried looking for it in the grpc-js docs. A while back we were also wondering whether to keep one instance of the client or create one per request, and we landed on the latter based on some assumptions about how grpc-js works, but it seems we were incorrect. So I was interested to see whether there are docs that go into this, and I'm also curious whether there are other best practices we should be following with grpc-js.

@murgatroid99
Member

This is a recommendation for gRPC overall, as stated in our performance best practices doc.

It also seems to me that this should be the default assumption for object-oriented programming in general: you create a single object that is responsible for some domain of functionality, and then you call multiple methods on that object to invoke that functionality. And you only create multiple instances when you have some compelling reason to do so, such as needing to use different constructor arguments.
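
Concretely, the recommended shape is a single client instance shared for the life of the process; a minimal sketch using grpc-js's base Client (the target address is illustrative):

// client.js - sketch: construct one client and share it across requests
const { Client, credentials } = require('@grpc/grpc-js');

// One client per target for the life of the process (address is illustrative).
const sharedClient = new Client('example.host:443', credentials.createSsl());

function getClient() {
  return sharedClient;
}

function shutdown() {
  // Close once, during graceful shutdown, not after every request.
  sharedClient.close();
}

module.exports = { getClient, shutdown };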
