Subscriber maxes out CPU after exactly 60 minutes on GKE #890

Closed
ctavan opened this issue Feb 19, 2020 · 24 comments
Labels
api: pubsub (Issues related to the googleapis/nodejs-pubsub API.)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@ctavan
Contributor

ctavan commented Feb 19, 2020

Environment details

Steps to reproduce

  1. Start a pod that subscribes to an idle subscription
  2. Wait 60 minutes
  3. Observe node process eat up all available CPU

[Screenshot 2020-02-19 12 54 29: graph of the pod's CPU usage]

I have a pod that subscribes to a subscription which is 100% idle (i.e. no messages published on that topic) during the observation period.

Exactly 60 minutes after the pod started it eats up all available CPU until the pod is deleted manually to force a restart.

This behavior does not appear when using the C++ gRPC bindings, as suggested in #850 (comment).
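
For reference, that workaround boils down to handing the C++ bindings to the client constructor. A rough sketch, assuming the constructor accepts a grpc option for swapping in the bindings:

// Rough sketch of the workaround (not verbatim from #850): pass the C++
// bindings from the 'grpc' package into the client so they are used instead
// of the default @grpc/grpc-js. Assumes the constructor accepts a `grpc` option.
const { PubSub } = require('@google-cloud/pubsub');
const grpc = require('grpc');

const pubsub = new PubSub({ grpc });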

I have also enabled debug output through

GRPC_TRACE=all
GRPC_VERBOSITY=DEBUG
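
For completeness, a sketch of setting these from inside the Node process instead of the pod spec; this should behave the same as long as it runs before grpc-js is first loaded:

// Sketch: enable the same gRPC debug output from inside the process.
// The assignments must happen before @grpc/grpc-js is first required
// (here it is pulled in transitively via @google-cloud/pubsub).
process.env.GRPC_VERBOSITY = 'DEBUG';
process.env.GRPC_TRACE = 'all';

const { PubSub } = require('@google-cloud/pubsub');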

I've put the full debug output (as fetched from Google Cloud Logging) of the entire lifetime of the pod in this gist.

There is no further output after the last line, even though the pod kept running for a while, as can be seen from the graph above: the bump in CPU usage is the period where the pod eats up all available CPU without logging anything. (Times in the logs correspond to the times in the graph above with a 1h timezone difference, so 10:51 in the logs = 11:51 in the graph.)

I can reproduce this problem with any of my subscriber pods.

I believe it is very likely that this problem also falls into the category of #868.

product-auto-label bot added the api: pubsub label Feb 19, 2020
@hefekranz

The same is happening to us. We thought it was a bad implementation...

@codersbrew

We are also seeing the same issue. We've been having a lot of strange behavior with this library.

@feywind
Collaborator

feywind commented Feb 19, 2020

Thanks everyone, especially for that log output! I find this to be particularly suspicious:

2020-02-19T10:10:14.382Z | call_stream | [6] write() called with message of length 0

I'm working on trying to build some kind of repro to pass on to the grpc team.

We have only one Node library that's defaulting to the gRPC C++ bindings at this point (spanner), so there has been some hesitation about switching this one back to them by default, but this issue also seems to be affecting a lot of people. (Also, the C++ bindings require node-gyp, which may not be available in some environments, so adding a hard dependency on them is possibly problematic.) The current thought is that the problem may have something to do with this library's more extensive use of streaming APIs. So progress is being made there. Thanks for the patience and debugging help!

feywind added the priority: p2 and type: bug labels Feb 19, 2020
@crivera

crivera commented Feb 19, 2020

We are seeing a similar issue where our node process gets killed if a subscription has been idle for 60 minutes - without any logs.

@murgatroid99

Can you clarify when you restarted the process relative to the final timestamped log message? Was the log cut off because you restarted, or was there a time gap after the last log message and before the manual restart?

@murgatroid99

The log you posted shows that the grpc library establishes a connection and some requests are made, then exactly an hour later the connection drops and a new connection is established (this is a normal thing that the grpc library is designed to handle). Then, right after the new connection is established, the log cuts off, saying that the process was restarted. I'm wondering if you can see timestamps for when GKE restarts the process, and when that is relative to the log lines in the file.

The reason is that I'm wondering if the problem I should be looking for is more like "something weird happened while reconnecting and that triggered GKE to restart it" or "the process entered an infinite loop without logging anything, and GKE eventually detected it".

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

Can you clarify when you restarted the process relative to the final timestamped log message? Was the log cut off because you restarted, or was there a time gap after the last log message and before the manual restart?

I only restarted about 20 mins after the last log message (see the graph in the original post, where the bump in CPU usage goes down again).

Between the last log message in my gist and the restart there were no further log messages retrievable from kubectl or Google Cloud Logging, and the process consumed all the CPU that Kubernetes allowed.

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

The log you posted shows that the grpc library establishes a connection and some requests are made, then exactly an hour later the connection drops and a new connection is established (this is a normal thing that the grpc library is designed to handle). Then, right after the new connection is established, the log cuts off, saying that the process was restarted. I'm wondering if you can see timestamps for when GKE restarts the process, and when that is relative to the log lines in the file.

The reason is that I'm wondering if the problem I should be looking for is more like "something weird happened while reconnecting and that triggered GKE to restart it" or "the process entered an infinite loop without logging anything, and GKE eventually detected it".

My observation is: after 60 minutes of repeated log output every 30 seconds, all log output stops and the node process jumps to maximum available CPU usage for no apparent reason. The log output in the gist shows everything retrieved from that pod, which kept running for about 20 minutes after the last line of logs.

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

Just to make sure we're not missing any details, our initialization logic looks like this:

// Imports added for completeness; 'getenv' is assumed to be the getenv npm package.
const { PubSub } = require('@google-cloud/pubsub');
const getenv = require('getenv');

const pubsub = new PubSub();
const subscription = pubsub.subscription(subscriptionName, {
  flowControl: {
    maxBytes: Math.floor(getenv.int('MEMORY_LIMIT_BYTES') * 0.5),
  },
});
subscription.on('message', handleMessage);

where MEMORY_LIMIT_BYTES comes from the Kubernetes pod spec resource limits.
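
Not shown in the snippet above, but worth sketching for a setup like this: an 'error' listener on the subscription surfaces stream-level failures that otherwise only show up in the gRPC debug logs, and close() lets the pod shut the stream down cleanly on termination.

// Hypothetical extension of the snippet above (not part of the original code):
// log subscriber stream errors and close the stream when the pod is terminated.
subscription.on('error', (err) => {
  console.error('Subscription error:', err);
});

process.on('SIGTERM', async () => {
  await subscription.close();
  process.exit(0);
});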

@ctavan
Contributor Author

ctavan commented Feb 20, 2020

@murgatroid99 I tried it with earlier versions of @grpc/grpc-js and the problem only appears with @grpc/grpc-js@0.6.16. It did not appear with 0.6.12, 0.6.14 and 0.6.15.

I have updated the gist to include the log output of the same pod with the earlier versions of grpc-js.

@murgatroid99

I have published @grpc/grpc-js version 0.6.17 with a possible mitigation for the issue and some more logging that should hopefully be relevant. Can you try that out and share the resulting logs?

@ctavan
Contributor Author

ctavan commented Feb 21, 2020

Whoops, I'm having trouble downloading the logs; there are megabytes of:

2020-02-21T20:23:04.844Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.861Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.880Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.884Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.889Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.904Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined
2020-02-21T20:23:04.923Z | channel | Failed to start call on picked subchannel 74.125.140.95:443. Retrying pick
2020-02-21T20:23:04.940Z | channel | Pick result: COMPLETE subchannel: 74.125.140.95:443 status: undefined undefined

during the period where the process runs with maxed out CPU. So basically tryPick seems to run in an infinite loop.

I can highly recommend not running @grpc/grpc-js version 0.6.17 with debug flags enabled 😉

(Edited to include multiple subsequent log outputs to illustrate that they are only milliseconds apart)

@ctavan
Contributor Author

ctavan commented Feb 21, 2020

@murgatroid99 I managed to download 140MB of logs from Google Cloud Logging and put the part until the infinite loop begins into the gist.

@feywind
Collaborator

feywind commented Feb 21, 2020

@ctavan Thanks for the continued debug info! This is actually starting to sound like something we had been discussing as a possible cause. Basically, after some time (I think it might've been 30 minutes) the stream is closed, and we attempt to reconnect it. If something prevents that connection, it's not supposed to busy-loop reconnecting, but that might be what it's doing.

A gRPC fix might be useful if that's what it is (this guy - grpc/grpc-node#1271) but we also talked about putting in a workaround in the nodejs-pubsub library if that proves more elusive.

@murgatroid99

I had a feeling that the problem was something like that, but I tried to mitigate a slightly different potential issue. I didn't realize it would loop that tightly and generate that many log lines, though I do want to note that in normal circumstances, when you're not encountering this bug, you'll usually get about one extra log line per request.

What's happening here is that a connection has been established, but something is in an invalid state that is preventing it from actually starting a stream, so it's just repeatedly retrying. The reason it only happens on 0.6.16 and above is that it's caused by grpc/grpc-node#1251, which was supposed to fix googleapis/nodejs-firestore#890. I'll try to get a new version out on Monday that fixes this. For now, if you can force it to use 0.6.15 you'll probably be fine.

@murgatroid99

I have now published @grpc/grpc-js version 0.6.18. I'm pretty sure I've fixed it this time, and if not I added some more relevant logging. Can you try that out?

@ctavan
Contributor Author

ctavan commented Feb 25, 2020

@murgatroid99 I've upgraded to @grpc/grpc-js@0.6.18 and the specific problem described in this issue is now gone. Great work!

For the sake of completeness I have added the log output of the pod (still running after 160 minutes) to the gist.

I'll leave this issue open in case you want to update the transitive dependencies of @google-cloud/pubsub so that it pulls in at least @grpc/grpc-js 0.6.18.
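
One quick way to confirm which grpc-js version a deployment actually resolves at runtime (a sketch; it assumes the package's package.json can be require()d):

// Log the @grpc/grpc-js version that Node actually resolved, e.g. at startup,
// to confirm the transitive dependency was bumped as expected.
const grpcJsVersion = require('@grpc/grpc-js/package.json').version;
console.log(`Resolved @grpc/grpc-js version: ${grpcJsVersion}`);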

@feywind
Collaborator

feywind commented Feb 25, 2020

@ctavan I will take a look at updating the versions in nodejs-pubsub. Thank you for all the help in testing this stuff!

@jperasmus

Had a similar issue and switched to the older C++ grpc client in the meantime. It seems to be behaving.

@feywind
Collaborator

feywind commented May 5, 2020

@jperasmus Have you tried out the latest version of the library lately, which pulls in grpc-js 0.6.18? That library recently released 1.x, and I'm planning to pull that in after pubsub has its 2.0 release.

@jperasmus

@feywind I did try with the latest version at the time, which was after Christoph posted about 0.6.18. I will try with the latest v1 grpc-js client on one of our test environments after v2 of this library includes it.

@feywind
Collaborator

feywind commented Jul 21, 2020

@jperasmus Checking back in - have you had better luck with the grpc-js client 1.x? That should be picked up here now (by way of gax updates).

@jperasmus

Hi @feywind, thanks for following up. I actually moved to a different company where we are not using this package anymore, but before I left, I did run an experiment for almost 2 weeks with the v1 grpc-js client on a test environment and it was substantially more stable than the old grpc client. Basically, where the grpc client would throw at least one connection issue per day in production, the grpc-js client had no connection issues in the test environment under similar, but slightly less, load than production. As far as I am aware, grpc-js is now running in production for the old company.

yoshi-automation added the 🚨 This issue needs some love. label Aug 17, 2020
@feywind
Collaborator

feywind commented Aug 18, 2020

Okay, cool! If anyone is still running into this with grpc-js, feel free to re-open.

feywind closed this as completed Aug 18, 2020
feywind removed the 🚨 This issue needs some love. label Aug 18, 2020